Thread: [BUG] failed assertion in EnsurePortalSnapshotExists()
Hi hackers, I recently observed a failed assertion in EnsurePortalSnapshotExists(). The steps to reproduce the issue on the master branch are: create table bdt (a int primary key); insert into bdt values (1),(2); create table bdt2 (a int); insert into bdt2 values (1); Then launching: DO $$ BEGIN FOR i IN 1..2 LOOP BEGIN INSERT INTO bdt (a) VALUES (i); exception when unique_violation then update bdt2 set a = i; COMMIT; END; END LOOP; END; $$; Would produce: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost Due to: #2 0x0000000000b2ffcb in ExceptionalCondition (conditionName=0xd0d598 "portal->portalSnapshot == NULL", errorType=0xd0d0b3 "FailedAssertion", fileName=0xd0d174 "pquery.c", lineNumber=1785) at assert.c:69 #3 0x000000000099e666 in EnsurePortalSnapshotExists () at pquery.c:1785 From what i have seen, we end up having ActiveSnapshot set to NULL in AtSubAbort_Snapshot() (while we still have ActivePortal->portalSnapshot not being NULL and not set to NULL later on). That leads to ActiveSnapshotSet() not being true in the next call to EnsurePortalSnapshotExists() and leads to the failed assertion (checking that ActivePortal->portalSnapshot is NULL) later on in the code. Based on this, i have created the attached patch (which fixes the issue mentioned in the repro) but I have the feeling that I may have missed something more important here (that would not be addressed with the attached patch), thoughts? Thanks Bertrand
Attachment
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes: > I recently observed a failed assertion in EnsurePortalSnapshotExists(). Hmm, interesting. If I take out the "update bdt2" step, so that the exception clause is just COMMIT, then I get something different: ERROR: portal snapshots (1) did not account for all active snapshots (0) CONTEXT: PL/pgSQL function inline_code_block line 8 at COMMIT I think perhaps plpgsql's exception handling needs a bit of adjustment, but not sure what yet. regards, tom lane
Hi, On 9/27/21 9:44 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bdrouvot@amazon.com> writes: >> I recently observed a failed assertion in EnsurePortalSnapshotExists(). > Hmm, interesting. Thanks for looking at it! > If I take out the "update bdt2" step, so that the > exception clause is just COMMIT, then I get something different: > > ERROR: portal snapshots (1) did not account for all active snapshots (0) > CONTEXT: PL/pgSQL function inline_code_block line 8 at COMMIT FWIW, I just gave it a try and it looks like this is also "fixed" by the proposed patch. Does it make sense (as it is currently) to set the ActiveSnapshot to NULL and not ensuring the same is done for ActivePortal->portalSnapshot? Or does it mean we should not reach a state where we set ActiveSnapshot to NULL while ActivePortal->portalSnapshot is not already NULL? Thanks Bertrand
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes: > Does it make sense (as it is currently) to set the ActiveSnapshot to > NULL and not ensuring the same is done for ActivePortal->portalSnapshot? I think that patch is just a kluge :-( After tracing through this I've figured out what is happening, and why you need to put the exception block inside a loop to make it happen. The first iteration goes fine, and we end up with an empty ActiveSnapshot stack (and no portalSnapshots) because that's how the COMMIT leaves it. In the second iteration, we try to re-establish the portal snapshot, but at the point where EnsurePortalSnapshotExists is called (from the first INSERT command) we are already inside a subtransaction thanks to the plpgsql exception block. This means that the stacked ActiveSnapshot has as_level = 2, although the Portal owning it belongs to the outer transaction. So at the next exception, AtSubAbort_Snapshot zaps the stacked ActiveSnapshot, but the Portal stays alive and now it has a dangling portalSnapshot pointer. Basically there seem to be two ways to fix this, both requiring API additions to snapmgr.c (hence, cc'ing Alvaro for opinion): 1. Provide a variant of PushActiveSnapshot that allows the caller to specify the as_level to use, and then have EnsurePortalSnapshotExists, as well as other places that create Portal-associated snapshots, use this with as_level equal to the Portal's createSubid. The main hazard I see here is that this could theoretically allow the ActiveSnapshot stack to end up with out-of-order as_level values, causing issues. For the moment we could probably just add assertions to check that the passed as_level is >= next level, or maybe even that this variant is only called with empty ActiveSnapshot stack. 2. Provide a way for AtSubAbort_Portals to detect whether a portalSnapshot pointer points to a snap that's going to be deleted by AtSubAbort_Snapshot, and then just have it clear any portalSnapshots that are about to become dangling. (This'd amount to accepting the possibility that portalSnapshot is of a different subxact level from the portal, and dealing with the situation.) I initially thought #2 would be the way to go, but it turns out to be a bit messy since what we have is a Snapshot pointer not an ActiveSnapshotElt pointer. We'd have to do something like search the ActiveSnapshot stack looking for pointer equality to the caller's Snapshot pointer, which seems fragile --- do we assume as_snap is unique for any other purpose? That being the case, I'm now leaning to #1. Thoughts? regards, tom lane
Hi, On 9/28/21 6:50 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bdrouvot@amazon.com> writes: >> Does it make sense (as it is currently) to set the ActiveSnapshot to >> NULL and not ensuring the same is done for ActivePortal->portalSnapshot? > I think that patch is just a kluge :-( right, sounded too simple to be "true". > > After tracing through this I've figured out what is happening, and > why you need to put the exception block inside a loop to make it > happen. The first iteration goes fine, and we end up with an empty > ActiveSnapshot stack (and no portalSnapshots) because that's how > the COMMIT leaves it. In the second iteration, we try to > re-establish the portal snapshot, but at the point where > EnsurePortalSnapshotExists is called (from the first INSERT > command) we are already inside a subtransaction thanks to the > plpgsql exception block. This means that the stacked ActiveSnapshot > has as_level = 2, although the Portal owning it belongs to the > outer transaction. So at the next exception, AtSubAbort_Snapshot > zaps the stacked ActiveSnapshot, but the Portal stays alive and > now it has a dangling portalSnapshot pointer. Thanks for the explanation! > Basically there seem to be two ways to fix this, both requiring > API additions to snapmgr.c (hence, cc'ing Alvaro for opinion): > > 1. Provide a variant of PushActiveSnapshot that allows the caller > to specify the as_level to use, and then have > EnsurePortalSnapshotExists, as well as other places that create > Portal-associated snapshots, use this with as_level equal to the > Portal's createSubid. The main hazard I see here is that this could > theoretically allow the ActiveSnapshot stack to end up with > out-of-order as_level values, causing issues. For the moment we > could probably just add assertions to check that the passed as_level > is >= next level, or maybe even that this variant is only called with > empty ActiveSnapshot stack. I think we may get a non empty ActiveSnapshot stack with prepared statements, so tempted to do the assertion on the levels. > > 2. Provide a way for AtSubAbort_Portals to detect whether a > portalSnapshot pointer points to a snap that's going to be > deleted by AtSubAbort_Snapshot, and then just have it clear any > portalSnapshots that are about to become dangling. (This'd amount > to accepting the possibility that portalSnapshot is of a different > subxact level from the portal, and dealing with the situation.) > > I initially thought #2 would be the way to go, but it turns out > to be a bit messy since what we have is a Snapshot pointer not an > ActiveSnapshotElt pointer. We'd have to do something like search the > ActiveSnapshot stack looking for pointer equality to the caller's > Snapshot pointer, which seems fragile --- do we assume as_snap is > unique for any other purpose? > > That being the case, I'm now leaning to #1. Thoughts? I'm also inclined to #1. Please find attached a patch proposal for #1 that also adds a new test. Thanks Bertrand
Attachment
Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
Hi,
On 9/28/21 6:50 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> Does it make sense (as it is currently) to set the ActiveSnapshot to
>> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
> I think that patch is just a kluge :-(
right, sounded too simple to be "true".
>
> After tracing through this I've figured out what is happening, and
> why you need to put the exception block inside a loop to make it
> happen. The first iteration goes fine, and we end up with an empty
> ActiveSnapshot stack (and no portalSnapshots) because that's how
> the COMMIT leaves it. In the second iteration, we try to
> re-establish the portal snapshot, but at the point where
> EnsurePortalSnapshotExists is called (from the first INSERT
> command) we are already inside a subtransaction thanks to the
> plpgsql exception block. This means that the stacked ActiveSnapshot
> has as_level = 2, although the Portal owning it belongs to the
> outer transaction. So at the next exception, AtSubAbort_Snapshot
> zaps the stacked ActiveSnapshot, but the Portal stays alive and
> now it has a dangling portalSnapshot pointer.
Thanks for the explanation!
> Basically there seem to be two ways to fix this, both requiring
> API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):
>
> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid. The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues. For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.
I think we may get a non empty ActiveSnapshot stack with prepared
statements, so tempted to do the assertion on the levels.
>
> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling. (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
>
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer. We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?
>
> That being the case, I'm now leaning to #1. Thoughts?
I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?
PushActiveSnapshot(Snapshot snap, int as_level);
Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);
regards,
Ranier Vilela
Hi,
On 9/29/21 12:59 PM, Ranier Vilela wrote:
Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?Wouldn't one function be better?PushActiveSnapshot(Snapshot snap, int as_level);Sample calls:PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);
I would say because that could "break" existing extensions for example.
Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().
Thanks
Bertrand
Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
Hi,
On 9/29/21 12:59 PM, Ranier Vilela wrote:Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:I'm also inclined to #1.I have a stupid question, why duplicate PushActiveSnapshot?Wouldn't one function be better?PushActiveSnapshot(Snapshot snap, int as_level);Sample calls:PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);I would say because that could "break" existing extensions for example.
Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().
Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code unnecessarily, IMHO.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code unnecessarily, IMHO.
regards,
Ranier Vilela
On 9/29/21 1:23 PM, Ranier Vilela wrote:
Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:Hi,
On 9/29/21 12:59 PM, Ranier Vilela wrote:Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:I'm also inclined to #1.I have a stupid question, why duplicate PushActiveSnapshot?Wouldn't one function be better?PushActiveSnapshot(Snapshot snap, int as_level);Sample calls:PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);I would say because that could "break" existing extensions for example.
Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().
Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code unnecessarily, IMHO.
Right. I don't have a strong opinion about this.
Let's see what Tom, Alvaro or others arguments/opinions are (should they also want to go with option #1).
Thanks
Bertrand
On 2021-Sep-28, Tom Lane wrote: > 1. Provide a variant of PushActiveSnapshot that allows the caller > to specify the as_level to use, and then have > EnsurePortalSnapshotExists, as well as other places that create > Portal-associated snapshots, use this with as_level equal to the > Portal's createSubid. The main hazard I see here is that this could > theoretically allow the ActiveSnapshot stack to end up with > out-of-order as_level values, causing issues. For the moment we > could probably just add assertions to check that the passed as_level > is >= next level, or maybe even that this variant is only called with > empty ActiveSnapshot stack. I don't see anything wrong with this idea offhand. I didn't try to create scenarios with out-of-order as_level active snapshots, but with all the creativity out there in the world, and considering that it's possible to write procedures in C, I think that asserting that the order is maintained is warranted. Now if we do meet a case with out-of-order levels, what do we do? I suppose we'd need to update AtSubCommit_Snapshot and AtSubAbort_Snapshot to cope with that (but by all means let's wait until we have a test case where that happens ...) > 2. Provide a way for AtSubAbort_Portals to detect whether a > portalSnapshot pointer points to a snap that's going to be > deleted by AtSubAbort_Snapshot, and then just have it clear any > portalSnapshots that are about to become dangling. (This'd amount > to accepting the possibility that portalSnapshot is of a different > subxact level from the portal, and dealing with the situation.) > > I initially thought #2 would be the way to go, but it turns out > to be a bit messy since what we have is a Snapshot pointer not an > ActiveSnapshotElt pointer. We'd have to do something like search the > ActiveSnapshot stack looking for pointer equality to the caller's > Snapshot pointer, which seems fragile --- do we assume as_snap is > unique for any other purpose? I don't remember what patch it was, but I remember contemplating sometime during the past year the possibility of snapshots being used twice in the active stack. Now maybe this is not possible in practice because in most cases we create a copy, but I couldn't swear that that's always the case. I wouldn't rely on that. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)
On 2021-Sep-29, Ranier Vilela wrote: > Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> > escreveu: > > Adding a new function prevents "updating" existing extensions making use > > of PushActiveSnapshot(). > > > Valid argument of course. > But the extensions should also fit the core code. > Duplicating functions is very bad for maintenance and bloats the code > unnecessarily, IMHO. Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are updated in the patch. Given that six sevenths of the calls continue to use the existing function and that it is less verbose than the new one, that seems sufficient argument to keep it. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "How amazing is that? I call it a night and come back to find that a bug has been identified and patched while I sleep." (Robert Davidson) http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2021-Sep-29, Ranier Vilela wrote: >> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> >> escreveu: >> Duplicating functions is very bad for maintenance and bloats the code >> unnecessarily, IMHO. > Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are > updated in the patch. Given that six sevenths of the calls continue to > use the existing function and that it is less verbose than the new one, > that seems sufficient argument to keep it. Seeing that we have to back-patch this, changing the ABI of PushActiveSnapshot seems like a complete non-starter. The idea I'd had to avoid code duplication was to make PushActiveSnapshot a wrapper for the extended function: void PushActiveSnapshot(Snapshot snap) { PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel()); } This would add one function call to the common code path, but there are enough function calls in PushActiveSnapshot that I don't think that's a big concern. Another point is that this'd also add the as_level ordering assertion to the common code path, but on the whole I think that's good not bad. BTW, this is not great code: + if (ActiveSnapshot != NULL && ActiveSnapshot->as_next != NULL) + Assert(as_level >= ActiveSnapshot->as_next->as_level); You want it all wrapped in the Assert, so that there's not any code left in a non-assert build (which the compiler may or may not optimize away, perhaps after complaining about a side-effect-free statement). Actually, it's plain wrong, because you should be looking at the top as_level not the next one. So more like Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level); regards, tom lane
Em qua., 29 de set. de 2021 às 15:01, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2021-Sep-29, Ranier Vilela wrote:
>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
>> escreveu:
>> Duplicating functions is very bad for maintenance and bloats the code
>> unnecessarily, IMHO.
> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
> updated in the patch. Given that six sevenths of the calls continue to
> use the existing function and that it is less verbose than the new one,
> that seems sufficient argument to keep it.
Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.
The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:
void
PushActiveSnapshot(Snapshot snap)
{
PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}
Much better.
regards,
Ranier Vilela
On 9/29/21 8:01 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> On 2021-Sep-29, Ranier Vilela wrote: >>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> >>> escreveu: >>> Duplicating functions is very bad for maintenance and bloats the code >>> unnecessarily, IMHO. >> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are >> updated in the patch. Given that six sevenths of the calls continue to >> use the existing function and that it is less verbose than the new one, >> that seems sufficient argument to keep it. > Seeing that we have to back-patch this, changing the ABI of > PushActiveSnapshot seems like a complete non-starter. > > The idea I'd had to avoid code duplication was to make > PushActiveSnapshot a wrapper for the extended function: > > void > PushActiveSnapshot(Snapshot snap) > { > PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel()); > } Implemented into the new attached patch. > So more like > > Assert(ActiveSnapshot == NULL || > snap_level >= ActiveSnapshot->as_level); Implemented into the new attached patch. But make check is now failing on join_hash.sql, I have been able to repro with: create table bdt (a int); begin; savepoint a; rollback to a; explain select count(*) from bdt; Which triggers a failed assertion on the new one: TRAP: FailedAssertion("ActiveSnapshot == NULL || as_level >= ActiveSnapshot->as_level" because we have as_level = 2 while ActiveSnapshot->as_level = 3. Thanks Bertrand
Attachment
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes: > But make check is now failing on join_hash.sql, I have been able to > repro with: Oh, duh, should have thought a bit harder. createSubid is a sequential subtransaction number; it's not the same as the as_level nesting level. Probably the most effective way to handle this is to add a subtransaction nesting-level field to struct Portal, so we can pass that. I don't recall that xact.c provides any easy way to extract the nesting level of a subtransaction that's not the most closely nested one. regards, tom lane
On 9/29/21 10:11 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bdrouvot@amazon.com> writes: >> But make check is now failing on join_hash.sql, I have been able to >> repro with: > Oh, duh, should have thought a bit harder. createSubid is a sequential > subtransaction number; it's not the same as the as_level nesting level. Oh right, thanks for the explanation. > > Probably the most effective way to handle this is to add a subtransaction > nesting-level field to struct Portal, so we can pass that. Agree, done that way in the new attached patch. Thanks Bertrand
Attachment
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes: > [ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ] Looking through this, I think you were overenthusiastic about applying PushActiveSnapshotWithLevel. We don't really need to use it except in the places where we're setting portalSnapshot, because other than those cases we don't have a risk of portalSnapshot becoming a dangling pointer. Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to snapshots that aren't created by the portal machinery itself, because we don't know very much about where passed-in snapshots came from or what the caller thinks their lifespan is. The attached revision therefore backs off to only using the new code in the two places where we really need it. I made a number of more-cosmetic revisions too. Notably, I think it's useful to frame the testing shortcoming as "we were not testing COMMIT/ROLLBACK inside a plpgsql exception block". So I moved the test code to the plpgsql tests and made it check ROLLBACK too. regards, tom lane PS: Memo to self: in the back branches, the new field has to be added at the end of struct Portal. diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 6597ec45a9..4cc38f0d85 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -4863,6 +4863,7 @@ CommitSubTransaction(void) AfterTriggerEndSubXact(true); AtSubCommit_Portals(s->subTransactionId, s->parent->subTransactionId, + s->parent->nestingLevel, s->parent->curTransactionOwner); AtEOSubXact_LargeObject(true, s->subTransactionId, s->parent->subTransactionId); diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c index b5797042ab..960f3fadce 100644 --- a/src/backend/tcop/pquery.c +++ b/src/backend/tcop/pquery.c @@ -480,7 +480,9 @@ PortalStart(Portal portal, ParamListInfo params, * We could remember the snapshot in portal->portalSnapshot, * but presently there seems no need to, as this code path * cannot be used for non-atomic execution. Hence there can't - * be any commit/abort that might destroy the snapshot. + * be any commit/abort that might destroy the snapshot. Since + * we don't do that, there's also no need to force a + * non-default nesting level for the snapshot. */ /* @@ -1136,9 +1138,15 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt, snapshot = RegisterSnapshot(snapshot); portal->holdSnapshot = snapshot; } - /* In any case, make the snapshot active and remember it in portal */ - PushActiveSnapshot(snapshot); - /* PushActiveSnapshot might have copied the snapshot */ + + /* + * In any case, make the snapshot active and remember it in portal. + * Because the portal now references the snapshot, we must tell + * snapmgr.c that the snapshot belongs to the portal's transaction + * level, else we risk portalSnapshot becoming a dangling pointer. + */ + PushActiveSnapshotWithLevel(snapshot, portal->createLevel); + /* PushActiveSnapshotWithLevel might have copied the snapshot */ portal->portalSnapshot = GetActiveSnapshot(); } else @@ -1784,8 +1792,13 @@ EnsurePortalSnapshotExists(void) elog(ERROR, "cannot execute SQL without an outer snapshot or portal"); Assert(portal->portalSnapshot == NULL); - /* Create a new snapshot and make it active */ - PushActiveSnapshot(GetTransactionSnapshot()); - /* PushActiveSnapshot might have copied the snapshot */ + /* + * Create a new snapshot, make it active, and remember it in portal. + * Because the portal now references the snapshot, we must tell snapmgr.c + * that the snapshot belongs to the portal's transaction level, else we + * risk portalSnapshot becoming a dangling pointer. + */ + PushActiveSnapshotWithLevel(GetTransactionSnapshot(), portal->createLevel); + /* PushActiveSnapshotWithLevel might have copied the snapshot */ portal->portalSnapshot = GetActiveSnapshot(); } diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c index 5c30e141f5..58674d611d 100644 --- a/src/backend/utils/mmgr/portalmem.c +++ b/src/backend/utils/mmgr/portalmem.c @@ -210,6 +210,7 @@ CreatePortal(const char *name, bool allowDup, bool dupSilent) portal->cleanup = PortalCleanup; portal->createSubid = GetCurrentSubTransactionId(); portal->activeSubid = portal->createSubid; + portal->createLevel = GetCurrentTransactionNestLevel(); portal->strategy = PORTAL_MULTI_QUERY; portal->cursorOptions = CURSOR_OPT_NO_SCROLL; portal->atStart = true; @@ -657,6 +658,7 @@ HoldPortal(Portal portal) */ portal->createSubid = InvalidSubTransactionId; portal->activeSubid = InvalidSubTransactionId; + portal->createLevel = 0; } /* @@ -940,6 +942,7 @@ PortalErrorCleanup(void) void AtSubCommit_Portals(SubTransactionId mySubid, SubTransactionId parentSubid, + int parentLevel, ResourceOwner parentXactOwner) { HASH_SEQ_STATUS status; @@ -954,6 +957,7 @@ AtSubCommit_Portals(SubTransactionId mySubid, if (portal->createSubid == mySubid) { portal->createSubid = parentSubid; + portal->createLevel = parentLevel; if (portal->resowner) ResourceOwnerNewParent(portal->resowner, parentXactOwner); } diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c index 2968c7f7b7..dca1bc8afc 100644 --- a/src/backend/utils/time/snapmgr.c +++ b/src/backend/utils/time/snapmgr.c @@ -678,10 +678,25 @@ FreeSnapshot(Snapshot snapshot) */ void PushActiveSnapshot(Snapshot snap) +{ + PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel()); +} + +/* + * PushActiveSnapshotWithLevel + * Set the given snapshot as the current active snapshot + * + * Same as PushActiveSnapshot except that caller can specify the + * transaction nesting level that "owns" the snapshot. This level + * must not be deeper than the current top of the snapshot stack. + */ +void +PushActiveSnapshotWithLevel(Snapshot snap, int snap_level) { ActiveSnapshotElt *newactive; Assert(snap != InvalidSnapshot); + Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level); newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt)); @@ -695,7 +710,7 @@ PushActiveSnapshot(Snapshot snap) newactive->as_snap = snap; newactive->as_next = ActiveSnapshot; - newactive->as_level = GetCurrentTransactionNestLevel(); + newactive->as_level = snap_level; newactive->as_snap->active_count++; diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h index 2e5bbdd0ec..37b1817ae9 100644 --- a/src/include/utils/portal.h +++ b/src/include/utils/portal.h @@ -130,6 +130,7 @@ typedef struct PortalData */ SubTransactionId createSubid; /* the creating subxact */ SubTransactionId activeSubid; /* the last subxact with activity */ + int createLevel; /* creating subxact's nesting level */ /* The query or queries the portal will execute */ const char *sourceText; /* text of query (as of 8.4, never NULL) */ @@ -219,6 +220,7 @@ extern void AtCleanup_Portals(void); extern void PortalErrorCleanup(void); extern void AtSubCommit_Portals(SubTransactionId mySubid, SubTransactionId parentSubid, + int parentLevel, ResourceOwner parentXactOwner); extern void AtSubAbort_Portals(SubTransactionId mySubid, SubTransactionId parentSubid, diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h index 44539fe15a..c6a176cc95 100644 --- a/src/include/utils/snapmgr.h +++ b/src/include/utils/snapmgr.h @@ -114,6 +114,7 @@ extern void InvalidateCatalogSnapshot(void); extern void InvalidateCatalogSnapshotConditionally(void); extern void PushActiveSnapshot(Snapshot snapshot); +extern void PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level); extern void PushCopiedSnapshot(Snapshot snapshot); extern void UpdateActiveSnapshotCommandId(void); extern void PopActiveSnapshot(void); diff --git a/src/pl/plpgsql/src/expected/plpgsql_transaction.out b/src/pl/plpgsql/src/expected/plpgsql_transaction.out index f79f847321..254e5b7a70 100644 --- a/src/pl/plpgsql/src/expected/plpgsql_transaction.out +++ b/src/pl/plpgsql/src/expected/plpgsql_transaction.out @@ -430,6 +430,34 @@ SELECT * FROM test1; ---+--- (0 rows) +-- test commit/rollback inside exception handler, too +TRUNCATE test1; +DO LANGUAGE plpgsql $$ +BEGIN + FOR i IN 1..10 LOOP + BEGIN + INSERT INTO test1 VALUES (i, 'good'); + INSERT INTO test1 VALUES (i/0, 'bad'); + EXCEPTION + WHEN division_by_zero THEN + INSERT INTO test1 VALUES (i, 'exception'); + IF (i % 3) > 0 THEN COMMIT; ELSE ROLLBACK; END IF; + END; + END LOOP; +END; +$$; +SELECT * FROM test1; + a | b +----+----------- + 1 | exception + 2 | exception + 4 | exception + 5 | exception + 7 | exception + 8 | exception + 10 | exception +(7 rows) + -- detoast result of simple expression after commit CREATE TEMP TABLE test4(f1 text); ALTER TABLE test4 ALTER COLUMN f1 SET STORAGE EXTERNAL; -- disable compression diff --git a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql index 888ddccace..8d76d00daa 100644 --- a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql +++ b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql @@ -354,6 +354,27 @@ $$; SELECT * FROM test1; +-- test commit/rollback inside exception handler, too +TRUNCATE test1; + +DO LANGUAGE plpgsql $$ +BEGIN + FOR i IN 1..10 LOOP + BEGIN + INSERT INTO test1 VALUES (i, 'good'); + INSERT INTO test1 VALUES (i/0, 'bad'); + EXCEPTION + WHEN division_by_zero THEN + INSERT INTO test1 VALUES (i, 'exception'); + IF (i % 3) > 0 THEN COMMIT; ELSE ROLLBACK; END IF; + END; + END LOOP; +END; +$$; + +SELECT * FROM test1; + + -- detoast result of simple expression after commit CREATE TEMP TABLE test4(f1 text); ALTER TABLE test4 ALTER COLUMN f1 SET STORAGE EXTERNAL; -- disable compression
On 9/30/21 7:16 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bdrouvot@amazon.com> writes: >> [ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ] > Looking through this, I think you were overenthusiastic about applying > PushActiveSnapshotWithLevel. We don't really need to use it except in > the places where we're setting portalSnapshot, because other than those > cases we don't have a risk of portalSnapshot becoming a dangling pointer. > Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to > snapshots that aren't created by the portal machinery itself, because > we don't know very much about where passed-in snapshots came from or > what the caller thinks their lifespan is. Oh right, I did not think about it, thanks! > > The attached revision therefore backs off to only using the new code > in the two places where we really need it. Ok, so in PortalRunUtility() and EnsurePortalSnapshotExists(). > I made a number of > more-cosmetic revisions too. thanks! > Notably, I think it's useful to frame > the testing shortcoming as "we were not testing COMMIT/ROLLBACK > inside a plpgsql exception block". So I moved the test code to the > plpgsql tests and made it check ROLLBACK too. Indeed, makes sense. > > regards, tom lane > > PS: Memo to self: in the back branches, the new field has to be > added at the end of struct Portal. out of curiosity, why? Thanks Bertrand
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes: > On 9/30/21 7:16 PM, Tom Lane wrote: >> PS: Memo to self: in the back branches, the new field has to be >> added at the end of struct Portal. > out of curiosity, why? Sticking it into the middle would create an ABI break for any extension code that's looking at struct Portal, due to movement of existing field offsets. In HEAD that's fine, so we should put the field where it makes the most sense. But we have to be careful about changing globally-visible structs in released branches. regards, tom lane
On 9/30/21 8:25 PM, Tom Lane wrote: > "Drouvot, Bertrand" <bdrouvot@amazon.com> writes: >> On 9/30/21 7:16 PM, Tom Lane wrote: >>> PS: Memo to self: in the back branches, the new field has to be >>> added at the end of struct Portal. >> out of curiosity, why? > Sticking it into the middle would create an ABI break for any > extension code that's looking at struct Portal, due to movement > of existing field offsets. In HEAD that's fine, so we should > put the field where it makes the most sense. But we have to > be careful about changing globally-visible structs in released > branches. Got it, thanks! Bertrand