Thread: [BUG] failed assertion in EnsurePortalSnapshotExists()

[BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
Hi hackers,

I recently observed a failed assertion in EnsurePortalSnapshotExists().

The steps to reproduce the issue on the master branch are:

create table bdt (a int primary key);
insert into bdt values (1),(2);
create table bdt2 (a int);
insert into bdt2 values (1);

Then launching:

DO $$
BEGIN
   FOR i IN 1..2 LOOP
     BEGIN
       INSERT INTO bdt (a) VALUES (i);
        exception when unique_violation then update bdt2 set a = i;
      COMMIT;
     END;
   END LOOP;
END;
$$;

Would produce:

server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
The connection to the server was lost

Due to:

#2  0x0000000000b2ffcb in ExceptionalCondition (conditionName=0xd0d598 
"portal->portalSnapshot == NULL", errorType=0xd0d0b3 "FailedAssertion", 
fileName=0xd0d174 "pquery.c", lineNumber=1785) at assert.c:69
#3  0x000000000099e666 in EnsurePortalSnapshotExists () at pquery.c:1785

 From what i have seen, we end up having ActiveSnapshot set to NULL in 
AtSubAbort_Snapshot() (while we still have ActivePortal->portalSnapshot 
not being NULL and not set to NULL later on).

That leads to ActiveSnapshotSet() not being true in the next call to 
EnsurePortalSnapshotExists() and leads to the failed assertion (checking 
that ActivePortal->portalSnapshot is NULL) later on in the code.

Based on this, i have created the attached patch (which fixes the issue 
mentioned in the repro) but I have the feeling that I may have missed 
something more important here (that would not be addressed with the 
attached patch), thoughts?

Thanks

Bertrand

Attachment

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
> I recently observed a failed assertion in EnsurePortalSnapshotExists().

Hmm, interesting.  If I take out the "update bdt2" step, so that the
exception clause is just COMMIT, then I get something different:

ERROR:  portal snapshots (1) did not account for all active snapshots (0)
CONTEXT:  PL/pgSQL function inline_code_block line 8 at COMMIT

I think perhaps plpgsql's exception handling needs a bit of adjustment,
but not sure what yet.

            regards, tom lane



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
Hi,

On 9/27/21 9:44 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> I recently observed a failed assertion in EnsurePortalSnapshotExists().
> Hmm, interesting.
Thanks for looking at it!
>   If I take out the "update bdt2" step, so that the
> exception clause is just COMMIT, then I get something different:
>
> ERROR:  portal snapshots (1) did not account for all active snapshots (0)
> CONTEXT:  PL/pgSQL function inline_code_block line 8 at COMMIT

FWIW, I just gave it a try and it looks like this is also "fixed" by the 
proposed patch.

Does it make sense (as it is currently) to set the ActiveSnapshot to 
NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

Or does it mean we should not reach a state where we set ActiveSnapshot 
to NULL while ActivePortal->portalSnapshot is not already NULL?

Thanks

Bertrand




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
> Does it make sense (as it is currently) to set the ActiveSnapshot to 
> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?

I think that patch is just a kluge :-(

After tracing through this I've figured out what is happening, and
why you need to put the exception block inside a loop to make it
happen.  The first iteration goes fine, and we end up with an empty
ActiveSnapshot stack (and no portalSnapshots) because that's how
the COMMIT leaves it.  In the second iteration, we try to
re-establish the portal snapshot, but at the point where
EnsurePortalSnapshotExists is called (from the first INSERT
command) we are already inside a subtransaction thanks to the
plpgsql exception block.  This means that the stacked ActiveSnapshot
has as_level = 2, although the Portal owning it belongs to the
outer transaction.  So at the next exception, AtSubAbort_Snapshot
zaps the stacked ActiveSnapshot, but the Portal stays alive and
now it has a dangling portalSnapshot pointer.

Basically there seem to be two ways to fix this, both requiring
API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):

1. Provide a variant of PushActiveSnapshot that allows the caller
to specify the as_level to use, and then have
EnsurePortalSnapshotExists, as well as other places that create
Portal-associated snapshots, use this with as_level equal to the
Portal's createSubid.  The main hazard I see here is that this could
theoretically allow the ActiveSnapshot stack to end up with
out-of-order as_level values, causing issues.  For the moment we
could probably just add assertions to check that the passed as_level
is >= next level, or maybe even that this variant is only called with
empty ActiveSnapshot stack.

2. Provide a way for AtSubAbort_Portals to detect whether a
portalSnapshot pointer points to a snap that's going to be
deleted by AtSubAbort_Snapshot, and then just have it clear any
portalSnapshots that are about to become dangling.  (This'd amount
to accepting the possibility that portalSnapshot is of a different
subxact level from the portal, and dealing with the situation.)

I initially thought #2 would be the way to go, but it turns out
to be a bit messy since what we have is a Snapshot pointer not an
ActiveSnapshotElt pointer.  We'd have to do something like search the
ActiveSnapshot stack looking for pointer equality to the caller's
Snapshot pointer, which seems fragile --- do we assume as_snap is
unique for any other purpose?

That being the case, I'm now leaning to #1.  Thoughts?

            regards, tom lane



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
Hi,

On 9/28/21 6:50 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> Does it make sense (as it is currently) to set the ActiveSnapshot to
>> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
> I think that patch is just a kluge :-(
right, sounded too simple to be "true".
>
> After tracing through this I've figured out what is happening, and
> why you need to put the exception block inside a loop to make it
> happen.  The first iteration goes fine, and we end up with an empty
> ActiveSnapshot stack (and no portalSnapshots) because that's how
> the COMMIT leaves it.  In the second iteration, we try to
> re-establish the portal snapshot, but at the point where
> EnsurePortalSnapshotExists is called (from the first INSERT
> command) we are already inside a subtransaction thanks to the
> plpgsql exception block.  This means that the stacked ActiveSnapshot
> has as_level = 2, although the Portal owning it belongs to the
> outer transaction.  So at the next exception, AtSubAbort_Snapshot
> zaps the stacked ActiveSnapshot, but the Portal stays alive and
> now it has a dangling portalSnapshot pointer.

Thanks for the explanation!

> Basically there seem to be two ways to fix this, both requiring
> API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):
>
> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid.  The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues.  For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.
I think we may get a non empty ActiveSnapshot stack with prepared 
statements, so tempted to do the assertion on the levels.
>
> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling.  (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
>
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer.  We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?
>
> That being the case, I'm now leaning to #1.  Thoughts?

I'm also inclined to #1.

Please find attached a patch proposal for #1 that also adds a new test.

Thanks

Bertrand

Attachment

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Ranier Vilela
Date:
Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
Hi,

On 9/28/21 6:50 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> Does it make sense (as it is currently) to set the ActiveSnapshot to
>> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
> I think that patch is just a kluge :-(
right, sounded too simple to be "true".
>
> After tracing through this I've figured out what is happening, and
> why you need to put the exception block inside a loop to make it
> happen.  The first iteration goes fine, and we end up with an empty
> ActiveSnapshot stack (and no portalSnapshots) because that's how
> the COMMIT leaves it.  In the second iteration, we try to
> re-establish the portal snapshot, but at the point where
> EnsurePortalSnapshotExists is called (from the first INSERT
> command) we are already inside a subtransaction thanks to the
> plpgsql exception block.  This means that the stacked ActiveSnapshot
> has as_level = 2, although the Portal owning it belongs to the
> outer transaction.  So at the next exception, AtSubAbort_Snapshot
> zaps the stacked ActiveSnapshot, but the Portal stays alive and
> now it has a dangling portalSnapshot pointer.

Thanks for the explanation!

> Basically there seem to be two ways to fix this, both requiring
> API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):
>
> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid.  The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues.  For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.
I think we may get a non empty ActiveSnapshot stack with prepared
statements, so tempted to do the assertion on the levels.
>
> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling.  (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
>
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer.  We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?
>
> That being the case, I'm now leaning to #1.  Thoughts?

I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

regards,
Ranier Vilela

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:

Hi,

On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for example.

Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().

Thanks

Bertrand

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Ranier Vilela
Date:
Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:

Hi,

On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for example.

Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().

Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code unnecessarily, IMHO.

regards,
Ranier Vilela

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:


On 9/29/21 1:23 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:

Hi,

On 9/29/21 12:59 PM, Ranier Vilela wrote:

Em qua., 29 de set. de 2021 às 06:55, Drouvot, Bertrand <bdrouvot@amazon.com> escreveu:
I'm also inclined to #1.
I have a stupid question, why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

I would say because that could "break" existing extensions for example.

Adding a new function prevents "updating" existing extensions making use of PushActiveSnapshot().

Valid argument of course.
But the extensions should also fit the core code.
Duplicating functions is very bad for maintenance and bloats the code unnecessarily, IMHO.

Right. I don't have a strong opinion about this.

Let's see what Tom, Alvaro or others arguments/opinions are (should they also want to go with option #1).

Thanks

Bertrand

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Alvaro Herrera
Date:
On 2021-Sep-28, Tom Lane wrote:

> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid.  The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues.  For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.

I don't see anything wrong with this idea offhand.

I didn't try to create scenarios with out-of-order as_level active
snapshots, but with all the creativity out there in the world, and
considering that it's possible to write procedures in C, I think that
asserting that the order is maintained is warranted.

Now if we do meet a case with out-of-order levels, what do we do?  I
suppose we'd need to update AtSubCommit_Snapshot and AtSubAbort_Snapshot
to cope with that (but by all means let's wait until we have a test case
where that happens ...)

> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling.  (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
> 
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer.  We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?

I don't remember what patch it was, but I remember contemplating
sometime during the past year the possibility of snapshots being used
twice in the active stack.  Now maybe this is not possible in practice
because in most cases we create a copy, but I couldn't swear that that's
always the case.  I wouldn't rely on that.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Alvaro Herrera
Date:
On 2021-Sep-29, Ranier Vilela wrote:

> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
> escreveu:

> > Adding a new function prevents "updating" existing extensions making use
> > of PushActiveSnapshot().
> >
> Valid argument of course.
> But the extensions should also fit the core code.
> Duplicating functions is very bad for maintenance and bloats the code
> unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch.  Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."                (Robert Davidson)
               http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2021-Sep-29, Ranier Vilela wrote:
>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
>> escreveu:
>> Duplicating functions is very bad for maintenance and bloats the code
>> unnecessarily, IMHO.

> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
> updated in the patch.  Given that six sevenths of the calls continue to
> use the existing function and that it is less verbose than the new one,
> that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
    PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

This would add one function call to the common code path, but there
are enough function calls in PushActiveSnapshot that I don't think
that's a big concern.

Another point is that this'd also add the as_level ordering assertion
to the common code path, but on the whole I think that's good not bad.

BTW, this is not great code:

+    if (ActiveSnapshot != NULL && ActiveSnapshot->as_next != NULL)
+        Assert(as_level >= ActiveSnapshot->as_next->as_level);

You want it all wrapped in the Assert, so that there's not any code
left in a non-assert build (which the compiler may or may not optimize
away, perhaps after complaining about a side-effect-free statement).

Actually, it's plain wrong, because you should be looking at the
top as_level not the next one.  So more like

    Assert(ActiveSnapshot == NULL ||
           snap_level >= ActiveSnapshot->as_level);

            regards, tom lane



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Ranier Vilela
Date:
Em qua., 29 de set. de 2021 às 15:01, Tom Lane <tgl@sss.pgh.pa.us> escreveu:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2021-Sep-29, Ranier Vilela wrote:
>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
>> escreveu:
>> Duplicating functions is very bad for maintenance and bloats the code
>> unnecessarily, IMHO.

> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
> updated in the patch.  Given that six sevenths of the calls continue to
> use the existing function and that it is less verbose than the new one,
> that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
    PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

Much better.

regards,
Ranier Vilela

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
On 9/29/21 8:01 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> On 2021-Sep-29, Ranier Vilela wrote:
>>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <bdrouvot@amazon.com>
>>> escreveu:
>>> Duplicating functions is very bad for maintenance and bloats the code
>>> unnecessarily, IMHO.
>> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
>> updated in the patch.  Given that six sevenths of the calls continue to
>> use the existing function and that it is less verbose than the new one,
>> that seems sufficient argument to keep it.
> Seeing that we have to back-patch this, changing the ABI of
> PushActiveSnapshot seems like a complete non-starter.
>
> The idea I'd had to avoid code duplication was to make
> PushActiveSnapshot a wrapper for the extended function:
>
> void
> PushActiveSnapshot(Snapshot snap)
> {
>      PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
> }

Implemented into the new attached patch.

> So more like
>
>      Assert(ActiveSnapshot == NULL ||
>             snap_level >= ActiveSnapshot->as_level);

Implemented into the new attached patch.

But make check is now failing on join_hash.sql, I have been able to 
repro with:

create table bdt (a int);
begin;
savepoint a;
rollback to a;
explain select count(*) from bdt;

Which triggers a failed assertion on the new one:

TRAP: FailedAssertion("ActiveSnapshot == NULL || as_level >= 
ActiveSnapshot->as_level"

because we have as_level = 2 while ActiveSnapshot->as_level = 3.

Thanks

Bertrand

Attachment

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
> But make check is now failing on join_hash.sql, I have been able to 
> repro with:

Oh, duh, should have thought a bit harder.  createSubid is a sequential
subtransaction number; it's not the same as the as_level nesting level.

Probably the most effective way to handle this is to add a subtransaction
nesting-level field to struct Portal, so we can pass that.  I don't recall
that xact.c provides any easy way to extract the nesting level of a
subtransaction that's not the most closely nested one.

            regards, tom lane



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
On 9/29/21 10:11 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> But make check is now failing on join_hash.sql, I have been able to
>> repro with:
> Oh, duh, should have thought a bit harder.  createSubid is a sequential
> subtransaction number; it's not the same as the as_level nesting level.
Oh right, thanks for the explanation.
>
> Probably the most effective way to handle this is to add a subtransaction
> nesting-level field to struct Portal, so we can pass that.

Agree, done that way in the new attached patch.

Thanks

Bertrand

Attachment

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
> [ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ]

Looking through this, I think you were overenthusiastic about applying
PushActiveSnapshotWithLevel.  We don't really need to use it except in
the places where we're setting portalSnapshot, because other than those
cases we don't have a risk of portalSnapshot becoming a dangling pointer.
Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to
snapshots that aren't created by the portal machinery itself, because
we don't know very much about where passed-in snapshots came from or
what the caller thinks their lifespan is.

The attached revision therefore backs off to only using the new code
in the two places where we really need it.  I made a number of
more-cosmetic revisions too.  Notably, I think it's useful to frame
the testing shortcoming as "we were not testing COMMIT/ROLLBACK
inside a plpgsql exception block".  So I moved the test code to the
plpgsql tests and made it check ROLLBACK too.

            regards, tom lane

PS: Memo to self: in the back branches, the new field has to be
added at the end of struct Portal.

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6597ec45a9..4cc38f0d85 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4863,6 +4863,7 @@ CommitSubTransaction(void)
     AfterTriggerEndSubXact(true);
     AtSubCommit_Portals(s->subTransactionId,
                         s->parent->subTransactionId,
+                        s->parent->nestingLevel,
                         s->parent->curTransactionOwner);
     AtEOSubXact_LargeObject(true, s->subTransactionId,
                             s->parent->subTransactionId);
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index b5797042ab..960f3fadce 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -480,7 +480,9 @@ PortalStart(Portal portal, ParamListInfo params,
                  * We could remember the snapshot in portal->portalSnapshot,
                  * but presently there seems no need to, as this code path
                  * cannot be used for non-atomic execution.  Hence there can't
-                 * be any commit/abort that might destroy the snapshot.
+                 * be any commit/abort that might destroy the snapshot.  Since
+                 * we don't do that, there's also no need to force a
+                 * non-default nesting level for the snapshot.
                  */

                 /*
@@ -1136,9 +1138,15 @@ PortalRunUtility(Portal portal, PlannedStmt *pstmt,
             snapshot = RegisterSnapshot(snapshot);
             portal->holdSnapshot = snapshot;
         }
-        /* In any case, make the snapshot active and remember it in portal */
-        PushActiveSnapshot(snapshot);
-        /* PushActiveSnapshot might have copied the snapshot */
+
+        /*
+         * In any case, make the snapshot active and remember it in portal.
+         * Because the portal now references the snapshot, we must tell
+         * snapmgr.c that the snapshot belongs to the portal's transaction
+         * level, else we risk portalSnapshot becoming a dangling pointer.
+         */
+        PushActiveSnapshotWithLevel(snapshot, portal->createLevel);
+        /* PushActiveSnapshotWithLevel might have copied the snapshot */
         portal->portalSnapshot = GetActiveSnapshot();
     }
     else
@@ -1784,8 +1792,13 @@ EnsurePortalSnapshotExists(void)
         elog(ERROR, "cannot execute SQL without an outer snapshot or portal");
     Assert(portal->portalSnapshot == NULL);

-    /* Create a new snapshot and make it active */
-    PushActiveSnapshot(GetTransactionSnapshot());
-    /* PushActiveSnapshot might have copied the snapshot */
+    /*
+     * Create a new snapshot, make it active, and remember it in portal.
+     * Because the portal now references the snapshot, we must tell snapmgr.c
+     * that the snapshot belongs to the portal's transaction level, else we
+     * risk portalSnapshot becoming a dangling pointer.
+     */
+    PushActiveSnapshotWithLevel(GetTransactionSnapshot(), portal->createLevel);
+    /* PushActiveSnapshotWithLevel might have copied the snapshot */
     portal->portalSnapshot = GetActiveSnapshot();
 }
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 5c30e141f5..58674d611d 100644
--- a/src/backend/utils/mmgr/portalmem.c
+++ b/src/backend/utils/mmgr/portalmem.c
@@ -210,6 +210,7 @@ CreatePortal(const char *name, bool allowDup, bool dupSilent)
     portal->cleanup = PortalCleanup;
     portal->createSubid = GetCurrentSubTransactionId();
     portal->activeSubid = portal->createSubid;
+    portal->createLevel = GetCurrentTransactionNestLevel();
     portal->strategy = PORTAL_MULTI_QUERY;
     portal->cursorOptions = CURSOR_OPT_NO_SCROLL;
     portal->atStart = true;
@@ -657,6 +658,7 @@ HoldPortal(Portal portal)
      */
     portal->createSubid = InvalidSubTransactionId;
     portal->activeSubid = InvalidSubTransactionId;
+    portal->createLevel = 0;
 }

 /*
@@ -940,6 +942,7 @@ PortalErrorCleanup(void)
 void
 AtSubCommit_Portals(SubTransactionId mySubid,
                     SubTransactionId parentSubid,
+                    int parentLevel,
                     ResourceOwner parentXactOwner)
 {
     HASH_SEQ_STATUS status;
@@ -954,6 +957,7 @@ AtSubCommit_Portals(SubTransactionId mySubid,
         if (portal->createSubid == mySubid)
         {
             portal->createSubid = parentSubid;
+            portal->createLevel = parentLevel;
             if (portal->resowner)
                 ResourceOwnerNewParent(portal->resowner, parentXactOwner);
         }
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 2968c7f7b7..dca1bc8afc 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -678,10 +678,25 @@ FreeSnapshot(Snapshot snapshot)
  */
 void
 PushActiveSnapshot(Snapshot snap)
+{
+    PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
+}
+
+/*
+ * PushActiveSnapshotWithLevel
+ *        Set the given snapshot as the current active snapshot
+ *
+ * Same as PushActiveSnapshot except that caller can specify the
+ * transaction nesting level that "owns" the snapshot.  This level
+ * must not be deeper than the current top of the snapshot stack.
+ */
+void
+PushActiveSnapshotWithLevel(Snapshot snap, int snap_level)
 {
     ActiveSnapshotElt *newactive;

     Assert(snap != InvalidSnapshot);
+    Assert(ActiveSnapshot == NULL || snap_level >= ActiveSnapshot->as_level);

     newactive = MemoryContextAlloc(TopTransactionContext, sizeof(ActiveSnapshotElt));

@@ -695,7 +710,7 @@ PushActiveSnapshot(Snapshot snap)
         newactive->as_snap = snap;

     newactive->as_next = ActiveSnapshot;
-    newactive->as_level = GetCurrentTransactionNestLevel();
+    newactive->as_level = snap_level;

     newactive->as_snap->active_count++;

diff --git a/src/include/utils/portal.h b/src/include/utils/portal.h
index 2e5bbdd0ec..37b1817ae9 100644
--- a/src/include/utils/portal.h
+++ b/src/include/utils/portal.h
@@ -130,6 +130,7 @@ typedef struct PortalData
      */
     SubTransactionId createSubid;    /* the creating subxact */
     SubTransactionId activeSubid;    /* the last subxact with activity */
+    int            createLevel;    /* creating subxact's nesting level */

     /* The query or queries the portal will execute */
     const char *sourceText;        /* text of query (as of 8.4, never NULL) */
@@ -219,6 +220,7 @@ extern void AtCleanup_Portals(void);
 extern void PortalErrorCleanup(void);
 extern void AtSubCommit_Portals(SubTransactionId mySubid,
                                 SubTransactionId parentSubid,
+                                int parentLevel,
                                 ResourceOwner parentXactOwner);
 extern void AtSubAbort_Portals(SubTransactionId mySubid,
                                SubTransactionId parentSubid,
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 44539fe15a..c6a176cc95 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -114,6 +114,7 @@ extern void InvalidateCatalogSnapshot(void);
 extern void InvalidateCatalogSnapshotConditionally(void);

 extern void PushActiveSnapshot(Snapshot snapshot);
+extern void PushActiveSnapshotWithLevel(Snapshot snapshot, int snap_level);
 extern void PushCopiedSnapshot(Snapshot snapshot);
 extern void UpdateActiveSnapshotCommandId(void);
 extern void PopActiveSnapshot(void);
diff --git a/src/pl/plpgsql/src/expected/plpgsql_transaction.out b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
index f79f847321..254e5b7a70 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_transaction.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
@@ -430,6 +430,34 @@ SELECT * FROM test1;
 ---+---
 (0 rows)

+-- test commit/rollback inside exception handler, too
+TRUNCATE test1;
+DO LANGUAGE plpgsql $$
+BEGIN
+    FOR i IN 1..10 LOOP
+      BEGIN
+        INSERT INTO test1 VALUES (i, 'good');
+        INSERT INTO test1 VALUES (i/0, 'bad');
+      EXCEPTION
+        WHEN division_by_zero THEN
+            INSERT INTO test1 VALUES (i, 'exception');
+            IF (i % 3) > 0 THEN COMMIT; ELSE ROLLBACK; END IF;
+      END;
+    END LOOP;
+END;
+$$;
+SELECT * FROM test1;
+ a  |     b
+----+-----------
+  1 | exception
+  2 | exception
+  4 | exception
+  5 | exception
+  7 | exception
+  8 | exception
+ 10 | exception
+(7 rows)
+
 -- detoast result of simple expression after commit
 CREATE TEMP TABLE test4(f1 text);
 ALTER TABLE test4 ALTER COLUMN f1 SET STORAGE EXTERNAL; -- disable compression
diff --git a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
index 888ddccace..8d76d00daa 100644
--- a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
+++ b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
@@ -354,6 +354,27 @@ $$;
 SELECT * FROM test1;


+-- test commit/rollback inside exception handler, too
+TRUNCATE test1;
+
+DO LANGUAGE plpgsql $$
+BEGIN
+    FOR i IN 1..10 LOOP
+      BEGIN
+        INSERT INTO test1 VALUES (i, 'good');
+        INSERT INTO test1 VALUES (i/0, 'bad');
+      EXCEPTION
+        WHEN division_by_zero THEN
+            INSERT INTO test1 VALUES (i, 'exception');
+            IF (i % 3) > 0 THEN COMMIT; ELSE ROLLBACK; END IF;
+      END;
+    END LOOP;
+END;
+$$;
+
+SELECT * FROM test1;
+
+
 -- detoast result of simple expression after commit
 CREATE TEMP TABLE test4(f1 text);
 ALTER TABLE test4 ALTER COLUMN f1 SET STORAGE EXTERNAL; -- disable compression

Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
On 9/30/21 7:16 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> [ v2-0003-EnsurePortalSnapshotExists-failed-assertion.patch ]
> Looking through this, I think you were overenthusiastic about applying
> PushActiveSnapshotWithLevel.  We don't really need to use it except in
> the places where we're setting portalSnapshot, because other than those
> cases we don't have a risk of portalSnapshot becoming a dangling pointer.
> Also, I'm quite nervous about applying PushActiveSnapshotWithLevel to
> snapshots that aren't created by the portal machinery itself, because
> we don't know very much about where passed-in snapshots came from or
> what the caller thinks their lifespan is.

Oh right, I did not think about it, thanks!

>
> The attached revision therefore backs off to only using the new code
> in the two places where we really need it.

Ok, so in PortalRunUtility() and EnsurePortalSnapshotExists().

>    I made a number of
> more-cosmetic revisions too.
thanks!
> Notably, I think it's useful to frame
> the testing shortcoming as "we were not testing COMMIT/ROLLBACK
> inside a plpgsql exception block".  So I moved the test code to the
> plpgsql tests and made it check ROLLBACK too.
Indeed, makes sense.
>
>                          regards, tom lane
>
> PS: Memo to self: in the back branches, the new field has to be
> added at the end of struct Portal.

out of curiosity, why?

Thanks

Bertrand




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
Tom Lane
Date:
"Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
> On 9/30/21 7:16 PM, Tom Lane wrote:
>> PS: Memo to self: in the back branches, the new field has to be
>> added at the end of struct Portal.

> out of curiosity, why?

Sticking it into the middle would create an ABI break for any
extension code that's looking at struct Portal, due to movement
of existing field offsets.  In HEAD that's fine, so we should
put the field where it makes the most sense.  But we have to
be careful about changing globally-visible structs in released
branches.

            regards, tom lane



Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

From
"Drouvot, Bertrand"
Date:
On 9/30/21 8:25 PM, Tom Lane wrote:
> "Drouvot, Bertrand" <bdrouvot@amazon.com> writes:
>> On 9/30/21 7:16 PM, Tom Lane wrote:
>>> PS: Memo to self: in the back branches, the new field has to be
>>> added at the end of struct Portal.
>> out of curiosity, why?
> Sticking it into the middle would create an ABI break for any
> extension code that's looking at struct Portal, due to movement
> of existing field offsets.  In HEAD that's fine, so we should
> put the field where it makes the most sense.  But we have to
> be careful about changing globally-visible structs in released
> branches.

Got it, thanks!

Bertrand