From: Tom Lane
Subject: Re: BUG #16811: Severe reproducible server backend crash
Msg-id: 3965163.1612382840@sss.pgh.pa.us
In response to: Re: BUG #16811: Severe reproducible server backend crash (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
I wrote:
> I can reproduce it without anything extra.  What's needed is to run
> the problematic statement in extended query mode, which you can
> do like this:

> $ cat foo.sql
> do $$ begin rollback; end $$;

> $ pgbench -n -f foo.sql -M prepared
> pgbench: error: client 0 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing

I spent some more time poking at this today.  Thomas' intuition that
commit 1cff1b95a is related was right: if you bisect with this test
case, it'll finger that as the first bad commit.  But I think that's
blaming the messenger.  What's happening is that in extended-query
mode, the Portal containing the DO operation does not have its own
copy of the parser-output statement list; portal->stmts is pointing
at a list belonging to a CachedPlan.  And if the DO decides to roll
back the current transaction, we get to PortalReleaseCachedPlan
which does

        /*
         * We must also clear portal->stmts which is now a dangling reference
         * to the cached plan's plan list.  This protects any code that might
         * try to examine the Portal later.
         */
        portal->stmts = NIL;

"Examining the portal later" is not the problem though.  Way back up
the call stack is PortalRunMulti, which is iterating over that same
statement list, and is now doing so with a dangling pointer.  So
the crash we're seeing occurs because portal->stmts no longer matches
the list pointer that PortalRunMulti has.  But the dangling-pointer
hazard existed before that.  There are two reasons why we don't crash
pre-v13:

* A transaction abort that didn't cause longjmp out of PortalRunMulti
could only have happened inside a CALL or DO block.  Those are
utility statements, so there is no way for them to be part of a
multi-statement list.  Therefore there's no hazard of PortalRunMulti
trying to execute additional statements within the same portal (which
I doubt would work if it did try).

* Again because this must be a utility command, the CachedPlan we
are using must be a generic one, therefore its CachedPlanSource
is holding a refcount on it.  Thus, when PortalReleaseCachedPlan
releases the Portal's refcount, the CachedPlan's refcount does not
go to zero and so it doesn't get freed, and thus the "dangling"
list pointer can still be dereferenced to discover that indeed
we are at the end of the list.
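
To see both effects in one place, here is a throwaway C model (this is
not PostgreSQL code; the toy_* structs and release_cached_plan() are
invented stand-ins for List, CachedPlan, Portal, and
PortalReleaseCachedPlan): the loop keeps consulting the list header
through its own saved pointer, and whether that read is safe depends
entirely on whether somebody else still holds a reference to the plan's
storage.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for a v13-style List: the length lives in the header. */
    struct toy_list
    {
        int         length;
        int         elems[1];
    };

    /* Stand-in for a CachedPlan that owns the statement list. */
    struct toy_plan
    {
        int         refcount;
        struct toy_list *stmt_list;
    };

    /* Stand-in for a Portal whose stmts points at the plan's list, not a copy. */
    struct toy_portal
    {
        struct toy_plan *plan;
        struct toy_list *stmts;
    };

    /*
     * Models PortalReleaseCachedPlan: clear portal->stmts and drop the
     * portal's reference; the list is freed only when no other reference
     * remains.
     */
    static void
    release_cached_plan(struct toy_portal *portal)
    {
        struct toy_plan *plan = portal->plan;

        portal->stmts = NULL;
        portal->plan = NULL;
        if (--plan->refcount == 0)
        {
            free(plan->stmt_list);
            free(plan);
        }
    }

    int
    main(void)
    {
        struct toy_plan  *plan = malloc(sizeof(*plan));
        struct toy_portal portal;

        plan->stmt_list = malloc(sizeof(struct toy_list));
        plan->stmt_list->length = 1;
        plan->stmt_list->elems[0] = 42;
        plan->refcount = 2;         /* CachedPlanSource + Portal, the pre-v13 case;
                                     * set this to 1 to model the v13 crash */

        portal.plan = plan;
        portal.stmts = plan->stmt_list;

        /*
         * Models PortalRunMulti: iterate using a list pointer saved before
         * the rollback, re-reading the header on every loop test.
         */
        struct toy_list *iter = portal.stmts;

        for (int i = 0; i < iter->length; i++)
        {
            printf("executing stmt %d\n", iter->elems[i]);
            release_cached_plan(&portal);   /* the DO block's rollback lands here */
        }

        /*
         * The final loop test re-read iter->length after the release.  With
         * refcount 2 that memory is still allocated, so we "discover we are
         * at the end of the list" and exit cleanly; with refcount 1 it is a
         * read of freed memory.
         */
        return 0;
    }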

This all seems like a house of cards: it really gives me the
willies that we're allowing AtAbort_Portals to release resources
belonging to an active portal.  However, I don't see any near-term
way around that.  If it doesn't release resources then the transaction
abort mechanisms will complain (and release the resources anyway).

So I think that Thomas' "cargo cult" patch is on the right track:
what we have to do is make PortalRunMulti robust against the
possibility that the portal was zapped underneath it.  But I feel
that we ought to prevent it from trying to iterate the foreach()
as well as not doing the lnext(), so I'm intending to add
something like

    if (portal->stmts == NIL)
        break;

rather than only skipping the lnext().  We can reorder the loop
steps to do the lnext() at the bottom, and then this'll work easily.
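
In rough outline the restructured loop would end up looking something
like this (just a sketch of the idea, not the actual patch):

    ListCell   *stmtlist_item = list_head(portal->stmts);

    while (stmtlist_item != NULL)
    {
        PlannedStmt *pstmt = lfirst_node(PlannedStmt, stmtlist_item);

        /* ... run pstmt via PortalRunUtility or ProcessQuery ... */

        /*
         * If the statement rolled back the transaction, the portal's cached
         * plan is gone and portal->stmts has been cleared; stop here rather
         * than chasing the dangling list.
         */
        if (portal->stmts == NIL)
            break;

        stmtlist_item = lnext(portal->stmts, stmtlist_item);
    }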

Also, even though we don't see this crash before v13, I want to
back-patch further.  Accessing a cache entry we no longer have any
refcount for is just asking for trouble, even if it accidentally
fails to fail right now.

            regards, tom lane


