Re: SegFault on 9.6.14 - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: SegFault on 9.6.14
Date
Msg-id CA+hUKG+BO-inHQzeKVr4b9yxnUhGzgS=JKMiUZioyrULPV7P4Q@mail.gmail.com
Whole thread Raw
In response to Re: SegFault on 9.6.14  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: SegFault on 9.6.14
List pgsql-hackers
On Fri, Jul 26, 2019 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 23, 2019 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Right, that will be lesser code churn and it can also work.  However,
> > one thing that needs some thought is till now es_top_eflags is only
> > set in ExecutorStart and same is mentioned in comments where it is
> > declared and it seems we are going to change that with this idea.  How
> > about having a separate function ExecBlahShutdown which will clean up
> > resources as parallel context and can be called only from ExecutePlan
> > where we are calling ExecShutdownNode?  I think both these and the
> > other solution we have discussed are on similar lines and another idea
> > could be to relax the assert which again is not a superb idea.
>
> It seems we don't have a clear preference for any particular solution
> among these and neither there appears to be any better idea.  I guess
> we can wait for a few days to see if Robert has any views on this,
> otherwise, pick one of the above and move ahead.

I take the EXEC_FLAG_DONE idea back.  It's ugly and too hard to verify
that every appropriate path sets it, and a flag that means the
opposite would be even more of a kluge, and generally I think I was
looking at this too myopically: I was looking for a way to shut down
processes ASAP without giving up the shared memory we'll need for
rescanning, but what I should have been looking at is the reason you
did that in the first place: to get the instrumentation data.  Can you
explain why it's necessary to do that explicitly for Limit?  Wouldn't
the right place to collect instrumentation be at the end of execution
when Shutdown will run in all cases anyway (and possibly also during
ExecParallelReinitialize() or something like that if it's being
clobbered by rescans, I didn't check)?  What's special about Limit?

Today while poking at this and trying to answer those questions for
myself, I realised that  the repro I posted earlier[1] crashes exactly
as Jerry reported on REL9_6_STABLE, but in later release branches it
runs to completion.  That's because the crashing code was removed in
commit 41b0dd98 "Separate reinitialization of shared parallel-scan
state from ExecReScan.".

So newer branches get past that problem, but they all spit out tons of
each of these three warnings:

WARNING:  buffer refcount leak: [172] (rel=base/12558/16390,
blockNum=5, flags=0x93800000, refcount=1 2998)
...
WARNING:  relcache reference leak: relation "join_bar" not closed
...
WARNING:  Snapshot reference leak: Snapshot 0x7ff20383bfb0 still referenced
...

Oops.  I don't know exactly why yet, but the problem goes away if you
just comment out the offending ExecShutdownNode() call in nodeLimit.c.
I tried to understand whether the buffer stats were wrong with that
code commented out (Adrien Nayrat's original complaint[2]), but I ran
out of time for debugging adventures today.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJyqDp9FZSHLTjiNMcz-c6%3DRdStB%2BUjVZsR8wfHnJXy8Q%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/86137f17-1dfb-42f9-7421-82fd786b04a1%40anayrat.info

-- 
Thomas Munro
https://enterprisedb.com



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Thomas Munro
Date:
Subject: Re: LLVM compile failing in seawasp