Re: SegFault on 9.6.14 - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: SegFault on 9.6.14
Date
Msg-id CAA4eK1L_TU2yQQem0FJUO+a9wB8F31W0fg-5U1Ug3nj74+uxAw@mail.gmail.com
In response to Re: SegFault on 9.6.14  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Jul 31, 2019 at 12:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 18, 2019 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think this is going in the wrong direction.  Nodes should *always*
> > assume that a rescan is possible until ExecEndNode is called.
> > If you want to do otherwise, you are going to be inventing a whole
> > bunch of complicated and doubtless-initially-buggy control logic
> > to pass down information about whether a rescan might be possible.
> > That doesn't sound like a recipe for a back-patchable fix.  Perhaps
> > we could consider redesigning the rules around REWIND in a future
> > version, but that's not where to focus the bug fix effort.
>
> So, if I can summarize how we got here, as best I understand it:
>

Thanks for the summarization.  This looks mostly correct to me.

> 0. The historic behavior of the executor is to assume it's OK to leak
> resources for the lifetime of the query.  Nodes that are executed to
> completion generally do some cleanup, but we feel free (as under
> Limit) to just stop executing a node without giving it any hint that
> it should release resources.  So a Sort may hold onto a terabyte of
> memory and an index scan may keep holding a pin even after there's no
> theoretical way of ever needing those resources again, and we just
> don't care.
>
> 1. Parallel query made that perhaps-already-shaky assumption a lot
> more problematic. Partly that's because workers are a more scarce
> and considerably heavier resource than anything else, and moreover act
> as a container for anything else, so whatever you were leaking before,
> you can now leak N times more of it, plus N processes, until the end
> of the query. However, there's a correctness reason too, which is that
> when a node has a copy in the leader and a copy in every worker, each
> copy has its own instrumentation data (startup time, run time, nloops,
> etc) and we can only fold all that together once the node is done
> executing, because it's really hard to add up a bunch of numbers
> before the numbers are done changing.  We could've made the
> instrumentation shared throughout, but if we had, we could have
> contention for updating the instrumentation data, which seems like
> it'd be bad.
>
> 2. To fix that correctness problem, we decided to try to shut down the
> node under a limit node when we're done with it (commit
> 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this
> looks fundamentally necessary to me.  If you're going to have N
> separate copies of the instrumentation, and you want to add them up
> when you're done, then you have to decide to be done at some point;
> otherwise you don't know when to add them up, and maybe won't add them
> up at all, and then you'll be sad.  This does not mean that the exact
> timing couldn't be changed somehow, but if you want a correct
> implementation, you have to shut down Limit's sub-node after you're
> done executing it (so that you can get the instrumentation data from
> the workers after it's final) and before you start destroying DSM
> segments and stuff (so that you get the instrumentation data from the
> workers before it vanishes).
>
> 3. The aforementioned commit turned out to be buggy in at least to two
> ways, precisely because it didn't do a good enough job predicting when
> the Limit needed to be shut down.  First, there was commit
> 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact
> that you could hit the Limit and then back up.
>

We did not miss it; rather, we decided to handle it separately because
it appears to impact some different cases as well [1][2].

>  Second, there's the
> present issue, where the Limit gets rescanned.
>
> So, given all that, if we want to adopt Tom's position that we should
> always cater to a possible rescan, then we're going to have to rethink
> the way that instrumentation data gets consolidated from workers into
> the leader in such a way that we can consolidate multiple times
> without ending up with the wrong answer.
>

The other idea we had discussed which comes closer to adopting Tom's
position was that during ExecShutdownNode, we just destroy parallel
workers, collect instrumentation data and don't destroy the parallel
context.  The parallel context could be destroyed in ExecEndNode
(ExecEndGather(Merge)) code path.  The problem with this idea is that
ExitParallelMode doesn't expect parallel context to be active.  Now,
we can either change the location of Exit/EnterParallelMode or relax
that restriction.  As mentioned above that restriction appears good to
me, so I am not in favor of changing it unless we have some other
solid way to install it.   I am not sure if this idea is better than
other approaches we are discussing.

>  The other option is to do
> what I understand Amit and Thomas to be proposing, which is to do a
> better job identifying the case where we're "done for good" and can
> trigger the shutdown fearlessly.
>

Yes, this sounds like a safe fix for the back-branches.  We might want
to go with this for the back-branches and then see if we can come up
with a better fix for HEAD.

[1] - https://www.postgresql.org/message-id/CA%2BTgmoYAxqmE13UOOSU%3DmE-hBGnTfYakb3dOoOJ_043Oc%3D6Xug%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1KwCx9qQk%3DKo4LFTwoYg9B8TSccPAc%3DEoJR88rQpCYVdA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


