Re: SegFault on 9.6.14 - Mailing list pgsql-hackers
From | Amit Kapila
---|---
Subject | Re: SegFault on 9.6.14
Date |
Msg-id | CAA4eK1LaW+hE7N_ztef+xhF8XW0Y_fu_3psPY1-OyMAGkOZV0A@mail.gmail.com
In response to | Re: SegFault on 9.6.14 (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: SegFault on 9.6.14
List | pgsql-hackers
On Fri, Aug 9, 2019 at 6:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Aug 7, 2019 at 5:45 AM vignesh C <vignesh21@gmail.com> wrote:
> > I have made a patch based on the above lines.
> > I have tested the scenarios which Thomas had shared in the earlier
> > mail and a few more tests based on Thomas's tests.
> > I'm not sure if we will be going ahead with this solution or not.
> > Let me know your opinion on the same.
> > If you feel this approach is ok, we can add a few of these tests into
> > the pg tests.
>
> I think this patch is bizarre:
>
> - It introduces a new function called getParallelModeLevel(), which is
> randomly capitalized differently from the other functions that do
> similar things, and then uses it to do the same thing that could have
> been done with the existing function IsInParallelMode().
> - It contains an "if" statement whose only content is an Assert().
> Don't write if (a) Assert(b); write Assert(!a || b).
> - It contains zero lines of comment changes, which is obviously not
> enough for a patch that proposes to fix a very thorny issue. This
> failure has two parts. First, it adds no new comments to explain the
> bug being fixed or the theory of operation of the new code. Second, it
> does not even bother updating existing comments that are falsified by
> the patch, such as the function header comments for
> ExecParallelCleanup and ExecShutdownGather.
> - It changes what ExecParallelCleanup does while adjusting only one of
> the two callers to match the behavior change. nodeGatherMerge.c
> manages to be completely untouched by this patch. If you change what a
> function does, you really need to grep for all the calls to that
> function and adjust all callers to match the new set of expectations.
>
> It's a little hard to get past all of those issues and look at what
> the patch actually does, but I'm going to try: the theory of operation
> of the patch seems to be that we can skip destroying the parallel
> context when performing ExecParallelCleanup, and in fact when exiting
> parallel mode, and then when we get to executor end time the context
> will still be there and we can fish the instrumentation out of it. But
> this seems problematic for several reasons. For one thing, as Amit
> already observed, the code currently contains an assertion which
> ensures that a ParallelContext can't outlive the time spent in parallel
> mode, and it doesn't seem desirable to relax that assertion (this
> patch removes it).
>
> But beyond that, the issue here is that the Limit node is shutting
> down the Gather node too early, and the right fix must be to stop
> doing that, not to change the definition of what it means to shut down
> a node, as this patch does. So maybe a possible approach here - which
> I think is more or less what Tom is proposing - is:
>
> 1. Remove the code from ExecLimit() that calls ExecShutdownNode().

The attached patch does that. I have also added one test as a separate
patch, so that if we later introduce shutting down resources in the Limit
node, we don't break anything. As of now, I have kept it separate for
easy verification, but if we decide to go with this approach and the test
appears fine, we can merge it along with the fix.

> 2. Adjust ExecutePlan() so that it ensures that ExecShutdownNode()
> gets called at the very end of execution, at least when execute_once
> is set, before exiting parallel mode.

I am not sure if I completely understand this point. AFAICS,
ExecShutdownNode() is called whenever we are done getting the tuples.
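To illustrate, here is a simplified sketch of ExecutePlan()'s main loop
(abridged from the v11-era execMain.c from memory — declarations,
backward-scan handling, and other details are omitted, so treat this as a
sketch rather than the exact source):

```c
for (;;)
{
    /* Reset the per-output-tuple exprcontext each time through */
    ResetPerTupleExprContext(estate);

    /* Execute the plan and obtain a tuple */
    slot = ExecProcNode(planstate);

    /* Plan is exhausted: allow nodes to release/shut down resources */
    if (TupIsNull(slot))
    {
        (void) ExecShutdownNode(planstate);
        break;
    }

    /* Send the tuple to the destination; it may refuse more tuples */
    if (sendTuples && !dest->receiveSlot(slot, dest))
        break;                  /* no ExecShutdownNode() on this path */

    /* Quit once we have returned the requested number of tuples */
    current_tuple_count++;
    if (numberTuples && numberTuples == current_tuple_count)
    {
        (void) ExecShutdownNode(planstate);
        break;
    }
}
```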
One place where it is not there in that function is when we assume the
destination is closed — the receiveSlot path visible in the sketch above:

```c
ExecutePlan()
{
    ...
    if (!dest->receiveSlot(slot, dest))
        break;          /* stops without shutting down the plan tree */
    ...
}
```

Do you expect this case to be dealt with as well, or do you have
something else in mind? The other possibility could be that we move the
shutdown of the node to the end of the function, where we exit parallel
mode, but doing that led to some regression failures on my machine. I
will investigate the same.

> 3. Figure out, possibly at a later time or only in HEAD, how to make
> the early call to ExecShutdownNode() in ExecLimit() safe, and then put
> it back.

Okay, Tom has suggested a design to address this, but that will be for
HEAD only. To be clear, I am not planning to spend time on that at this
moment, but OTOH, I want the bug reported in this thread to be closed,
so for now we need to proceed with some minimal fix as mentioned by you
in the above two points. If someone else can write a patch, I can help
with the review.
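For reference, the early shutdown in ExecLimit() that point 1 removes
(and that point 3 would eventually reinstate in a safe form) looks
roughly like this — a sketch of the LIMIT_INWINDOW exit in the 9.6-era
nodeLimit.c, not the exact code, and the guard details vary across
branches:

```c
if (!node->noCount &&
    node->position - node->offset >= node->count)
{
    /* The limit is satisfied; remember that the window has ended */
    node->lstate = LIMIT_WINDOWEND;

    /*
     * The problematic early shutdown: releasing the subtree's
     * resources (e.g. a Gather's workers) here crashes if the
     * executor touches the shut-down subtree again later.
     */
    (void) ExecShutdownNode(outerPlan);

    return NULL;
}
```

The attraction of shutting down here is that parallel workers are
released as soon as the LIMIT is satisfied; the bug in this thread shows
what goes wrong when the subtree is accessed again afterwards.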
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment