Re: SegFault on 9.6.14 - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: SegFault on 9.6.14 |
Date | |
Msg-id | CALDaNm0ryAiW1caO0yhDxt-Q5WRjQJWfZn6ZmSRc827+UjKCgw@mail.gmail.com Whole thread Raw |
In response to | Re: SegFault on 9.6.14 (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: SegFault on 9.6.14
Re: SegFault on 9.6.14 |
List | pgsql-hackers |
On Wed, Jul 31, 2019 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 31, 2019 at 12:05 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Jul 18, 2019 at 9:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > I think this is going in the wrong direction. Nodes should *always* > > > assume that a rescan is possible until ExecEndNode is called. > > > If you want to do otherwise, you are going to be inventing a whole > > > bunch of complicated and doubtless-initially-buggy control logic > > > to pass down information about whether a rescan might be possible. > > > That doesn't sound like a recipe for a back-patchable fix. Perhaps > > > we could consider redesigning the rules around REWIND in a future > > > version, but that's not where to focus the bug fix effort. > > > > So, if I can summarize how we got here, as best I understand it: > > > > Thanks for the summarization. This looks mostly correct to me. > > > 0. The historic behavior of the executor is to assume it's OK to leak > > resources for the lifetime of the query. Nodes that are executed to > > completion generally do some cleanup, but we feel free (as under > > Limit) to just stop executing a node without giving it any hint that > > it should release resources. So a Sort may hold onto a terabyte of > > memory and an index scan may keep holding a pin even after there's no > > theoretical way of ever needing those resources again, and we just > > don't care. > > > > 1. Parallel query made that perhaps-already-shaky assumption a lot > > more problematic. Partly that's because workers are a a more scarce > > and considerably heavier resource than anything else, and moreover act > > as a container for anything else, so whatever you were leaking before, > > you can now leak N times more of it, plus N processes, until the end > > of the query. However, there's a correctness reason too, which is that > > when a node has a copy in the leader and a copy in every worker, each > > copy has its own instrumentation data (startup time, run time, nloops, > > etc) and we can only fold all that together once the node is done > > executing, because it's really hard to add up a bunch of numbers > > before the numbers are done changing. We could've made the > > instrumentation shared throughout, but if we had, we could have > > contention for updating the instrumentation data, which seems like > > it'd be bad. > > > > 2. To fix that correctness problem, we decided to try to shut down the > > node under a limit node when we're done with it (commit > > 85c9d3475e4f680dbca7c04fe096af018f3b8760). At a certain level, this > > looks fundamentally necessary to me. If you're going to have N > > separate copies of the instrumentation, and you want to add them up > > when you're done, then you have to decide to be done at some point; > > otherwise you don't know when to add them up, and maybe won't add them > > up at all, and then you'll be sad. This does not mean that the exact > > timing couldn't be changed somehow, but if you want a correct > > implementation, you have to shut down Limit's sub-node after you're > > done executing it (so that you can get the instrumentation data from > > the workers after it's final) and before you start destroying DSM > > segments and stuff (so that you get the instrumentation data from the > > workers before it vanishes). > > > > 3. The aforementioned commit turned out to be buggy in at least to two > > ways, precisely because it didn't do a good enough job predicting when > > the Limit needed to be shut down. First, there was commit > > 2cd0acfdade82f3cab362fd9129d453f81cc2745, where we missed the fact > > that you could hit the Limit and then back up. > > > > We have not missed it, rather we decided to it separately because it > appears to impact some different cases as well [1][2]. > > > Second, there's the > > present issue, where the Limit gets rescanned. > > > > So, given all that, if we want to adopt Tom's position that we should > > always cater to a possible rescan, then we're going to have to rethink > > the way that instrumentation data gets consolidated from workers into > > the leader in such a way that we can consolidate multiple times > > without ending up with the wrong answer. > > > > The other idea we had discussed which comes closer to adopting Tom's > position was that during ExecShutdownNode, we just destroy parallel > workers, collect instrumentation data and don't destroy the parallel > context. The parallel context could be destroyed in ExecEndNode > (ExecEndGather(Merge)) code path. The problem with this idea is that > ExitParallelMode doesn't expect parallel context to be active. Now, > we can either change the location of Exit/EnterParallelMode or relax > that restriction. As mentioned above that restriction appears good to > me, so I am not in favor of changing it unless we have some other > solid way to install it. I am not sure if this idea is better than > other approaches we are discussing. > > I have made a patch based on the above lines. I have tested the scenarios which Thomas had shared in the earlier mail and few more tests based on Thomas's tests. I'm not sure if we will be going ahead with this solution or not. Let me know your opinion on the same. If you feel this approach is ok, we can add few of this tests into pg tests. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Attachment
pgsql-hackers by date: