Re: SegFault on 9.6.14 - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: SegFault on 9.6.14
Date
Msg-id CAA4eK1KXZwH+3ZsQi7PMQo53o+tDkK_gi=ASbPThAySsUg4cYA@mail.gmail.com
Whole thread Raw
In response to Re: SegFault on 9.6.14  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: SegFault on 9.6.14
List pgsql-hackers
On Thu, Oct 17, 2019 at 10:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Sep 13, 2019 at 1:35 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Sep 12, 2019 at 8:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Robert, Thomas, do you have any more suggestions related to this.  I
> > > am planning to commit the above-discussed patch (Forbid Limit node to
> > > shutdown resources.) coming Monday, so that at least the reported
> > > problem got fixed.
> >
> > I think that your commit message isn't very clear about what the
> > actual issue is.  And the patch itself doesn't add any comments or
> > anything to try to clear it up. So I wouldn't favor committing it in
> > this form.
>
> Is the proposed commit message at the bottom of this email an improvement?
>
> Do I understand correctly that, with this patch, we can only actually
> lose statistics in the case where we rescan?
>

No, it will lose without rescan as well.  To understand in detail, you
might want to read the emails pointed by me in one of the above email
[1] in this thread.

>  That is, precisely the
> case that crashes (9.6) or spews warnings (10+)?  In a quick
> non-rescan test with the ExecShutdownNode() removed, I don't see any
> problem with the buffer numbers on my screen:
>

Try by removing aggregate function.  Basically, the Limit node has to
finish before consuming all the rows sent by a parallel node beneath
it.

>
> ===
> Don't shut down Gather[Merge] early under Limit.
>
> Revert part of commit 19df1702f5.
>
> Early shutdown was added by that commit so that we could collect
> statistics from workers, but unfortunately it interacted badly with
> rescans.  Rescanning a Limit over a Gather node could produce a SEGV
> on 9.6 and resource leak warnings on later releases.  By reverting the
> early shutdown code, we might lose statistics in some cases of Limit
> over Gather, but that will require further study to fix.
>

How about some text like below?  I have added slightly different text
to explain the reason for the problem.

"Early shutdown was added by that commit so that we could collect
statistics from workers, but unfortunately, it interacted badly with
rescans.  The problem is that we ended up destroying the parallel
context which is required for rescans.  This leads to rescans of a
Limit node over a Gather node to produce unpredictable results as it
tries to access destroyed parallel context.  By reverting the early
shutdown code, we might lose statistics in some cases of Limit over
Gather, but that will require further study to fix."

I am not sure but we can even add a comment in the place where we are
removing some code (in nodeLimit.c) to indicate that 'Ideally we
should shutdown parallel resources here to get the correct stats, but
that would lead to rescans misbehaving when there is a Gather [Merge]
node beneath it.  (Explain the reason for misbehavior and the ideas we
discussed in this thread to fix the same) ........."

I can try to come up with comments in nodeLimit.c on the above lines
if we think that is a good idea?

[1] - https://www.postgresql.org/message-id/CAA4eK1Ja-eoavXcr0eq7w7hP%3D64VP49k%3DNMFxwhtK28NHfBOdA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Dilip Kumar
Date:
Subject: Re: Questions/Observations related to Gist vacuum
Next
From: Amit Kapila
Date:
Subject: Re: Questions/Observations related to Gist vacuum