Thread: Segfault in backend CTE code

Segfault in backend CTE code

From

Phil Sorber

Date:

24 January 2012, 04:29:52

Running Postgres 9.1.2.

I've attached a backtrace. Looking at the backtrace it looks like
ExecGetResultType() gets called with a NULL planstate and causes the
segmentation fault:

https://github.com/postgres/postgres/blob/master/src/backend/executor/execUtils.c#L470

Following the stack I see that an optimization for writeable CTE's
inserts a NULL subplanstate:

https://github.com/postgres/postgres/blob/master/src/backend/executor/execMain.c#L2344

ExecInitCteScan() is what eventually passes it to ExecGetResultType():

https://github.com/postgres/postgres/blob/master/src/backend/executor/nodeCtescan.c#L255

I've also attached a proposed fix. In this optimized case it says that
we won't ever use the subplan anyway, so I figured that not setting
the scan tuple type won't matter. I also added an Assert() to
ExecGetResultType(). I modified the declaration of 'slot' to remove a
compiler warning. This patch is against master but should backport to
9.1 cleanly. It also passed all regression tests. If you end up using
this patch please also credit Rick Pufky who helped me with this.

Attachment

Re: Segfault in backend CTE code

From

Tom Lane

Date:

24 January 2012, 04:43:32

Phil Sorber <phil@omniti.com> writes:
> I've attached a backtrace.

How about a test case?  I have no faith in the proposed patch without
some evidence of what it's supposed to fix.

            regards, tom lane

Re: Segfault in backend CTE code

From

Phil Sorber

Date:

24 January 2012, 19:23:07

On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Phil Sorber <phil@omniti.com> writes:
>> I've attached a backtrace.
>
> How about a test case?  I have no faith in the proposed patch without
> some evidence of what it's supposed to fix.

We are having trouble coming up with a test case that can reliably
reproduce this. I has happened 5 times since Friday. There was another
one this morning and we have more logging turned on now and I have
attached more information. Is there anything else we can do to help
debug it? This is a production machine and we may be forced into
applying the patch we came up with, but we wanted to have as good of a
solution as possible.

>
>                        regards, tom lane

I have a new backtrace which is almost identical to the old one. Only
memory addresses differ. I am attaching it anyway. I am also attaching
the pertinent logs with sensitive information changed. The log lines
are just the entries for the PID that segfaulted from right before it
happened. The PID was long lived but nothing was going on for minutes
of time before the segfault. I am also attaching the source of the
function that was called right before the segfault. Again with
sensitive information changed.

Thanks.

Attachment

Re: Segfault in backend CTE code

From

Tom Lane

Date:

24 January 2012, 20:03:33

Phil Sorber <phil@omniti.com> writes:
> On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> How about a test case?

> We are having trouble coming up with a test case that can reliably
> reproduce this.

The stack traces run through the EvalPlanQual machinery, which is only
going to be entered when attempting to update/delete a row that's been
updated by a concurrent transaction.  So what I'd suggest for creating a
test case is

    (1) in a psql session, do
        BEGIN;
        UPDATE some-target-row;

    (2) in another psql session, call this function with arguments
        that will make it try to update that same row; it should
        block.

    (3) in the first session, COMMIT to unblock.

FWIW, having re-examined your patch with some caffeine in me, I don't
think it's right at all.  You can't just blow off setting the scan type
for a CTEScan node.  What it looks like to me is that the EvalPlanQual
code is not initializing the new execution tree correctly; but that
idea would be a lot easier to check into with a test case.

            regards, tom lane

Re: Segfault in backend CTE code

From

Phil Sorber

Date:

25 January 2012, 16:54:05

On Tue, Jan 24, 2012 at 4:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Phil Sorber <phil@omniti.com> writes:
>> On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> How about a test case?
>
>> We are having trouble coming up with a test case that can reliably
>> reproduce this.
>
> The stack traces run through the EvalPlanQual machinery, which is only
> going to be entered when attempting to update/delete a row that's been
> updated by a concurrent transaction.  So what I'd suggest for creating a
> test case is
>
>        (1) in a psql session, do
>                BEGIN;
>                UPDATE some-target-row;
>
>        (2) in another psql session, call this function with arguments
>            that will make it try to update that same row; it should
>            block.
>
>        (3) in the first session, COMMIT to unblock.
>

That helped a lot. I now have a simple test case that I can reliably
re-produce the segfault and now also a patch that fixes it. I had to
modify the patch slightly because while it fixed the first problem, it
just cascaded to another NULL deref from the same root cause. Both are
attached.

> FWIW, having re-examined your patch with some caffeine in me, I don't
> think it's right at all.  You can't just blow off setting the scan type
> for a CTEScan node.  What it looks like to me is that the EvalPlanQual
> code is not initializing the new execution tree correctly; but that
> idea would be a lot easier to check into with a test case.
>

If I understand what you are saying, then I agree that is the root of
the problem. The comment label's it as an optimization, but then later
fails to account for all the changes needed. My patch accounts for at
least one extra change that is needed. We could also remove the
"optimization" but I assumed it was there for a reason, especially
given the fact that someone took the time to make a comment about it.

The change was made in this commit by you:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;f=src/backend/executor/execMain.c;h=389af951552ff2209eae3e62fa147fef12329d4f

>                        regards, tom lane

Attachment

Re: Segfault in backend CTE code

From

Phil Sorber

Date:

25 January 2012, 20:01:55

I screwed up cut and paste when putting the test case together. The
reference to table user_data should be t1.

On Wed, Jan 25, 2012 at 12:47 PM, Phil Sorber <phil@omniti.com> wrote:
> On Tue, Jan 24, 2012 at 4:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Phil Sorber <phil@omniti.com> writes:
>>> On Tue, Jan 24, 2012 at 12:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> How about a test case?
>>
>>> We are having trouble coming up with a test case that can reliably
>>> reproduce this.
>>
>> The stack traces run through the EvalPlanQual machinery, which is only
>> going to be entered when attempting to update/delete a row that's been
>> updated by a concurrent transaction.  So what I'd suggest for creating a
>> test case is
>>
>>        (1) in a psql session, do
>>                BEGIN;
>>                UPDATE some-target-row;
>>
>>        (2) in another psql session, call this function with arguments
>>            that will make it try to update that same row; it should
>>            block.
>>
>>        (3) in the first session, COMMIT to unblock.
>>
>
> That helped a lot. I now have a simple test case that I can reliably
> re-produce the segfault and now also a patch that fixes it. I had to
> modify the patch slightly because while it fixed the first problem, it
> just cascaded to another NULL deref from the same root cause. Both are
> attached.
>
>> FWIW, having re-examined your patch with some caffeine in me, I don't
>> think it's right at all.  You can't just blow off setting the scan type
>> for a CTEScan node.  What it looks like to me is that the EvalPlanQual
>> code is not initializing the new execution tree correctly; but that
>> idea would be a lot easier to check into with a test case.
>>
>
> If I understand what you are saying, then I agree that is the root of
> the problem. The comment label's it as an optimization, but then later
> fails to account for all the changes needed. My patch accounts for at
> least one extra change that is needed. We could also remove the
> "optimization" but I assumed it was there for a reason, especially
> given the fact that someone took the time to make a comment about it.
>
> The change was made in this commit by you:
>
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;f=src/backend/executor/execMain.c;h=389af951552ff2209eae3e62fa147fef12329d4f
>
>>                        regards, tom lane

Attachment

CTE_segfault_test_case_v2.txt

Re: Segfault in backend CTE code

From

Tom Lane

Date:

25 January 2012, 21:14:03

Phil Sorber <phil@omniti.com> writes:
> That helped a lot. I now have a simple test case that I can reliably
> re-produce the segfault and now also a patch that fixes it.

[ pokes at that... ]  Hmm.  I think what this proves is that that
optimization in EvalPlanQualStart is just too cute for its own good,
and that the only safe fix is to take it out.  That code path is
(obviously) none too well tested, so we should not have it setting
up execution tree structures that are not like those used normally.
It's just begging for trouble.

            regards, tom lane

Re: Segfault in backend CTE code

From

Phil Sorber

Date:

27 January 2012, 19:03:15

On Wed, Jan 25, 2012 at 5:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Phil Sorber <phil@omniti.com> writes:
>> That helped a lot. I now have a simple test case that I can reliably
>> re-produce the segfault and now also a patch that fixes it.
>
> [ pokes at that... ] =A0Hmm. =A0I think what this proves is that that
> optimization in EvalPlanQualStart is just too cute for its own good,
> and that the only safe fix is to take it out. =A0That code path is
> (obviously) none too well tested, so we should not have it setting
> up execution tree structures that are not like those used normally.
> It's just begging for trouble.

I played around with removing the optimization, but there are other
pieces further down the line that are upset but having a ModifyTable
node in the execution tree. Specifically this:

http://git.postgresql.org/gitweb/?p=3Dpostgresql.git;a=3Dblob;f=3Dsrc/backe=
nd/executor/nodeModifyTable.c;h=3Dcf32dc569037f710ce6c43c4c93ee3a10cabe085;=
hb=3D389af951552ff2209eae3e62fa147fef12329d4f#l900

Not sure at all how to proceed from there so I am deferring. Perhaps
the original authors can be asked to weigh in? We changed our writable
CTE queries to update/insert loops so this is no longer a blocker for
us.

>
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0regards, tom lane

Re: Segfault in backend CTE code

From

Tom Lane

Date:

28 January 2012, 21:59:57

Phil Sorber <phil@omniti.com> writes:
> On Wed, Jan 25, 2012 at 5:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I played around with removing the optimization, but there are other
> pieces further down the line that are upset but having a ModifyTable
> node in the execution tree.

Hm, yeah, obviously this scenario has never been tested :-(.  I have
applied a patch for it, and also done some work so that it will get
tested by the buildfarm in future.  Thanks for the report!

> We changed our writable CTE queries to update/insert loops so this is
> no longer a blocker for us.

FWIW, that technique didn't really work anyway, as even with the patch
I observe that you get a "duplicate unique key" failure if two such
commands try to insert a new row concurrently.  This is because neither
UPDATE subquery will see an as-yet-uncommitted inserted row.

            regards, tom lane