Thread: [BUGS] BUG #14657: Server process segmentation fault in v10,May 10th dev snapshot

[BUGS] BUG #14657: Server process segmentation fault in v10,May 10th dev snapshot

From
sveinn.sveinsson@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      14657
Logged by:          Sveinn Sveinsson
Email address:      sveinn.sveinsson@gmail.com
PostgreSQL version: Unsupported/Unknown
Operating system:   Linux x86_64
Description:

The following causes segmentation fault in v10, May 10th development
snapshot:

create table test (sd timestamp,anb varchar(16)) partition by range (sd);

create table test_1 partition of test for values from ('2017-01-01') to
('2017-02-01');

create index test_1_a on test_1 (anb,sd);

insert into test values ('2017-01-01','12345');

select min(sd), max(sd) from test where anb='12345';

The server log file shows:

2017-05-16 09:36:13.243 UTC [1503] LOG:  server process (PID 3474) was
terminated by signal 11: Segmentation fault
2017-05-16 09:36:13.243 UTC [1503] DETAIL:  Failed process was running:
select min(sd), max(sd) from test where anb='12345';
2017-05-16 09:36:13.244 UTC [1503] LOG:  terminating any other active server
processes
2017-05-16 09:36:13.245 UTC [3463] WARNING:  terminating connection because
of crash of another server process

The stack trace is (no debug info):

Program terminated with signal 11, Segmentation fault.
#0  0x000000000061ab1b in list_nth ()
(gdb) bt
#0  0x000000000061ab1b in list_nth ()
#1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
#2  0x00000000005f4d52 in ExecInitMergeAppend ()
#3  0x00000000005e0365 in ExecInitNode ()
#4  0x00000000005f35a7 in ExecInitLimit ()
#5  0x00000000005e00f3 in ExecInitNode ()
#6  0x00000000005dd207 in standard_ExecutorStart ()
#7  0x00000000006f96d2 in PortalStart ()
#8  0x00000000006f5c7f in exec_simple_query ()
#9  0x00000000006f6fac in PostgresMain ()
#10 0x0000000000475cdc in ServerLoop ()
#11 0x0000000000692ffa in PostmasterMain ()
#12 0x0000000000476600 in main ()

Regards,
Sveinn Sveinsson.




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On Wed, May 17, 2017 at 7:41 PM,  <sveinn.sveinsson@gmail.com> wrote:
> (gdb) bt
> #0  0x000000000061ab1b in list_nth ()
> #1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
> #2  0x00000000005f4d52 in ExecInitMergeAppend ()
> #3  0x00000000005e0365 in ExecInitNode ()
> #4  0x00000000005f35a7 in ExecInitLimit ()
> #5  0x00000000005e00f3 in ExecInitNode ()
> #6  0x00000000005dd207 in standard_ExecutorStart ()
> #7  0x00000000006f96d2 in PortalStart ()
> #8  0x00000000006f5c7f in exec_simple_query ()
> #9  0x00000000006f6fac in PostgresMain ()
> #10 0x0000000000475cdc in ServerLoop ()
> #11 0x0000000000692ffa in PostmasterMain ()
> #12 0x0000000000476600 in main ()

Seems like the issue is that the plans under multiple subroots are
pointing to the same partitioned_rels.

If I am not getting it wrong "set_plan_refs(PlannerInfo *root, Plan
*plan, int rtoffset)" the rtoffset is specific to the subroot. Now,
problem is that set_plan_refs called for different subroot is updating
the same partition_rel info and make this value completely wrong which
will ultimately make ExecLockNonLeafAppendTables to access the out of
bound "rte" index.

set_plan_refs
{
[clipped]
case T_MergeAppend:
{
    [clipped]

    foreach(l, splan->partitioned_rels)
    {
         lfirst_int(l) += rtoffset;


I think the solution should be that create_merge_append_path make the
copy of partitioned_rels list?

Attached patch fixes the problem but I am not completely sure about the fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment
On 2017/05/18 2:14, Dilip Kumar wrote:
> On Wed, May 17, 2017 at 7:41 PM,  <sveinn.sveinsson@gmail.com> wrote:
>> (gdb) bt
>> #0  0x000000000061ab1b in list_nth ()
>> #1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
>> #2  0x00000000005f4d52 in ExecInitMergeAppend ()
>> #3  0x00000000005e0365 in ExecInitNode ()
>> #4  0x00000000005f35a7 in ExecInitLimit ()
>> #5  0x00000000005e00f3 in ExecInitNode ()
>> #6  0x00000000005dd207 in standard_ExecutorStart ()
>> #7  0x00000000006f96d2 in PortalStart ()
>> #8  0x00000000006f5c7f in exec_simple_query ()
>> #9  0x00000000006f6fac in PostgresMain ()
>> #10 0x0000000000475cdc in ServerLoop ()
>> #11 0x0000000000692ffa in PostmasterMain ()
>> #12 0x0000000000476600 in main ()

Thanks for the test case Sveinn and thanks Dilip for analyzing.

> Seems like the issue is that the plans under multiple subroots are
> pointing to the same partitioned_rels.

That's correct.

> If I am not getting it wrong "set_plan_refs(PlannerInfo *root, Plan
> *plan, int rtoffset)" the rtoffset is specific to the subroot. Now,
> problem is that set_plan_refs called for different subroot is updating
> the same partition_rel info and make this value completely wrong which
> will ultimately make ExecLockNonLeafAppendTables to access the out of
> bound "rte" index.

Yes.

> set_plan_refs
> {
> [clipped]
> case T_MergeAppend:
> {
>     [clipped]
> 
>     foreach(l, splan->partitioned_rels)
>     {
>          lfirst_int(l) += rtoffset;
> 
> 
> I think the solution should be that create_merge_append_path make the
> copy of partitioned_rels list?

Yes, partitioned_rels should be copied.

> Attached patch fixes the problem but I am not completely sure about the fix.

Thanks for creating the patch, although I think a better fix would be to
make get_partitioned_child_rels() do the list_copy.  That way, any other
users of partitioned_rels will not suffer the same issue.  Attached patch
implements that, along with a regression test.

Added to the open items.

Thanks,
Amit

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Attachment
On 2017/05/18 10:49, Amit Langote wrote:
> On 2017/05/18 2:14, Dilip Kumar wrote:
>> On Wed, May 17, 2017 at 7:41 PM,  <sveinn.sveinsson@gmail.com> wrote:
>>> (gdb) bt
>>> #0  0x000000000061ab1b in list_nth ()
>>> #1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
>>> #2  0x00000000005f4d52 in ExecInitMergeAppend ()
>>> #3  0x00000000005e0365 in ExecInitNode ()
>>> #4  0x00000000005f35a7 in ExecInitLimit ()
>>> #5  0x00000000005e00f3 in ExecInitNode ()
>>> #6  0x00000000005dd207 in standard_ExecutorStart ()
>>> #7  0x00000000006f96d2 in PortalStart ()
>>> #8  0x00000000006f5c7f in exec_simple_query ()
>>> #9  0x00000000006f6fac in PostgresMain ()
>>> #10 0x0000000000475cdc in ServerLoop ()
>>> #11 0x0000000000692ffa in PostmasterMain ()
>>> #12 0x0000000000476600 in main ()
> 
> Thanks for the test case Sveinn and thanks Dilip for analyzing.
> 
>> Seems like the issue is that the plans under multiple subroots are
>> pointing to the same partitioned_rels.
> 
> That's correct.
> 
>> If I am not getting it wrong "set_plan_refs(PlannerInfo *root, Plan
>> *plan, int rtoffset)" the rtoffset is specific to the subroot. Now,
>> problem is that set_plan_refs called for different subroot is updating
>> the same partition_rel info and make this value completely wrong which
>> will ultimately make ExecLockNonLeafAppendTables to access the out of
>> bound "rte" index.
> 
> Yes.
> 
>> set_plan_refs
>> {
>> [clipped]
>> case T_MergeAppend:
>> {
>>     [clipped]
>>
>>     foreach(l, splan->partitioned_rels)
>>     {
>>          lfirst_int(l) += rtoffset;
>>
>>
>> I think the solution should be that create_merge_append_path make the
>> copy of partitioned_rels list?
> 
> Yes, partitioned_rels should be copied.
> 
>> Attached patch fixes the problem but I am not completely sure about the fix.
> 
> Thanks for creating the patch, although I think a better fix would be to
> make get_partitioned_child_rels() do the list_copy.  That way, any other
> users of partitioned_rels will not suffer the same issue.  Attached patch
> implements that, along with a regression test.
> 
> Added to the open items.

Oops, forgot to cc -hackers.  Patch attached again.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Thu, May 18, 2017 at 7:19 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks for creating the patch, although I think a better fix would be to
> make get_partitioned_child_rels() do the list_copy.  That way, any other
> users of partitioned_rels will not suffer the same issue.  Attached patch
> implements that, along with a regression test.

Correct! This is generic fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

The patch fixed the problem, thanks a lot.
Regards,
Sveinn.

On fim 18.maí 2017 01:53, Amit Langote wrote:
> On 2017/05/18 10:49, Amit Langote wrote:
>> On 2017/05/18 2:14, Dilip Kumar wrote:
>>> On Wed, May 17, 2017 at 7:41 PM,  <sveinn.sveinsson@gmail.com> wrote:
>>>> (gdb) bt
>>>> #0  0x000000000061ab1b in list_nth ()
>>>> #1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
>>>> #2  0x00000000005f4d52 in ExecInitMergeAppend ()
>>>> #3  0x00000000005e0365 in ExecInitNode ()
>>>> #4  0x00000000005f35a7 in ExecInitLimit ()
>>>> #5  0x00000000005e00f3 in ExecInitNode ()
>>>> #6  0x00000000005dd207 in standard_ExecutorStart ()
>>>> #7  0x00000000006f96d2 in PortalStart ()
>>>> #8  0x00000000006f5c7f in exec_simple_query ()
>>>> #9  0x00000000006f6fac in PostgresMain ()
>>>> #10 0x0000000000475cdc in ServerLoop ()
>>>> #11 0x0000000000692ffa in PostmasterMain ()
>>>> #12 0x0000000000476600 in main ()
>> Thanks for the test case Sveinn and thanks Dilip for analyzing.
>>
>>> Seems like the issue is that the plans under multiple subroots are
>>> pointing to the same partitioned_rels.
>> That's correct.
>>
>>> If I am not getting it wrong "set_plan_refs(PlannerInfo *root, Plan
>>> *plan, int rtoffset)" the rtoffset is specific to the subroot. Now,
>>> problem is that set_plan_refs called for different subroot is updating
>>> the same partition_rel info and make this value completely wrong which
>>> will ultimately make ExecLockNonLeafAppendTables to access the out of
>>> bound "rte" index.
>> Yes.
>>
>>> set_plan_refs
>>> {
>>> [clipped]
>>> case T_MergeAppend:
>>> {
>>>     [clipped]
>>>
>>>     foreach(l, splan->partitioned_rels)
>>>     {
>>>          lfirst_int(l) += rtoffset;
>>>
>>>
>>> I think the solution should be that create_merge_append_path make the
>>> copy of partitioned_rels list?
>> Yes, partitioned_rels should be copied.
>>
>>> Attached patch fixes the problem but I am not completely sure about the fix.
>> Thanks for creating the patch, although I think a better fix would be to
>> make get_partitioned_child_rels() do the list_copy.  That way, any other
>> users of partitioned_rels will not suffer the same issue.  Attached patch
>> implements that, along with a regression test.
>>
>> Added to the open items.
> Oops, forgot to cc -hackers.  Patch attached again.
>
> Thanks,
> Amit




On Thu, May 18, 2017 at 7:23 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/05/18 10:49, Amit Langote wrote:
>> On 2017/05/18 2:14, Dilip Kumar wrote:
>>> On Wed, May 17, 2017 at 7:41 PM,  <sveinn.sveinsson@gmail.com> wrote:
>>>> (gdb) bt
>>>> #0  0x000000000061ab1b in list_nth ()
>>>> #1  0x00000000005e4081 in ExecLockNonLeafAppendTables ()
>>>> #2  0x00000000005f4d52 in ExecInitMergeAppend ()
>>>> #3  0x00000000005e0365 in ExecInitNode ()
>>>> #4  0x00000000005f35a7 in ExecInitLimit ()
>>>> #5  0x00000000005e00f3 in ExecInitNode ()
>>>> #6  0x00000000005dd207 in standard_ExecutorStart ()
>>>> #7  0x00000000006f96d2 in PortalStart ()
>>>> #8  0x00000000006f5c7f in exec_simple_query ()
>>>> #9  0x00000000006f6fac in PostgresMain ()
>>>> #10 0x0000000000475cdc in ServerLoop ()
>>>> #11 0x0000000000692ffa in PostmasterMain ()
>>>> #12 0x0000000000476600 in main ()
>>
>> Thanks for the test case Sveinn and thanks Dilip for analyzing.
>>
>>> Seems like the issue is that the plans under multiple subroots are
>>> pointing to the same partitioned_rels.
>>
>> That's correct.
>>
>>> If I am not getting it wrong "set_plan_refs(PlannerInfo *root, Plan
>>> *plan, int rtoffset)" the rtoffset is specific to the subroot. Now,
>>> problem is that set_plan_refs called for different subroot is updating
>>> the same partition_rel info and make this value completely wrong which
>>> will ultimately make ExecLockNonLeafAppendTables to access the out of
>>> bound "rte" index.
>>
>> Yes.
>>
>>> set_plan_refs
>>> {
>>> [clipped]
>>> case T_MergeAppend:
>>> {
>>>     [clipped]
>>>
>>>     foreach(l, splan->partitioned_rels)
>>>     {
>>>          lfirst_int(l) += rtoffset;
>>>
>>>
>>> I think the solution should be that create_merge_append_path make the
>>> copy of partitioned_rels list?
>>
>> Yes, partitioned_rels should be copied.
>>
>>> Attached patch fixes the problem but I am not completely sure about the fix.
>>
>> Thanks for creating the patch, although I think a better fix would be to
>> make get_partitioned_child_rels() do the list_copy.  That way, any other
>> users of partitioned_rels will not suffer the same issue.  Attached patch
>> implements that, along with a regression test.
>>
>> Added to the open items.
>
> Oops, forgot to cc -hackers.  Patch attached again.

May be we should add a comment as to why the copy is needed.

We still have the same copy shared across multiple append paths and
set_plan_refs would change change it underneath those. May not be a
problem right now but may be a problem in the future. Another option,
which consumes a bit less memory is to make a copy at the time of
planning if the path gets selected as the cheapest path.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



On Fri, May 19, 2017 at 6:07 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> We still have the same copy shared across multiple append paths and
> set_plan_refs would change change it underneath those. May not be a
> problem right now but may be a problem in the future.

I agree.  I think it's better for the path-creation functions to copy
the list, so that there is no surprising sharing of substructure.
set_plan_refs() obviously expects this data to be unshared, and this
seems like the best way to ensure that's true in all cases.

Committed that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company