Thread: [HACKERS] Event triggers + table partitioning cause server crash in current master

Hackers,

I discovered a reproducible crash using event triggers in the current
development version, 29c7d5e4844443acaa74a0d06dd6c70b320bb315.
I was getting a crash before this version, and cloned a fresh copy of
the sources to be sure I was up to date, so I don't think the bug can be
attributed to Andres' commit.  (The prior version I was testing against
was heavily modified by me, so I recreated the bug using the latest
standard, unmodified sources.)

I create both before and after event triggers early in the regression test
schedule, which then fire here and there during the following tests, leading
fairly reproducibly to the server crashing somewhere during the test suite.
These crashes do not happen for me without the event triggers being added
to the tests.  Many tests show as 'FAILED' simply because the logging
that happens in the event triggers creates unexpected output for the test.
Those "failures" are expected.  The server crashes are not.

The server logs suggest the crashes might be related to partitioned tables.

Please find attached the patch that includes my changes to the sources
for recreating this bug.  The logs and regression.diffs are a bit large; let
me know if you need them.

I built using the command

./configure --enable-cassert --enable-tap-tests && make -j4 && make check

Mark Dilger


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On 2017/05/14 12:03, Mark Dilger wrote:
> Hackers,
> 
> I discovered a reproducible crash using event triggers in the current
> development version, 29c7d5e4844443acaa74a0d06dd6c70b320bb315.
> I was getting a crash before this version, and cloned a fresh copy of
> the sources to be sure I was up to date, so I don't think the bug can be
> attributed to Andres' commit.  (The prior version I was testing against
> was heavily modified by me, so I recreated the bug using the latest
> standard, unmodified sources.)
> 
> I create both before and after event triggers early in the regression test
> schedule, which then fire here and there during the following tests, leading
> fairly reproducibly to the server crashing somewhere during the test suite.
> These crashes do not happen for me without the event triggers being added
> to the tests.  Many tests show as 'FAILED' simply because the logging
> that happens in the event triggers creates unexpected output for the test.
> Those "failures" are expected.  The server crashes are not.
> 
> The server logs suggest the crashes might be related to partitioned tables.
> 
> Please find attached the patch that includes my changes to the sources
> for recreating this bug.  The logs and regression.diffs are a bit large; let
> me know if you need them.
> 
> I built using the command
> 
> ./configure --enable-cassert --enable-tap-tests && make -j4 && make check

Thanks for the report and providing steps to reproduce.

It seems that it is indeed a bug related to creating range-partitioned
tables.  DefineRelation() calls AlterTableInternal() to add NOT NULL
constraints on the range partition key columns, but the code fails to
first initialize the event trigger context information.  Attached patch
should fix that.

Thanks to the above test case, I also discovered that in the case of
creating a partition, manipulations performed by MergeAttributes() on the
input schema list may cause it to become invalid, that is, the List
metadata (length) will no longer match the reality, because while the
ListCells are deleted from the input list, the List pointer passed to
list_delete_cell does not point to the same list.  This caused a crash
when the CreateStmt in question was subsequently passed to copyObject,
which tried to access CreateStmt.tableElts that has become invalid as just
described.  The attached patch also takes care of that.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
> On May 14, 2017, at 11:02 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2017/05/14 12:03, Mark Dilger wrote:
>> Hackers,
>>
>> I discovered a reproducible crash using event triggers in the current
>> development version, 29c7d5e4844443acaa74a0d06dd6c70b320bb315.
>> I was getting a crash before this version, and cloned a fresh copy of
>> the sources to be sure I was up to date, so I don't think the bug can be
>> attributed to Andres' commit.  (The prior version I was testing against
>> was heavily modified by me, so I recreated the bug using the latest
>> standard, unmodified sources.)
>>
>> I create both before and after event triggers early in the regression test
>> schedule, which then fire here and there during the following tests, leading
>> fairly reproducibly to the server crashing somewhere during the test suite.
>> These crashes do not happen for me without the event triggers being added
>> to the tests.  Many tests show as 'FAILED' simply because the logging
>> that happens in the event triggers creates unexpected output for the test.
>> Those "failures" are expected.  The server crashes are not.
>>
>> The server logs suggest the crashes might be related to partitioned tables.
>>
>> Please find attached the patch that includes my changes to the sources
>> for recreating this bug.  The logs and regression.diffs are a bit large; let
>> me know if you need them.
>>
>> I built using the command
>>
>> ./configure --enable-cassert --enable-tap-tests && make -j4 && make check
>
> Thanks for the report and providing steps to reproduce.
>
> It seems that it is indeed a bug related to creating range-partitioned
> tables.  DefineRelation() calls AlterTableInternal() to add NOT NULL
> constraints on the range partition key columns, but the code fails to
> first initialize the event trigger context information.  Attached patch
> should fix that.
>
> Thanks to the above test case, I also discovered that in the case of
> creating a partition, manipulations performed by MergeAttributes() on the
> input schema list may cause it to become invalid, that is, the List
> metadata (length) will no longer match the reality, because while the
> ListCells are deleted from the input list, the List pointer passed to
> list_delete_cell does not point to the same list.  This caused a crash
> when the CreateStmt in question was subsequently passed to copyObject,
> which tried to access CreateStmt.tableElts that has become invalid as just
> described.  The attached patch also takes care of that.

I can confirm that this fixes the crash that I was seeing.  I have read
through the patch briefly, but will give it a more thorough review in the
next few hours.

Many thanks for your attention on this!

Mark Dilger


> On May 15, 2017, at 6:49 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
> 
> I can confirm that this fixes the crash that I was seeing.  I have read
> through the patch briefly, but will give it a more thorough review in the
> next few hours.

My only negative comment is that your patch follows a precedent of using
event trigger commands named for AlterTable for operations other than
an ALTER TABLE command.  You clearly are not starting the precedent
here, as they are already used for IndexStmt and ViewStmt processing, but
expanding the precedent by using them in DefineRelation, where they were
not used before, seems to move in the wrong direction.  Either the functions
should be renamed to something more general to avoid implying that the
function is ALTER TABLE specific, or different functions should be defined
and used, or ...?  I am uncertain what the right solution would be.

Thoughts?

Mark Dilger



On 2017/05/16 1:18, Mark Dilger wrote:
> 
>> On May 15, 2017, at 6:49 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
>>
>> I can confirm that this fixes the crash that I was seeing.  I have read
>> through the patch briefly, but will give it a more thorough review in the
>> next few hours.

Thanks a lot for the review.

> My only negative comment is that your patch follows a precedent of using
> event trigger commands named for AlterTable for operations other than
> an ALTER TABLE command.  You clearly are not starting the precedent
> here, as they are already used for IndexStmt and ViewStmt processing, but
> expanding the precedent by using them in DefineRelation, where they were
> not used before, seems to move in the wrong direction.  Either the functions
> should be renamed to something more general to avoid implying that the
> function is ALTER TABLE specific, or different functions should be defined
> and used, or ...?  I am uncertain what the right solution would be.

Hmm.  I think an alternative to doing what I did in my patch is to get rid
of calling AlterTableInternal() from DefineRelation() altogether.
Instead, the required ALTER TABLE commands can be added during the
transform phase, which will create a new AlterTableStmt and add it to the
list of things to be done after the relation is defined.  That way,
ProcessUtilitySlow() takes care of doing the event trigger stuff instead
of having to worry about it ourselves.  Attached updated patch does that.

PS: actually, we're talking elsewhere [1] of getting rid of adding
explicit NOT NULL constraints on range partition keys altogether, so even
if we apply this patch to fix the bug, there is a good chance that the
code will go away (of course, the bug won't exist too)

I divided the patch into 2 parts.

0001 fixes the bug you reported (not so directly though as my previous
     patch did)

0002 fixes the bug I found while working on this, whereby the content of
     CreateStmt will become invalid when creating partitions which could
     cause crash in certain case

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/8e2dd63d-c6fb-bb74-3c2b-ed6d63629c9d%40lab.ntt.co.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
> On May 15, 2017, at 9:37 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> On 2017/05/16 1:18, Mark Dilger wrote:
>>
>>> On May 15, 2017, at 6:49 AM, Mark Dilger <hornschnorter@gmail.com> wrote:
>>>
>>> I can confirm that this fixes the crash that I was seeing.  I have read
>>> through the patch briefly, but will give it a more thorough review in the
>>> next few hours.
>
> Thanks a lot for the review.
>
>> My only negative comment is that your patch follows a precedent of using
>> event trigger commands named for AlterTable for operations other than
>> an ALTER TABLE command.  You clearly are not starting the precedent
>> here, as they are already used for IndexStmt and ViewStmt processing, but
>> expanding the precedent by using them in DefineRelation, where they were
>> not used before, seems to move in the wrong direction.  Either the functions
>> should be renamed to something more general to avoid implying that the
>> function is ALTER TABLE specific, or different functions should be defined
>> and used, or ...?  I am uncertain what the right solution would be.
>
> Hmm.  I think an alternative to doing what I did in my patch is to get rid
> of calling AlterTableInternal() from DefineRelation() altogether.
> Instead, the required ALTER TABLE commands can be added during the
> transform phase, which will create a new AlterTableStmt and add it to the
> list of things to be done after the relation is defined.  That way,
> ProcessUtilitySlow() takes care of doing the event trigger stuff instead
> of having to worry about it ourselves.  Attached updated patch does that.
>
> PS: actually, we're talking elsewhere [1] of getting rid of adding
> explicit NOT NULL constraints on range partition keys altogether, so even
> if we apply this patch to fix the bug, there is a good chance that the
> code will go away (of course, the bug won't exist too)
>
> I divided the patch into 2 parts.
>
> 0001 fixes the bug you reported (not so directly though as my previous
>     patch did)
>
> 0002 fixes the bug I found while working on this, whereby the content of
>     CreateStmt will become invalid when creating partitions which could
>     cause crash in certain case

Thanks for your continued efforts.  It is passed midnight here, so I think I will
save reviewing your new patches until sometime tomorrow morning.

Since we are passed feature freeze, one option is to keep the patch simple
and not do more than you did in your first patch, though until I review the new
patch, I don't know for sure.

Mark Dilger


On Mon, May 15, 2017 at 03:02:54PM +0900, Amit Langote wrote:
> On 2017/05/14 12:03, Mark Dilger wrote:
> > I discovered a reproducible crash using event triggers in the current
> > development version, 29c7d5e4844443acaa74a0d06dd6c70b320bb315.
> > I was getting a crash before this version, and cloned a fresh copy of
> > the sources to be sure I was up to date, so I don't think the bug can be
> > attributed to Andres' commit.  (The prior version I was testing against
> > was heavily modified by me, so I recreated the bug using the latest
> > standard, unmodified sources.)
> > 
> > I create both before and after event triggers early in the regression test
> > schedule, which then fire here and there during the following tests, leading
> > fairly reproducibly to the server crashing somewhere during the test suite.
> > These crashes do not happen for me without the event triggers being added
> > to the tests.  Many tests show as 'FAILED' simply because the logging
> > that happens in the event triggers creates unexpected output for the test.
> > Those "failures" are expected.  The server crashes are not.
> > 
> > The server logs suggest the crashes might be related to partitioned tables.
> > 
> > Please find attached the patch that includes my changes to the sources
> > for recreating this bug.  The logs and regression.diffs are a bit large; let
> > me know if you need them.
> > 
> > I built using the command
> > 
> > ./configure --enable-cassert --enable-tap-tests && make -j4 && make check
> 
> Thanks for the report and providing steps to reproduce.
> 
> It seems that it is indeed a bug related to creating range-partitioned
> tables.  DefineRelation() calls AlterTableInternal() to add NOT NULL
> constraints on the range partition key columns, but the code fails to
> first initialize the event trigger context information.  Attached patch
> should fix that.
> 
> Thanks to the above test case, I also discovered that in the case of
> creating a partition, manipulations performed by MergeAttributes() on the
> input schema list may cause it to become invalid, that is, the List
> metadata (length) will no longer match the reality, because while the
> ListCells are deleted from the input list, the List pointer passed to
> list_delete_cell does not point to the same list.  This caused a crash
> when the CreateStmt in question was subsequently passed to copyObject,
> which tried to access CreateStmt.tableElts that has become invalid as just
> described.  The attached patch also takes care of that.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



On Wed, May 17, 2017 at 5:34 PM, Noah Misch <noah@leadboat.com> wrote:
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.

I believe that the originally-reported crash has been fixed by
3ec76ff1f2cf52e9b900349957b42d28128b7bc7.

I have pushed ac8d7e1b834e252c9aa8d5750f70a86ca74228b8 to fix the
other issue which Amit spotted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company