Thread: COPY FROM crash

COPY FROM crash

From

Zhang Mingli

Date:

30 July 2024, 03:50:54

Hi, all

I got a crash when copy partition tables with mass data in Cloudberry DB[0](based on Postgres14.4, Greenplum 7).

I have a test on Postgres and it has the similar issue(different places but same function).

However it’s a little hard to reproduce because it happened when inserting next tuple after a previous copy multi insert buffer is flushed.

To reproduce easily, change the Macros to:

#define MAX_BUFFERED_TUPLES 1
#define MAX_PARTITION_BUFFERS 0

Config and make install, when initdb, a core dump will be as:

#0 0x000055de617211b9 in CopyMultiInsertInfoNextFreeSlot (miinfo=0x7ffce496d360, rri=0x55de6368ba88)
at copyfrom.c:592
#1 0x000055de61721ff1 in CopyFrom (cstate=0x55de63592ce8) at copyfrom.c:985
#2 0x000055de6171dd86 in DoCopy (pstate=0x55de63589e00, stmt=0x55de635347d8, stmt_location=0, stmt_len=195,
processed=0x7ffce496d590) at copy.c:306
#3 0x000055de61ad7ce8 in standard_ProcessUtility (pstmt=0x55de635348a8,
queryString=0x55de63533960 "COPY information_schema.sql_features (feature_id, feature_name, sub_feature_id, sub
_feature_name, is_supported, comments) FROM E'/home/gpadmin/install/pg17/share/postgresql/sql_features.txt';\n",
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55de620b0ce0 <debugtupDR>,
qc=0x7ffce496d910) at utility.c:735
#4 0x000055de61ad7614 in ProcessUtility (pstmt=0x55de635348a8,
queryString=0x55de63533960 "COPY information_schema.sql_features (feature_id, feature_name, sub_feature_id, sub
_feature_name, is_supported, comments) FROM E'/home/gpadmin/install/pg17/share/postgresql/sql_features.txt';\n",
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x55de620b0ce0 <debugtupDR>,
qc=0x7ffce496d910) at utility.c:523
#5 0x000055de61ad5e8f in PortalRunUtility (portal=0x55de633dd7a0, pstmt=0x55de635348a8, isTopLevel=true,
setHoldSnapshot=false, dest=0x55de620b0ce0 <debugtupDR>, qc=0x7ffce496d910) at pquery.c:1158
#6 0x000055de61ad6106 in PortalRunMulti (portal=0x55de633dd7a0, isTopLevel=true, setHoldSnapshot=false,
dest=0x55de620b0ce0 <debugtupDR>, altdest=0x55de620b0ce0 <debugtupDR>, qc=0x7ffce496d910) at pquery.c:1315
#7 0x000055de61ad5550 in PortalRun (portal=0x55de633dd7a0, count=9223372036854775807, isTopLevel=true,
run_once=true, dest=0x55de620b0ce0 <debugtupDR>, altdest=0x55de620b0ce0 <debugtupDR>, qc=0x7ffce496d910)
at pquery.c:791```

The root cause is: we may call CopyMultiInsertInfoFlush() to flush buffer during COPY tuples, ex: insert from next tuple,
CopyMultiInsertInfoNextFreeSlot() will get a crash due to null pointer of buffer.

To fix it: instead of call CopyMultiInsertInfoSetupBuffer() outside, I put it into CopyMultiInsertInfoNextFreeSlot() to avoid such issues.

[0] https://github.com/cloudberrydb/cloudberrydb

Zhang Mingli

www.hashdata.xyz

Attachment

v0-0001-Fix-COPY-FROM-crash-due-to-buffer-flush.patch

Re: COPY FROM crash

From

David Rowley

Date:

30 July 2024, 05:35:36

On Tue, 30 Jul 2024 at 15:52, Zhang Mingli <zmlpostgres@gmail.com> wrote:
> I have a test on Postgres and it has the similar issue(different places but same function).
>
> However it’s a little hard to reproduce because it happened when inserting next tuple after a previous copy multi
insertbuffer is flushed. 
>
> To reproduce easily, change the Macros to:
>
> #define MAX_BUFFERED_TUPLES 1
> #define MAX_PARTITION_BUFFERS 0

I think you're going to need to demonstrate to us there's an actual
PostgreSQL bug here with a test case that causes a crash without
changing the above definitions.

It seems to me that it's not valid to set MAX_PARTITION_BUFFERS to
anything less than 2 due to the code inside
CopyMultiInsertInfoFlush(). If we find the CopyMultiInsertBuffer for
'curr_rri' then that code would misbehave if the list only contained a
single CopyMultiInsertBuffer due to the expectation there's another
item in the list after the list_delete_first().  If you're only able
to get it to misbehave by setting MAX_PARTITION_BUFFERS to less than
2, then my suggested fix would be to add a comment to say that values
less than to are not supported.

David

Re: COPY FROM crash

From

Kirill Reshke

Date:

30 July 2024, 05:37:11

Hi!

On Tue, 30 Jul 2024 at 08:52, Zhang Mingli <zmlpostgres@gmail.com> wrote:
>
> Hi, all
>
> I got a crash when copy partition tables with mass data in Cloudberry DB[0](based on Postgres14.4, Greenplum 7).
>
> I have a test on Postgres and it has the similar issue(different places but same function).

Just to be clear, you are facing this on HEAD, on on REL_14_STABLE?


> However it’s a little hard to reproduce because it happened when inserting next tuple after a previous copy multi
insertbuffer is flushed. 
>
> To reproduce easily, change the Macros to:
>
> #define MAX_BUFFERED_TUPLES 1
> #define MAX_PARTITION_BUFFERS 0

This way it's harder to believe that the problem persists with the
original settings. Are these values valid?

Re: COPY FROM crash

From

Zhang Mingli

Date:

30 July 2024, 07:06:01

Hi,

Zhang Mingli

www.hashdata.xyz

On Jul 30, 2024 at 13:35 +0800, David Rowley <dgrowleyml@gmail.com>, wrote:

If you're only able
to get it to misbehave by setting MAX_PARTITION_BUFFERS to less than
2, then my suggested fix would be to add a comment to say that values
less than to are not supported.

Right.
On Postgres this crash could only happen if it is set to zero.
I’ve updated the comments in patch v1 with an additional Assert.

Attachment

v1-0001-Add-comments-for-MAX_PARTITION_BUFFERS-to-avoid-misb.patch

Re: COPY FROM crash

From

Zhang Mingli

Date:

30 July 2024, 07:07:10

Hi,

Zhang Mingli

www.hashdata.xyz

On Jul 30, 2024 at 13:37 +0800, Kirill Reshke <reshkekirill@gmail.com>, wrote:

Just to be clear, you are facing this on HEAD, on on REL_14_STABLE?

Postgres HEAD.

Re: COPY FROM crash

From

David Rowley

Date:

30 July 2024, 08:22:42

On Tue, 30 Jul 2024 at 19:06, Zhang Mingli <zmlpostgres@gmail.com> wrote:
> I’ve updated the comments in patch v1 with an additional Assert.

Thanks. I adjusted a little and used a StaticAssert instead then pushed.

StaticAssert seems better as invalid values will result in compilation failure.

David