Thread: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18055
Logged by:          ocean li
Email address:      ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system:   centos7 5.10.84 x86_64
Description:

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that: 
==>
#0  0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1  0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2  0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58
"!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c",
lineNumber=381) at assert.c:46
#3  0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98,
xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at
snapbuild.c:381
#4  0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0,
start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false,
fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>,
prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24
<LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5  0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216,
output_plugin_options=0x0, fast_forward=false, read_page=0xe53023
<logical_read_local_xlog_page>, prepare_write=0xe52df6
<LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>,
update_progress=0x0) at logical.c:486
#6  0x0000000000e53735 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7  0x0000000000e53b1c in pg_logical_slot_get_changes
(fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8  0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0)
at fmgr.c:2262
...
==>
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of  cores. 

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout*  logged
before each core occurred. (For some reason, I can't provide the information
of log)


Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
ocean_li_996
Date:
font{ line-height: 1.6; } ul,ol{ padding-left: 20px; list-style-position: inside; }
By applying the test.patch in attachment, I got similar core stacks following the procedure:
==>
postgres=# alter system set wal_level to 'logical';
-- restart server
postgres=# set statement_timeout to 1000;
SET
postgres=# select pg_create_logical_replication_slot('test', 'test_decoding', 0);
WARNING:  NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
pg_create_logical_replication_slot
------------------------------------
(test,0/423999D8)
(1 row)
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING:  NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
ERROR:  canceling statement due to statement timeout
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING:  NInitialRunningXacts is 3 , InitialRunningXacts is 0x373bc18
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
==>

And core stacks got above are
==>
(gdb) bt
#0  0x00007fbf2c987277 in raise () from /lib64/libc.so.6
#1  0x00007fbf2c988968 in abort () from /lib64/libc.so.6
#2  0x0000000000deb38e in ExceptionalCondition (
   conditionName=0x11f5d70 "!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
   errorType=0x11f5d60 "FailedAssertion", fileName=0x11f5d00 "snapbuild.c", lineNumber=385)
   at assert.c:46
#3  0x0000000000b5d7d2 in AllocateSnapshotBuilder (reorder=0x35795d8, xmin_horizon=0,
   start_lsn=1111114144, need_full_snapshot=false) at snapbuild.c:385
#4  0x0000000000b4d4c2 in StartupDecodingContext (output_plugin_options=0x0, start_lsn=1111114144,
   xmin_horizon=0, need_full_snapshot=false, fast_forward=false,
   read_page=0xb4f5a5 ,
   prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
   update_progress=0x0) at logical.c:191
#5  0x0000000000b4de0a in CreateDecodingContext (start_lsn=1111114144, output_plugin_options=0x0,
   fast_forward=false, read_page=0xb4f5a5 ,
   prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
   update_progress=0x0) at logical.c:486
#6  0x0000000000b4fcb7 in pg_logical_slot_get_changes_guts (fcinfo=0x344fbe8, confirm=true, binary=false)
   at logicalfuncs.c:260
#7  0x0000000000b500b1 in pg_logical_slot_get_changes (fcinfo=0x344fbe8) at logicalfuncs.c:395
==>

In my view, this bug can be fix by using fix.patch in attachment.

Regards.

---- Replied Message ----
From PG Bug reporting form<noreply@postgresql.org>
Date 8/15/2023 00:05
To <pgsql-bugs@lists.postgresql.org>
Cc <ocean_li_996@163.com>
Subject BUG #18055: logical decoding core on AllocateSnapshotBuilder()
The following bug has been logged on the website:

Bug reference:      18055
Logged by:          ocean li
Email address:      ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system:   centos7 5.10.84 x86_64
Description:        

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
==>
#0  0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1  0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2  0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58
"!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c",
lineNumber=381) at assert.c:46
#3  0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98,
xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at
snapbuild.c:381
#4  0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0,
start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false,
fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>,
prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24
<LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5  0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216,
output_plugin_options=0x0, fast_forward=false, read_page=0xe53023
<logical_read_local_xlog_page>, prepare_write=0xe52df6
<LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>,
update_progress=0x0) at logical.c:486
#6  0x0000000000e53735 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7  0x0000000000e53b1c in pg_logical_slot_get_changes
(fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8  0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0)
at fmgr.c:2262
...
==>
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of  cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout*  logged
before each core occurred. (For some reason, I can't provide the information
of log)

Attachment

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Mon, Aug 14, 2023 at 11:06 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
> the core, as far as i know, maybe caused by this way: an ERROR raised when
> calling *pg_logical_slot_get_changes_guts* function. The code part of
> PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> Unfortunately, I couldn't find a minimal reproduction case. However, I
> observed an *ERROR: canceling statement due to statement timeout*  logged
> before each core occurred. (For some reason, I can't provide the information
> of log)
>

Thanks for the report. I think your analysis is correct and I see that
you have proposed a patch in the next email. I'll look into it and
share my feedback.

--
With Regards,
Amit Kapila.



RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> 
> The following bug has been logged on the website:
> 
> Bug reference:      18055
> Logged by:          ocean li
> Email address:      ocean_li_996@163.com
> PostgreSQL version: 11.9
> Operating system:   centos7 5.10.84 x86_64
> Description:
> 
> For testing logical decoding module, *pg_logical_slot_get_changes* function
> is used. Sometimes i got an core whose stack was like that:
> ...
> And in level #3 of stack above, NInitialRunningXacts is 2 and
> InitialRunningXacts is not NULL observed in one of  cores.
> 
> Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> core, as far as i know, maybe caused by this way: an ERROR raised when calling
> *pg_logical_slot_get_changes_guts* function. The code part of
> PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> Unfortunately, I couldn't find a minimal reproduction case. However, I
> observed an *ERROR: canceling statement due to statement timeout*  logged
> before each core occurred. (For some reason, I can't provide the information of
> log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).

About the patch:

---
        /* free context, call shutdown callback */
        FreeDecodingContext(ctx);

        ReplicationSlotRelease();
        InvalidateSystemCaches();
    }
     PG_CATCH();
     {
+
+        /* free context, call shutdown callback */
+        FreeDecodingContext(ctx);
+
+        ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.

Just share one small patch based on your fix for reference, feel free to update it.

[1]
Session 1:
DROP TABLE IF EXISTS tbl1;
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
BEGIN;
SAVEPOINT sp1;
INSERT INTO tbl1 VALUES (1);
    Session 2:
    CHECKPOINT;
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
Session 1:
TRUNCATE tbl1;
COMMIT;
BEGIN;
INSERT INTO tbl1 VALUES (1);
    Session 2:
    CHECKPOINT;
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
Session 1:
COMMIT;

    Session 2:
    -- Use gdb or modify the code to make the first function call fail.
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');

    -- Since the previous command failed and the NInitialRunningXacts was not reset, we should see the Assert() failure
thistime.
 
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');

Best Regards,
Hou zj

Attachment

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference:      18055
> > Logged by:          ocean li
> > Email address:      ocean_li_996@163.com
> > PostgreSQL version: 11.9
> > Operating system:   centos7 5.10.84 x86_64
> > Description:
> >
> > For testing logical decoding module, *pg_logical_slot_get_changes* function
> > is used. Sometimes i got an core whose stack was like that:
> > ...
> > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > InitialRunningXacts is not NULL observed in one of  cores.
> >
> > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> > core, as far as i know, maybe caused by this way: an ERROR raised when calling
> > *pg_logical_slot_get_changes_guts* function. The code part of
> > PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> > Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> > Unfortunately, I couldn't find a minimal reproduction case. However, I
> > observed an *ERROR: canceling statement due to statement timeout*  logged
> > before each core occurred. (For some reason, I can't provide the information of
> > log)
>
> Thanks for the report and the fix!
>
> I can also reproduce by the steps[1] in PG15~11(Note that we need to change
> LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
> records).
>
> About the patch:
>
> ---
>                 /* free context, call shutdown callback */
>                 FreeDecodingContext(ctx);
>
>                 ReplicationSlotRelease();
>                 InvalidateSystemCaches();
>         }
>         PG_CATCH();
>         {
> +
> +               /* free context, call shutdown callback */
> +               FreeDecodingContext(ctx);
> +
> +               ReplicationSlotRelease();
>
> I think we could not directly call the cleanup functions here. There could be
> two risks:
> 1) it's possible either 'ctx' hasn't been initialized before the error happen,
> we don't need to clean it up in this case
> 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
> before entering the catch() block which means we will double cleanup(free) it.
>
> To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
> in it and do the null check for 'ctx' before cleaning up.
>
> Just share one small patch based on your fix for reference, feel free to update it.
>

- LogicalDecodingContext *ctx;
+ LogicalDecodingContext *ctx = NULL;

Don't we need to use volatile for ctx? See the following comment in
elog.h: "Note: if a local variable of the function containing PG_TRY
is modified in the PG_TRY section and used in the PG_CATCH section,
that variable must be declared "volatile" for POSIX compliance."?

--
With Regards,
Amit Kapila.



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> > >
> > > The following bug has been logged on the website:
> > >
> > > Bug reference:      18055
> > > Logged by:          ocean li
> > > Email address:      ocean_li_996@163.com
> > > PostgreSQL version: 11.9
> > > Operating system:   centos7 5.10.84 x86_64
> > > Description:
> > >
> > > For testing logical decoding module, *pg_logical_slot_get_changes* function
> > > is used. Sometimes i got an core whose stack was like that:
> > > ...
> > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > InitialRunningXacts is not NULL observed in one of  cores.
> > >
> > > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> > > core, as far as i know, maybe caused by this way: an ERROR raised when calling
> > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> > > Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> > > Unfortunately, I couldn't find a minimal reproduction case. However, I
> > > observed an *ERROR: canceling statement due to statement timeout*  logged
> > > before each core occurred. (For some reason, I can't provide the information of
> > > log)
> >
> > Thanks for the report and the fix!
> >
> > I can also reproduce by the steps[1] in PG15~11(Note that we need to change
> > LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
> > records).
> >
> > About the patch:
> >
> > ---
> >                 /* free context, call shutdown callback */
> >                 FreeDecodingContext(ctx);
> >
> >                 ReplicationSlotRelease();
> >                 InvalidateSystemCaches();
> >         }
> >         PG_CATCH();
> >         {
> > +
> > +               /* free context, call shutdown callback */
> > +               FreeDecodingContext(ctx);
> > +
> > +               ReplicationSlotRelease();
> >
> > I think we could not directly call the cleanup functions here. There could be
> > two risks:
> > 1) it's possible either 'ctx' hasn't been initialized before the error happen,
> > we don't need to clean it up in this case
> > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
> > before entering the catch() block which means we will double cleanup(free) it.
> >
> > To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
> > in it and do the null check for 'ctx' before cleaning up.
> >
> > Just share one small patch based on your fix for reference, feel free to update it.
> >
>
> - LogicalDecodingContext *ctx;
> + LogicalDecodingContext *ctx = NULL;
>
> Don't we need to use volatile for ctx? See the following comment in
> elog.h: "Note: if a local variable of the function containing PG_TRY
> is modified in the PG_TRY section and used in the PG_CATCH section,
> that variable must be declared "volatile" for POSIX compliance."?
>

BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

--
With Regards,
Amit Kapila.



RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > > >
> > > > The following bug has been logged on the website:
> > > >
> > > > Bug reference:      18055
> > > > Logged by:          ocean li
> > > > Email address:      ocean_li_996@163.com
> > > > PostgreSQL version: 11.9
> > > > Operating system:   centos7 5.10.84 x86_64
> > > > Description:
> > > >
> > > > For testing logical decoding module, *pg_logical_slot_get_changes*
> > > > function is used. Sometimes i got an core whose stack was like that:
> > > > ...
> > > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > > InitialRunningXacts is not NULL observed in one of  cores.
> > > >
> > > > Using of NInitialRunningXacts and InitialRunningXacts are clear.
> > > > Currently, the core, as far as i know, maybe caused by this way:
> > > > an ERROR raised when calling
> > > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > > PG_CATCH() doses not reset  NInitialRunningXacts and
> InitialRunningXacts.
> > > > Then, calling pg_logical_slot_get_changes_guts again, the core may
> occur.
> > > > Unfortunately, I couldn't find a minimal reproduction case.
> > > > However, I observed an *ERROR: canceling statement due to
> > > > statement timeout*  logged before each core occurred. (For some
> > > > reason, I can't provide the information of
> > > > log)
> > >
> > > Thanks for the report and the fix!
> > >
> > > I can also reproduce by the steps[1] in PG15~11(Note that we need to
> > > change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra
> > > running xacts wal records).
> > >
> > > About the patch:
> > >
> > > ---
> > >                 /* free context, call shutdown callback */
> > >                 FreeDecodingContext(ctx);
> > >
> > >                 ReplicationSlotRelease();
> > >                 InvalidateSystemCaches();
> > >         }
> > >         PG_CATCH();
> > >         {
> > > +
> > > +               /* free context, call shutdown callback */
> > > +               FreeDecodingContext(ctx);
> > > +
> > > +               ReplicationSlotRelease();
> > >
> > > I think we could not directly call the cleanup functions here. There
> > > could be two risks:
> > > 1) it's possible either 'ctx' hasn't been initialized before the
> > > error happen, we don't need to clean it up in this case
> > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
> > > normally before entering the catch() block which means we will double
> cleanup(free) it.
> > >
> > > To improve this, I think we can use PG_FINALLY() here and move all
> > > these cleanup functions in it and do the null check for 'ctx' before cleaning
> up.
> > >
> > > Just share one small patch based on your fix for reference, feel free to
> update it.
> > >
> >
> > - LogicalDecodingContext *ctx;
> > + LogicalDecodingContext *ctx = NULL;
> >
> > Don't we need to use volatile for ctx? See the following comment in
> > elog.h: "Note: if a local variable of the function containing PG_TRY
> > is modified in the PG_TRY section and used in the PG_CATCH section,
> > that variable must be declared "volatile" for POSIX compliance."?
> >

Thanks for pointing it out, I think we need to use volatile for it.

> 
> BTW, don't we need a similar change in pg_logical_replication_slot_advance()?

You are right, I can reproduce the same following the steps in [1] and
replace the last two pg_logical_slot_get_changes function calls with two
"SELECT pg_replication_slot_advance('isolation_slot', (select
pg_current_wal_lsn()));"

Here is a new version patch based on PG15 STABLE which addressed above comments.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57160A7DD10B517AFDC73378941AA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best Regards,
Hou zj

Attachment

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Masahiko Sawada
Date:
On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> > >
> > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form
> > <noreply@postgresql.org> wrote:
> > > > >
> > > > > The following bug has been logged on the website:
> > > > >
> > > > > Bug reference:      18055
> > > > > Logged by:          ocean li
> > > > > Email address:      ocean_li_996@163.com
> > > > > PostgreSQL version: 11.9
> > > > > Operating system:   centos7 5.10.84 x86_64
> > > > > Description:
> > > > >
> > > > > For testing logical decoding module, *pg_logical_slot_get_changes*
> > > > > function is used. Sometimes i got an core whose stack was like that:
> > > > > ...
> > > > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > > > InitialRunningXacts is not NULL observed in one of  cores.
> > > > >
> > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear.
> > > > > Currently, the core, as far as i know, maybe caused by this way:
> > > > > an ERROR raised when calling
> > > > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > > > PG_CATCH() doses not reset  NInitialRunningXacts and
> > InitialRunningXacts.
> > > > > Then, calling pg_logical_slot_get_changes_guts again, the core may
> > occur.
> > > > > Unfortunately, I couldn't find a minimal reproduction case.
> > > > > However, I observed an *ERROR: canceling statement due to
> > > > > statement timeout*  logged before each core occurred. (For some
> > > > > reason, I can't provide the information of
> > > > > log)
> > > >
> > > > Thanks for the report and the fix!
> > > >
> > > > I can also reproduce by the steps[1] in PG15~11(Note that we need to
> > > > change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra
> > > > running xacts wal records).
> > > >
> > > > About the patch:
> > > >
> > > > ---
> > > >                 /* free context, call shutdown callback */
> > > >                 FreeDecodingContext(ctx);
> > > >
> > > >                 ReplicationSlotRelease();
> > > >                 InvalidateSystemCaches();
> > > >         }
> > > >         PG_CATCH();
> > > >         {
> > > > +
> > > > +               /* free context, call shutdown callback */
> > > > +               FreeDecodingContext(ctx);
> > > > +
> > > > +               ReplicationSlotRelease();
> > > >
> > > > I think we could not directly call the cleanup functions here. There
> > > > could be two risks:
> > > > 1) it's possible either 'ctx' hasn't been initialized before the
> > > > error happen, we don't need to clean it up in this case
> > > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
> > > > normally before entering the catch() block which means we will double
> > cleanup(free) it.
> > > >
> > > > To improve this, I think we can use PG_FINALLY() here and move all
> > > > these cleanup functions in it and do the null check for 'ctx' before cleaning
> > up.
> > > >
> > > > Just share one small patch based on your fix for reference, feel free to
> > update it.
> > > >
> > >
> > > - LogicalDecodingContext *ctx;
> > > + LogicalDecodingContext *ctx = NULL;
> > >
> > > Don't we need to use volatile for ctx? See the following comment in
> > > elog.h: "Note: if a local variable of the function containing PG_TRY
> > > is modified in the PG_TRY section and used in the PG_CATCH section,
> > > that variable must be declared "volatile" for POSIX compliance."?
> > >
>
> Thanks for pointing it out, I think we need to use volatile for it.
>
> >
> > BTW, don't we need a similar change in pg_logical_replication_slot_advance()?
>
> You are right, I can reproduce the same following the steps in [1] and
> replace the last two pg_logical_slot_get_changes function calls with two
> "SELECT pg_replication_slot_advance('isolation_slot', (select
> pg_current_wal_lsn()));"

I've not tested but the same is true in a case where
pg_create_logical_replication_slot() raises an error (e.g., cancelled
while DecodingContextFindStartpoint()) after creating the decoding
context?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Thursday, August 17, 2023 10:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Thursday, August 17, 2023 6:07 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila
> > > <amit.kapila16@gmail.com>
> > > wrote:
> > > >
> > > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu)
> > > > <houzj.fnst@fujitsu.com> wrote:
> > > > >
> > > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form
> > > <noreply@postgresql.org> wrote:
> > > > > >
> > > > > > The following bug has been logged on the website:
> > > > > >
> > > > > > Bug reference:      18055
> > > > > > Logged by:          ocean li
> > > > > > Email address:      ocean_li_996@163.com
> > > > > > PostgreSQL version: 11.9
> > > > > > Operating system:   centos7 5.10.84 x86_64
> > > > > > Description:
> > > > > >
> > > > > > For testing logical decoding module,
> > > > > > *pg_logical_slot_get_changes* function is used. Sometimes i got an
> core whose stack was like that:
> > > > > > ...
> > > > > > And in level #3 of stack above, NInitialRunningXacts is 2 and
> > > > > > InitialRunningXacts is not NULL observed in one of  cores.
> > > > > >
> > > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear.
> > > > > > Currently, the core, as far as i know, maybe caused by this way:
> > > > > > an ERROR raised when calling
> > > > > > *pg_logical_slot_get_changes_guts* function. The code part of
> > > > > > PG_CATCH() doses not reset  NInitialRunningXacts and
> > > InitialRunningXacts.
> > > > > > Then, calling pg_logical_slot_get_changes_guts again, the core
> > > > > > may
> > > occur.
> > > > > > Unfortunately, I couldn't find a minimal reproduction case.
> > > > > > However, I observed an *ERROR: canceling statement due to
> > > > > > statement timeout*  logged before each core occurred. (For
> > > > > > some reason, I can't provide the information of
> > > > > > log)
> > > > >
> > > > > Thanks for the report and the fix!
> > > > >
> > > > > I can also reproduce by the steps[1] in PG15~11(Note that we
> > > > > need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to
> > > > > avoid extra running xacts wal records).
> > > > >
> > > > > About the patch:
> > > > >
> > > > > ---
> > > > >                 /* free context, call shutdown callback */
> > > > >                 FreeDecodingContext(ctx);
> > > > >
> > > > >                 ReplicationSlotRelease();
> > > > >                 InvalidateSystemCaches();
> > > > >         }
> > > > >         PG_CATCH();
> > > > >         {
> > > > > +
> > > > > +               /* free context, call shutdown callback */
> > > > > +               FreeDecodingContext(ctx);
> > > > > +
> > > > > +               ReplicationSlotRelease();
> > > > >
> > > > > I think we could not directly call the cleanup functions here.
> > > > > There could be two risks:
> > > > > 1) it's possible either 'ctx' hasn't been initialized before the
> > > > > error happen, we don't need to clean it up in this case
> > > > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up
> > > > > normally before entering the catch() block which means we will
> > > > > double
> > > cleanup(free) it.
> > > > >
> > > > > To improve this, I think we can use PG_FINALLY() here and move
> > > > > all these cleanup functions in it and do the null check for
> > > > > 'ctx' before cleaning
> > > up.
> > > > >
> > > > > Just share one small patch based on your fix for reference, feel
> > > > > free to
> > > update it.
> > > > >
> > > >
> > > > - LogicalDecodingContext *ctx;
> > > > + LogicalDecodingContext *ctx = NULL;
> > > >
> > > > Don't we need to use volatile for ctx? See the following comment
> > > > in
> > > > elog.h: "Note: if a local variable of the function containing
> > > > PG_TRY is modified in the PG_TRY section and used in the PG_CATCH
> > > > section, that variable must be declared "volatile" for POSIX compliance."?
> > > >
> >
> > Thanks for pointing it out, I think we need to use volatile for it.
> >
> > >
> > > BTW, don't we need a similar change in
> pg_logical_replication_slot_advance()?
> >
> > You are right, I can reproduce the same following the steps in [1] and
> > replace the last two pg_logical_slot_get_changes function calls with
> > two "SELECT pg_replication_slot_advance('isolation_slot', (select
> > pg_current_wal_lsn()));"
> 
> I've not tested but the same is true in a case where
> pg_create_logical_replication_slot() raises an error (e.g., cancelled while
> DecodingContextFindStartpoint()) after creating the decoding context?

Indeed, I can reproduce this case[1] as well, thanks!

Here is a new version patch which addressed this based on PG15.

Apart from this, I am thinking we’d better also apply the fix on HEAD even if
it doesn't have such flag and Assert. Because in error case(HEAD), we will skip
calling shutdown callbacks for output plugins, so it could cause other
problems. For example, in pgoutput we free and reset the hash table in shutdown
callbacks, if we skip this, and next time we consume changes from another slot,
we would use an expired hash table which can cause problems. I can provide
patches for all branches if the fix is OK.


[1]
~~ reproduction steps (Need to change LOG_SNAPSHOT_INTERVAL_MS to a bigger number)

session 1:
-----
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
CREATE TABLE test(a int);
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Note, to let the current slot restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(), the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot.

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

session 1:
-----
- The current backend will serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
-----

session 3:
-----
- Release the process and we can see if restore the snaphost and set NInitialRunningXacts to 1.
- Then let it report an ERROR and try to create the slot again after ERROR.
 SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

Best Regards,
Hou zj







Attachment

Re:RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
ocean_li_996
Date:
Excited! Thanks for the discussion and  the fix!

The bug seems to be solved well. And usage of PG_FINALLY can make it more elegantly.  However, for the version lacking of  PG_FINALLY, should the patch be written like that:

---
/* free context, call shutdown callback */
FreeDecodingContext(ctx);

ReplicationSlotRelease();
InvalidateSystemCaches();
}
PG_CATCH();
{
+
+    /* free context, call shutdown callback */
+     if (ctx != NULL)
+             FreeDecodingContext(ctx);
+
+    ReplicationSlotRelease();

----


Best regards!



At 2023-08-18 12:21:53, "Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com> wrote: Here is a new version patch which addressed this based on PG15.

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Andres Freund
Date:
Hi,

On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:
> From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
> From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
> Date: Thu, 17 Aug 2023 19:29:34 +0800
> Subject: [PATCH v4] cleanup decoding context in error cases
>
> Some of the management functions for logical decoding didn't clean up the
> decoding context when an error occurs during decoding. This can
> result in accessing unfreed resources and assert failure the next time we call
> these functions. Fix it by cleaning up the decoding context and slots in error
> cases.

I don't think this is the right fix - at all. We shouldn't run arbitrary code
after a failure, which we do by calling the shutdown callback. It's not going
to expect to be called in a context where you're not allowed to do very much.

Consider what would happen if FreeDecodingContext() or such failed - we'd not
have executed InvalidateSystemCaches(), and therefore would continue with
corrupted system caches.

There's also no need to free memory here, that's already done via the memory
context cleanup of the "surrounding" memory context (i.e. CurrentMemoryContext
when StartupDecodingContext() is called)


I think the real problem is that 272248a0c added InitialRunningXacts a global
variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
be an issue? The commit justified this choice with avoiding an ABI breakage -
but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
disk (that's SnapBuildOnDisk). There's no ABI to preserve?

Greetings,

Andres Freund



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Masahiko Sawada
Date:
On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>
> I think the real problem is that 272248a0c added InitialRunningXacts a global
> variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
> be an issue? The commit justified this choice with avoiding an ABI breakage -
> but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
> disk (that's SnapBuildOnDisk).

No, since SnapBuildOnDisk contains SnapBuild, if we add something to
SnapBuild, the on-disk format compatibility would break. See:

/*
 * We store current state of struct SnapBuild on disk in the following manner:
 *
 * struct SnapBuildOnDisk;
 * TransactionId * running.xcnt_space;
 * TransactionId * committed.xcnt; (*not xcnt_space*)
 *
 */
typedef struct SnapBuildOnDisk
{
    /* first part of this struct needs to be version independent */

    /* data not covered by checksum */
    uint32      magic;
    pg_crc32c   checksum;

    /* data covered by checksum */

    /* version, in case we want to support pg_upgrade */
    uint32      version;
    /* how large is the on disk data, excluding the constant sized part */
    uint32      length;

    /* version dependent part */
    SnapBuild   builder;
:

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:
> > From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
> > From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
> > Date: Thu, 17 Aug 2023 19:29:34 +0800
> > Subject: [PATCH v4] cleanup decoding context in error cases
> >
> > Some of the management functions for logical decoding didn't clean up the
> > decoding context when an error occurs during decoding. This can
> > result in accessing unfreed resources and assert failure the next time we call
> > these functions. Fix it by cleaning up the decoding context and slots in error
> > cases.
>
> I don't think this is the right fix - at all. We shouldn't run arbitrary code
> after a failure, which we do by calling the shutdown callback.
>

But OTOH, it can prevent freeing some global memory like in the case
of pgoutput_shutdown which frees some memory allocated in
CacheMemoryContext. Also, as pointed out by Hou-San[1], it can lead to
unwanted behavior as next time we can access some invalid entries.

The other alternatives to fix are (a) Have some Reset* kind of
function (similar to ResetReindexState) to reset the required
variables and call at AbortTransaction time. The minor risk could be
that if any extension uses the same function name as ours but not sure
if it is worth worrying about. (b) Initialize the required variables
in AllocateSnapshotBuilder().

[1] -
https://www.postgresql.org/message-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Andres Freund
Date:
Hi,

On 2023-08-21 21:41:46 +0900, Masahiko Sawada wrote:
> On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> >
> > I think the real problem is that 272248a0c added InitialRunningXacts a global
> > variable. If it just lived in in struct SnapBuild, this whole thing wouldn't
> > be an issue? The commit justified this choice with avoiding an ABI breakage -
> > but isn't that bogus? The struct is private to snapbuild.c. It doesn't live on
> > disk (that's SnapBuildOnDisk).
> 
> No, since SnapBuildOnDisk contains SnapBuild, if we add something to
> SnapBuild, the on-disk format compatibility would break. See:

Err, yes. That was a brainfart on my part. But we could still just have
handled that via the 'version' field.

Greetings,

Andres Freund



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Andres Freund
Date:
Hi,

On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:
> On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:
> > > From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
> > > From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
> > > Date: Thu, 17 Aug 2023 19:29:34 +0800
> > > Subject: [PATCH v4] cleanup decoding context in error cases
> > >
> > > Some of the management functions for logical decoding didn't clean up the
> > > decoding context when an error occurs during decoding. This can
> > > result in accessing unfreed resources and assert failure the next time we call
> > > these functions. Fix it by cleaning up the decoding context and slots in error
> > > cases.
> >
> > I don't think this is the right fix - at all. We shouldn't run arbitrary code
> > after a failure, which we do by calling the shutdown callback.
> >
> 
> But OTOH, it can prevent freeing some global memory like in the case
> of pgoutput_shutdown which frees some memory allocated in
> CacheMemoryContext. Also, as pointed out by Hou-San[1], it can lead to
> unwanted behavior as next time we can access some invalid entries.

To me the code in pgoutput is just bad.  For one, it uses a global variable
for non-global state - instead it should store the data in
LogicalDecodingContext->output_plugin_private. Secondly, it should not use
CacheMemoryContext, given that the cache apparently is local to a individual
pgoutput instance.


But: Is there actually a danger of invalid entries being accessed? It looks to
me like pgoutput uses invalidation callbacks etc, which would prevent the
cache from being stale?


> The other alternatives to fix are (a) Have some Reset* kind of
> function (similar to ResetReindexState) to reset the required
> variables and call at AbortTransaction time.

-1. This makes things global concerns that shouldn't be.

If we really need something to clean this up, I'd look at
MemoryContextRegisterResetCallback().

Greetings,

Andres Freund



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:
> > On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:
> > > > From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17 00:00:00 2001
> > > > From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
> > > > Date: Thu, 17 Aug 2023 19:29:34 +0800
> > > > Subject: [PATCH v4] cleanup decoding context in error cases
> > > >
> > > > Some of the management functions for logical decoding didn't clean up the
> > > > decoding context when an error occurs during decoding. This can
> > > > result in accessing unfreed resources and assert failure the next time we call
> > > > these functions. Fix it by cleaning up the decoding context and slots in error
> > > > cases.
> > >
> > > I don't think this is the right fix - at all. We shouldn't run arbitrary code
> > > after a failure, which we do by calling the shutdown callback.
> > >
> >
> > But OTOH, it can prevent freeing some global memory like in the case
> > of pgoutput_shutdown which frees some memory allocated in
> > CacheMemoryContext. Also, as pointed out by Hou-San[1], it can lead to
> > unwanted behavior as next time we can access some invalid entries.
>
> To me the code in pgoutput is just bad.  For one, it uses a global variable
> for non-global state - instead it should store the data in
> LogicalDecodingContext->output_plugin_private. Secondly, it should not use
> CacheMemoryContext, given that the cache apparently is local to a individual
> pgoutput instance.
>

I agree that this is a better way and we should change the code. I
think we can probably allocate this in RelationSyncCache.

>
> But: Is there actually a danger of invalid entries being accessed? It looks to
> me like pgoutput uses invalidation callbacks etc, which would prevent the
> cache from being stale?
>

Actually, during invalidation, we simply mark entries as invalid, and
during the next access, we validate those by freeing the old stuff and
then build it from scratch. I feel if we do what you said in the
previous point this shouldn't be a problem either.

>
> > The other alternatives to fix are (a) Have some Reset* kind of
> > function (similar to ResetReindexState) to reset the required
> > variables and call at AbortTransaction time.
>
> -1. This makes things global concerns that shouldn't be.
>
> If we really need something to clean this up, I'd look at
> MemoryContextRegisterResetCallback().
>

+1. This sounds like a better idea.

--
With Regards,
Amit Kapila.



RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, August 23, 2023 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
> wrote:
> >
> > On 2023-08-21 19:07:16 +0530, Amit Kapila wrote:
> > > On Mon, Aug 21, 2023 at 2:35 AM Andres Freund <andres@anarazel.de>
> wrote:
> > > >
> > > > On 2023-08-18 04:21:53 +0000, Zhijie Hou (Fujitsu) wrote:
> > > > > From ee1dfccc0306812c356c84bbd7e2558f27d7d348 Mon Sep 17
> > > > > 00:00:00 2001
> > > > > From: Hou Zhijie <houzj.fnst@cn.fujitsu.com>
> > > > > Date: Thu, 17 Aug 2023 19:29:34 +0800
> > > > > Subject: [PATCH v4] cleanup decoding context in error cases
> > > > >
> > > > > Some of the management functions for logical decoding didn't
> > > > > clean up the decoding context when an error occurs during
> > > > > decoding. This can result in accessing unfreed resources and
> > > > > assert failure the next time we call these functions. Fix it by
> > > > > cleaning up the decoding context and slots in error cases.
> > > >
> > > > I don't think this is the right fix - at all. We shouldn't run
> > > > arbitrary code after a failure, which we do by calling the shutdown
> callback.
> > > >
> > >
> >
> > > The other alternatives to fix are (a) Have some Reset* kind of
> > > function (similar to ResetReindexState) to reset the required
> > > variables and call at AbortTransaction time.
> >
> > -1. This makes things global concerns that shouldn't be.
> >
> > If we really need something to clean this up, I'd look at
> > MemoryContextRegisterResetCallback().
> >
> 
> +1. This sounds like a better idea.

Here is a small patch based on this idea.
I registered the callback under snapshot builder context
and the callback will reset the variables.

Best Regards,
Hou zj
 


Attachment

RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, August 22, 2023 2:19 AM Andres Freund <andres@anarazel.de> wrote:
> 
> Hi,
> 
> On 2023-08-21 21:41:46 +0900, Masahiko Sawada wrote:
> > On Mon, Aug 21, 2023 at 6:05 AM Andres Freund <andres@anarazel.de>
> wrote:
> > >
> > > Hi,
> > >
> > >
> > > I think the real problem is that 272248a0c added InitialRunningXacts
> > > a global variable. If it just lived in in struct SnapBuild, this
> > > whole thing wouldn't be an issue? The commit justified this choice
> > > with avoiding an ABI breakage - but isn't that bogus? The struct is
> > > private to snapbuild.c. It doesn't live on disk (that's SnapBuildOnDisk).
> >
> > No, since SnapBuildOnDisk contains SnapBuild, if we add something to
> > SnapBuild, the on-disk format compatibility would break. See:
> 
> Err, yes. That was a brainfart on my part. But we could still just have handled that
> via the 'version' field.

I thought you mean we can add the InitialRunningXacts into the SnapBuild struct
while avoid serializing them to the disk, so that the disk file size doesn't
change and the reported problem also get fixed.

I think, normally, we need to increment the snapbuild version when changing the
struct size, but this sounds a bit complex as we need to increment the number in all
support back-branches which have different SNAPBUILD_VERSION.

Or if we agree we don't change the version number in this case as we don't plan
to change the disk file size(only change the struct size), it seems work,
although it looks a bit hacky.

Best Regards,
Hou zj

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, August 23, 2023 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
> > wrote:
> > >
> > > -1. This makes things global concerns that shouldn't be.
> > >
> > > If we really need something to clean this up, I'd look at
> > > MemoryContextRegisterResetCallback().
> > >
> >
> > +1. This sounds like a better idea.
>
> Here is a small patch based on this idea.
> I registered the callback under snapshot builder context
> and the callback will reset the variables.
>

LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to
proceed with this fix.

--
With Regards,
Amit Kapila.



RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, September 5, 2023 7:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > On Wednesday, August 23, 2023 4:47 PM Amit Kapila
> <amit.kapila16@gmail.com> wrote:
> > >
> > > On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
> > > wrote:
> > > >
> > > > -1. This makes things global concerns that shouldn't be.
> > > >
> > > > If we really need something to clean this up, I'd look at
> > > > MemoryContextRegisterResetCallback().
> > > >
> > >
> > > +1. This sounds like a better idea.
> >
> > Here is a small patch based on this idea.
> > I registered the callback under snapshot builder context and the
> > callback will reset the variables.
> >
> 
> LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to proceed
> with this fix.

Thanks, attach the patches for all support branches. I have confirmed that they can
pass the regression test and fix the issue.

Best Regards,
Hou zj

Attachment

Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Thu, Sep 7, 2023 at 5:40 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Tuesday, September 5, 2023 7:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > wrote:
> > >
> > > On Wednesday, August 23, 2023 4:47 PM Amit Kapila
> > <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
> > > > wrote:
> > > > >
> > > > > -1. This makes things global concerns that shouldn't be.
> > > > >
> > > > > If we really need something to clean this up, I'd look at
> > > > > MemoryContextRegisterResetCallback().
> > > > >
> > > >
> > > > +1. This sounds like a better idea.
> > >
> > > Here is a small patch based on this idea.
> > > I registered the callback under snapshot builder context and the
> > > callback will reset the variables.
> > >
> >
> > LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to proceed
> > with this fix.
>
> Thanks, attach the patches for all support branches. I have confirmed that they can
> pass the regression test and fix the issue.
>

I have also verified and it works for me as well. I'll push this (from
PG11-PG15) next week by Tuesday unless there are more comments.

--
With Regards,
Amit Kapila.



Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()

From
Amit Kapila
Date:
On Sat, Sep 9, 2023 at 5:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Sep 7, 2023 at 5:40 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, September 5, 2023 7:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Aug 25, 2023 at 9:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> > > wrote:
> > > >
> > > > On Wednesday, August 23, 2023 4:47 PM Amit Kapila
> > > <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 21, 2023 at 11:57 PM Andres Freund <andres@anarazel.de>
> > > > > wrote:
> > > > > >
> > > > > > -1. This makes things global concerns that shouldn't be.
> > > > > >
> > > > > > If we really need something to clean this up, I'd look at
> > > > > > MemoryContextRegisterResetCallback().
> > > > > >
> > > > >
> > > > > +1. This sounds like a better idea.
> > > >
> > > > Here is a small patch based on this idea.
> > > > I registered the callback under snapshot builder context and the
> > > > callback will reset the variables.
> > > >
> > >
> > > LGTM. Unless Andres or Sawada-San thinks otherwise, I would like to proceed
> > > with this fix.
> >
> > Thanks, attach the patches for all support branches. I have confirmed that they can
> > pass the regression test and fix the issue.
> >
>
> I have also verified and it works for me as well. I'll push this (from
> PG11-PG15) next week by Tuesday unless there are more comments.
>

Pushed.

--
With Regards,
Amit Kapila.