Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs

From ocean_li_996
Subject Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Date
Msg-id 12586894.d69f.189f501073a.Coremail.ocean_li_996@163.com
Whole thread Raw
In response to BUG #18055: logical decoding core on AllocateSnapshotBuilder()  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
font{ line-height: 1.6; } ul,ol{ padding-left: 20px; list-style-position: inside; }
By applying the test.patch in attachment, I got similar core stacks following the procedure:
==>
postgres=# alter system set wal_level to 'logical';
-- restart server
postgres=# set statement_timeout to 1000;
SET
postgres=# select pg_create_logical_replication_slot('test', 'test_decoding', 0);
WARNING:  NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
pg_create_logical_replication_slot
------------------------------------
(test,0/423999D8)
(1 row)
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING:  NInitialRunningXacts is 0 , InitialRunningXacts is (nil)
ERROR:  canceling statement due to statement timeout
postgres=# select pg_logical_slot_get_changes('test', NULL, NULL);
WARNING:  NInitialRunningXacts is 3 , InitialRunningXacts is 0x373bc18
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
==>

And core stacks got above are
==>
(gdb) bt
#0  0x00007fbf2c987277 in raise () from /lib64/libc.so.6
#1  0x00007fbf2c988968 in abort () from /lib64/libc.so.6
#2  0x0000000000deb38e in ExceptionalCondition (
   conditionName=0x11f5d70 "!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
   errorType=0x11f5d60 "FailedAssertion", fileName=0x11f5d00 "snapbuild.c", lineNumber=385)
   at assert.c:46
#3  0x0000000000b5d7d2 in AllocateSnapshotBuilder (reorder=0x35795d8, xmin_horizon=0,
   start_lsn=1111114144, need_full_snapshot=false) at snapbuild.c:385
#4  0x0000000000b4d4c2 in StartupDecodingContext (output_plugin_options=0x0, start_lsn=1111114144,
   xmin_horizon=0, need_full_snapshot=false, fast_forward=false,
   read_page=0xb4f5a5 ,
   prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
   update_progress=0x0) at logical.c:191
#5  0x0000000000b4de0a in CreateDecodingContext (start_lsn=1111114144, output_plugin_options=0x0,
   fast_forward=false, read_page=0xb4f5a5 ,
   prepare_write=0xb4f378 , do_write=0xb4f3a6 ,
   update_progress=0x0) at logical.c:486
#6  0x0000000000b4fcb7 in pg_logical_slot_get_changes_guts (fcinfo=0x344fbe8, confirm=true, binary=false)
   at logicalfuncs.c:260
#7  0x0000000000b500b1 in pg_logical_slot_get_changes (fcinfo=0x344fbe8) at logicalfuncs.c:395
==>

In my view, this bug can be fix by using fix.patch in attachment.

Regards.

---- Replied Message ----
From PG Bug reporting form<noreply@postgresql.org>
Date 8/15/2023 00:05
To <pgsql-bugs@lists.postgresql.org>
Cc <ocean_li_996@163.com>
Subject BUG #18055: logical decoding core on AllocateSnapshotBuilder()
The following bug has been logged on the website:

Bug reference:      18055
Logged by:          ocean li
Email address:      ocean_li_996@163.com
PostgreSQL version: 11.9
Operating system:   centos7 5.10.84 x86_64
Description:        

For testing logical decoding module, *pg_logical_slot_get_changes* function
is used. Sometimes i got an core whose stack was like that:
==>
#0  0x00007f744a7b9277 in raise () from /lib64/libc.so.6
#1  0x00007f744a7ba968 in abort () from /lib64/libc.so.6
#2  0x00000000010edd37 in ExceptionalCondition (conditionName=0x17e2b58
"!(NInitialRunningXacts == 0 && InitialRunningXacts == ((void *)0))",
errorType=0x17e2b45 "FailedAssertion", fileName=0x17e2b39 "snapbuild.c",
lineNumber=381) at assert.c:46
#3  0x0000000000e60b46 in AllocateSnapshotBuilder (reorder=0x551ea98,
xmin_horizon=0, start_lsn=1267160216, need_full_snapshot=false) at
snapbuild.c:381
#4  0x0000000000e50f70 in StartupDecodingContext (output_plugin_options=0x0,
start_lsn=1267160216, xmin_horizon=0, need_full_snapshot=false,
fast_forward=false, read_page=0xe53023 <logical_read_local_xlog_page>,
prepare_write=0xe52df6 <LogicalOutputPrepareWrite>, do_write=0xe52e24
<LogicalOutputWrite>, update_progress=0x0) at logical.c:191
#5  0x0000000000e518b8 in CreateDecodingContext (start_lsn=1267160216,
output_plugin_options=0x0, fast_forward=false, read_page=0xe53023
<logical_read_local_xlog_page>, prepare_write=0xe52df6
<LogicalOutputPrepareWrite>, do_write=0xe52e24 <LogicalOutputWrite>,
update_progress=0x0) at logical.c:486
#6  0x0000000000e53735 in pg_logical_slot_get_changes_guts
(fcinfo=0x7ffcd879e3d0, confirm=true, binary=false) at logicalfuncs.c:259
#7  0x0000000000e53b1c in pg_logical_slot_get_changes
(fcinfo=0x7ffcd879e3d0) at logicalfuncs.c:393
#8  0x00000000010ff89e in FunctionCallInvokeCheckSPL (fcinfo=0x7ffcd879e3d0)
at fmgr.c:2262
...
==>
And in level #3 of stack above, NInitialRunningXacts is 2 and
InitialRunningXacts is not NULL observed in one of  cores.

Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently,
the core, as far as i know, maybe caused by this way: an ERROR raised when
calling *pg_logical_slot_get_changes_guts* function. The code part of
PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
Unfortunately, I couldn't find a minimal reproduction case. However, I
observed an *ERROR: canceling statement due to statement timeout*  logged
before each core occurred. (For some reason, I can't provide the information
of log)

Attachment

pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Next
From: Michael Paquier
Date:
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary