RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
Date | |
Msg-id | OS0PR01MB57160A7DD10B517AFDC73378941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In response to | BUG #18055: logical decoding core on AllocateSnapshotBuilder() (PG Bug reporting form <noreply@postgresql.org>) |
Responses | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
List | pgsql-bugs |
On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 18055
> Logged by: ocean li
> Email address: ocean_li_996@163.com
> PostgreSQL version: 11.9
> Operating system: centos7 5.10.84 x86_64
> Description:
>
> For testing logical decoding module, *pg_logical_slot_get_changes* function
> is used. Sometimes i got an core whose stack was like that:
> ...
> And in level #3 of stack above, NInitialRunningXacts is 2 and
> InitialRunningXacts is not NULL observed in one of cores.
>
> Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> core, as far as i know, maybe caused by this way: an ERROR raised when calling
> *pg_logical_slot_get_changes_guts* function. The code part of
> PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts.
> Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> Unfortunately, I couldn't find a minimal reproduction case. However, I
> observed an *ERROR: canceling statement due to statement timeout* logged
> before each core occurred. (For some reason, I can't provide the information of
> log)

Thanks for the report and the fix!

I can also reproduce this with the steps in [1] on PG 11 through 15 (note that
LOG_SNAPSHOT_INTERVAL_MS needs to be changed to a bigger number to avoid extra
running-xacts WAL records).

About the patch:

---
 	/* free context, call shutdown callback */
 	FreeDecodingContext(ctx);
 	ReplicationSlotRelease();
 	InvalidateSystemCaches();
 }
 PG_CATCH();
 {
+
+	/* free context, call shutdown callback */
+	FreeDecodingContext(ctx);
+
+	ReplicationSlotRelease();

I don't think we can call the cleanup functions directly here. There are two
risks:

1) 'ctx' may not have been initialized yet when the error happens, in which
   case there is nothing to clean up.
2) 'ctx' or 'MyReplicationSlot' may already have been cleaned up normally
   before entering the catch block, which means we would clean it up (free it)
   twice.

To improve this, I think we can use PG_FINALLY() here, move all these cleanup
functions into it, and do a null check on 'ctx' before cleaning up.

I'm sharing a small patch based on your fix for reference; feel free to update
it.

[1]

Session 1:
DROP TABLE IF EXISTS tbl1;
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
BEGIN;
SAVEPOINT sp1;
INSERT INTO tbl1 VALUES (1);

Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

Session 1:
TRUNCATE tbl1;
COMMIT;
BEGIN;
INSERT INTO tbl1 VALUES (1);

Session 2:
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

Session 1:
COMMIT;

Session 2:
-- Use gdb or modify the code to make the first function call fail.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
-- Since the previous command failed and NInitialRunningXacts was not reset,
-- we should see the Assert() failure this time.
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');

Best Regards,
Hou zj
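
To illustrate the PG_FINALLY() idea discussed above, here is a minimal standalone sketch of the "clean up in one place, guarded by a null check" pattern. It is not the attached patch and does not use PostgreSQL's headers: it emulates the error unwinding with plain setjmp/longjmp, and every name in it (SketchContext, sketch_error, get_changes_sketch, the counters, the slot_held flag) is hypothetical. It also returns -1 where the real PG_FINALLY() would re-throw the error.

```c
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical stand-in for the decoding context ('ctx' in the thread). */
typedef struct SketchContext { int unused; } SketchContext;

static jmp_buf *handler = NULL;   /* innermost active error handler */
static int slot_held = 0;         /* stand-in for "MyReplicationSlot is set" */
static int contexts_freed = 0;    /* counters so the behaviour can be checked */
static int slots_released = 0;

/* Stand-in for raising an ERROR: jump to the active handler. */
static void sketch_error(void)
{
    if (handler == NULL)
        abort();
    longjmp(*handler, 1);
}

static SketchContext *sketch_create_context(void)
{
    SketchContext *ctx = malloc(sizeof(SketchContext));

    if (ctx == NULL)
        abort();
    ctx->unused = 0;
    return ctx;
}

static void sketch_free_context(SketchContext *ctx) { free(ctx); contexts_freed++; }
static void sketch_acquire_slot(void) { slot_held = 1; }
static void sketch_release_slot(void) { slot_held = 0; slots_released++; }

/*
 * fail_at: 0 = no error,
 *          1 = error after the slot is acquired but before ctx exists,
 *          2 = error after ctx has been created.
 * Returns 0 on success and -1 if an error was raised.
 */
static int get_changes_sketch(int fail_at)
{
    /* volatile: assigned between setjmp() and longjmp(), read afterwards */
    SketchContext *volatile ctx = NULL;
    jmp_buf *saved = handler;
    jmp_buf local;
    int failed = 0;

    if (setjmp(local) == 0)
    {
        handler = &local;               /* the "try" part */
        sketch_acquire_slot();
        if (fail_at == 1)
            sketch_error();
        ctx = sketch_create_context();
        if (fail_at == 2)
            sketch_error();
    }
    else
        failed = 1;                     /* arrived here via longjmp() */

    /* The "finally" part: runs exactly once on both paths. */
    handler = saved;
    if (ctx != NULL)                    /* risk 1: ctx may never have been set */
        sketch_free_context(ctx);
    if (slot_held)                      /* risk 2: never release twice */
        sketch_release_slot();

    return failed ? -1 : 0;
}
```

Because the cleanup lives in a single place that runs on both the success and the error path, the context is freed at most once, and nothing is freed when the error fires before 'ctx' is assigned. For example, get_changes_sketch(1) returns -1 without freeing any context, while get_changes_sketch(2) returns -1 after freeing it exactly once.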