RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs

From Zhijie Hou (Fujitsu)
Subject RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Date
Msg-id OS0PR01MB57160A7DD10B517AFDC73378941AA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to BUG #18055: logical decoding core on AllocateSnapshotBuilder()  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
List pgsql-bugs
On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> 
> The following bug has been logged on the website:
> 
> Bug reference:      18055
> Logged by:          ocean li
> Email address:      ocean_li_996@163.com
> PostgreSQL version: 11.9
> Operating system:   centos7 5.10.84 x86_64
> Description:
> 
> For testing logical decoding module, *pg_logical_slot_get_changes* function
> is used. Sometimes i got an core whose stack was like that:
> ...
> And in level #3 of stack above, NInitialRunningXacts is 2 and
> InitialRunningXacts is not NULL observed in one of  cores.
> 
> Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the
> core, as far as i know, maybe caused by this way: an ERROR raised when calling
> *pg_logical_slot_get_changes_guts* function. The code part of
> PG_CATCH() doses not reset  NInitialRunningXacts and InitialRunningXacts.
> Then, calling pg_logical_slot_get_changes_guts again, the core may occur.
> Unfortunately, I couldn't find a minimal reproduction case. However, I
> observed an *ERROR: canceling statement due to statement timeout*  logged
> before each core occurred. (For some reason, I can't provide the information of
> log)

Thanks for the report and the fix!

I can also reproduce by the steps[1] in PG15~11(Note that we need to change
LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal
records).

About the patch:

---
        /* free context, call shutdown callback */
        FreeDecodingContext(ctx);

        ReplicationSlotRelease();
        InvalidateSystemCaches();
    }
     PG_CATCH();
     {
+
+        /* free context, call shutdown callback */
+        FreeDecodingContext(ctx);
+
+        ReplicationSlotRelease();

I think we could not directly call the cleanup functions here. There could be
two risks:
1) it's possible either 'ctx' hasn't been initialized before the error happen,
we don't need to clean it up in this case
2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally
before entering the catch() block which means we will double cleanup(free) it.

To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions
in it and do the null check for 'ctx' before cleaning up.

Just share one small patch based on your fix for reference, feel free to update it.

[1]
Session 1:
DROP TABLE IF EXISTS tbl1;
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE user_cat (val1 integer) WITH (user_catalog_table = true);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
BEGIN;
SAVEPOINT sp1;
INSERT INTO tbl1 VALUES (1);
    Session 2:
    CHECKPOINT;
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
Session 1:
TRUNCATE tbl1;
COMMIT;
BEGIN;
INSERT INTO tbl1 VALUES (1);
    Session 2:
    CHECKPOINT;
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
Session 1:
COMMIT;

    Session 2:
    -- Use gdb or modify the code to make the first function call fail.
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');

    -- Since the previous command failed and the NInitialRunningXacts was not reset, we should see the Assert() failure
thistime.
 
    SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');

Best Regards,
Hou zj

Attachment

pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
Next
From: Amit Kapila
Date:
Subject: Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()