Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs
From | Masahiko Sawada |
---|---|
Subject | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
Date | |
Msg-id | CAD21AoDdzK-VyTRFw81N2Haj92EuhjzgmniQXUkHo0D0KezKiQ@mail.gmail.com Whole thread Raw |
In response to | RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Responses |
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
|
List | pgsql-bugs |
On Thu, Aug 17, 2023 at 9:10 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, August 17, 2023 6:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form > > <noreply@postgresql.org> wrote: > > > > > > > > > > The following bug has been logged on the website: > > > > > > > > > > Bug reference: 18055 > > > > > Logged by: ocean li > > > > > Email address: ocean_li_996@163.com > > > > > PostgreSQL version: 11.9 > > > > > Operating system: centos7 5.10.84 x86_64 > > > > > Description: > > > > > > > > > > For testing logical decoding module, *pg_logical_slot_get_changes* > > > > > function is used. Sometimes i got an core whose stack was like that: > > > > > ... > > > > > And in level #3 of stack above, NInitialRunningXacts is 2 and > > > > > InitialRunningXacts is not NULL observed in one of cores. > > > > > > > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear. > > > > > Currently, the core, as far as i know, maybe caused by this way: > > > > > an ERROR raised when calling > > > > > *pg_logical_slot_get_changes_guts* function. The code part of > > > > > PG_CATCH() doses not reset NInitialRunningXacts and > > InitialRunningXacts. > > > > > Then, calling pg_logical_slot_get_changes_guts again, the core may > > occur. > > > > > Unfortunately, I couldn't find a minimal reproduction case. > > > > > However, I observed an *ERROR: canceling statement due to > > > > > statement timeout* logged before each core occurred. (For some > > > > > reason, I can't provide the information of > > > > > log) > > > > > > > > Thanks for the report and the fix! > > > > > > > > I can also reproduce by the steps[1] in PG15~11(Note that we need to > > > > change LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra > > > > running xacts wal records). > > > > > > > > About the patch: > > > > > > > > --- > > > > /* free context, call shutdown callback */ > > > > FreeDecodingContext(ctx); > > > > > > > > ReplicationSlotRelease(); > > > > InvalidateSystemCaches(); > > > > } > > > > PG_CATCH(); > > > > { > > > > + > > > > + /* free context, call shutdown callback */ > > > > + FreeDecodingContext(ctx); > > > > + > > > > + ReplicationSlotRelease(); > > > > > > > > I think we could not directly call the cleanup functions here. There > > > > could be two risks: > > > > 1) it's possible either 'ctx' hasn't been initialized before the > > > > error happen, we don't need to clean it up in this case > > > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up > > > > normally before entering the catch() block which means we will double > > cleanup(free) it. > > > > > > > > To improve this, I think we can use PG_FINALLY() here and move all > > > > these cleanup functions in it and do the null check for 'ctx' before cleaning > > up. > > > > > > > > Just share one small patch based on your fix for reference, feel free to > > update it. > > > > > > > > > > - LogicalDecodingContext *ctx; > > > + LogicalDecodingContext *ctx = NULL; > > > > > > Don't we need to use volatile for ctx? See the following comment in > > > elog.h: "Note: if a local variable of the function containing PG_TRY > > > is modified in the PG_TRY section and used in the PG_CATCH section, > > > that variable must be declared "volatile" for POSIX compliance."? > > > > > Thanks for pointing it out, I think we need to use volatile for it. > > > > > BTW, don't we need a similar change in pg_logical_replication_slot_advance()? > > You are right, I can reproduce the same following the steps in [1] and > replace the last two pg_logical_slot_get_changes function calls with two > "SELECT pg_replication_slot_advance('isolation_slot', (select > pg_current_wal_lsn()));" I've not tested but the same is true in a case where pg_create_logical_replication_slot() raises an error (e.g., cancelled while DecodingContextFindStartpoint()) after creating the decoding context? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-bugs by date: