Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
Date | |
Msg-id | CAA4eK1LXD6iv2oEL8a-qYx6Y+MPv7eBim-dF-NCsJ-QL8z3j1Q@mail.gmail.com Whole thread Raw |
In response to | RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder() ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Responses |
Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
(Amit Kapila <amit.kapila16@gmail.com>)
|
List | pgsql-bugs |
On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote: > > > > The following bug has been logged on the website: > > > > Bug reference: 18055 > > Logged by: ocean li > > Email address: ocean_li_996@163.com > > PostgreSQL version: 11.9 > > Operating system: centos7 5.10.84 x86_64 > > Description: > > > > For testing logical decoding module, *pg_logical_slot_get_changes* function > > is used. Sometimes i got an core whose stack was like that: > > ... > > And in level #3 of stack above, NInitialRunningXacts is 2 and > > InitialRunningXacts is not NULL observed in one of cores. > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the > > core, as far as i know, maybe caused by this way: an ERROR raised when calling > > *pg_logical_slot_get_changes_guts* function. The code part of > > PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts. > > Then, calling pg_logical_slot_get_changes_guts again, the core may occur. > > Unfortunately, I couldn't find a minimal reproduction case. However, I > > observed an *ERROR: canceling statement due to statement timeout* logged > > before each core occurred. (For some reason, I can't provide the information of > > log) > > Thanks for the report and the fix! > > I can also reproduce by the steps[1] in PG15~11(Note that we need to change > LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal > records). > > About the patch: > > --- > /* free context, call shutdown callback */ > FreeDecodingContext(ctx); > > ReplicationSlotRelease(); > InvalidateSystemCaches(); > } > PG_CATCH(); > { > + > + /* free context, call shutdown callback */ > + FreeDecodingContext(ctx); > + > + ReplicationSlotRelease(); > > I think we could not directly call the cleanup functions here. There could be > two risks: > 1) it's possible either 'ctx' hasn't been initialized before the error happen, > we don't need to clean it up in this case > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally > before entering the catch() block which means we will double cleanup(free) it. > > To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions > in it and do the null check for 'ctx' before cleaning up. > > Just share one small patch based on your fix for reference, feel free to update it. > - LogicalDecodingContext *ctx; + LogicalDecodingContext *ctx = NULL; Don't we need to use volatile for ctx? See the following comment in elog.h: "Note: if a local variable of the function containing PG_TRY is modified in the PG_TRY section and used in the PG_CATCH section, that variable must be declared "volatile" for POSIX compliance."? -- With Regards, Amit Kapila.
pgsql-bugs by date: