Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() |
Date | |
Msg-id | CAA4eK1JRRSeiYVscWmXfJjzbscFto9ih1uLTvQRzEeYK1nSA2w@mail.gmail.com Whole thread Raw |
In response to | Re: BUG #18055: logical decoding core on AllocateSnapshotBuilder() (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
RE: BUG #18055: logical decoding core on AllocateSnapshotBuilder()
|
List | pgsql-bugs |
On Thu, Aug 17, 2023 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 12:21 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Tuesday, August 15, 2023 12:05 AM PG Bug reporting form <noreply@postgresql.org> wrote: > > > > > > The following bug has been logged on the website: > > > > > > Bug reference: 18055 > > > Logged by: ocean li > > > Email address: ocean_li_996@163.com > > > PostgreSQL version: 11.9 > > > Operating system: centos7 5.10.84 x86_64 > > > Description: > > > > > > For testing logical decoding module, *pg_logical_slot_get_changes* function > > > is used. Sometimes i got an core whose stack was like that: > > > ... > > > And in level #3 of stack above, NInitialRunningXacts is 2 and > > > InitialRunningXacts is not NULL observed in one of cores. > > > > > > Using of NInitialRunningXacts and InitialRunningXacts are clear. Currently, the > > > core, as far as i know, maybe caused by this way: an ERROR raised when calling > > > *pg_logical_slot_get_changes_guts* function. The code part of > > > PG_CATCH() doses not reset NInitialRunningXacts and InitialRunningXacts. > > > Then, calling pg_logical_slot_get_changes_guts again, the core may occur. > > > Unfortunately, I couldn't find a minimal reproduction case. However, I > > > observed an *ERROR: canceling statement due to statement timeout* logged > > > before each core occurred. (For some reason, I can't provide the information of > > > log) > > > > Thanks for the report and the fix! > > > > I can also reproduce by the steps[1] in PG15~11(Note that we need to change > > LOG_SNAPSHOT_INTERVAL_MS to a bigger number to avoid extra running xacts wal > > records). > > > > About the patch: > > > > --- > > /* free context, call shutdown callback */ > > FreeDecodingContext(ctx); > > > > ReplicationSlotRelease(); > > InvalidateSystemCaches(); > > } > > PG_CATCH(); > > { > > + > > + /* free context, call shutdown callback */ > > + FreeDecodingContext(ctx); > > + > > + ReplicationSlotRelease(); > > > > I think we could not directly call the cleanup functions here. There could be > > two risks: > > 1) it's possible either 'ctx' hasn't been initialized before the error happen, > > we don't need to clean it up in this case > > 2) It's possible 'ctx' or 'MyReplicationSlot' been be cleaned up normally > > before entering the catch() block which means we will double cleanup(free) it. > > > > To improve this, I think we can use PG_FINALLY() here and move all these cleanup functions > > in it and do the null check for 'ctx' before cleaning up. > > > > Just share one small patch based on your fix for reference, feel free to update it. > > > > - LogicalDecodingContext *ctx; > + LogicalDecodingContext *ctx = NULL; > > Don't we need to use volatile for ctx? See the following comment in > elog.h: "Note: if a local variable of the function containing PG_TRY > is modified in the PG_TRY section and used in the PG_CATCH section, > that variable must be declared "volatile" for POSIX compliance."? > BTW, don't we need a similar change in pg_logical_replication_slot_advance()? -- With Regards, Amit Kapila.
pgsql-bugs by date: