Re: BUG #19109: catalog lookup with the wrong snapshot duringlogical decoding causes coredump - Mailing list pgsql-bugs

From ocean_li_996
Subject Re: BUG #19109: catalog lookup with the wrong snapshot duringlogical decoding causes coredump
Date
Msg-id 74c88ce1.a69e.19aa27b3fbe.Coremail.ocean_li_996@163.com
Whole thread Raw
In response to Re: BUG #19109: catalog lookup with the wrong snapshot duringlogical decoding causes coredump  ("cca5507" <cca5507@qq.com>)
Responses Re: BUG #19109: catalog lookup with the wrong snapshotduringlogical decoding causes coredump
List pgsql-bugs
Hi ChangAo,

At 2025-11-19 11:23:19, "cca5507" <cca5507@qq.com> wrote:
> I reported this bug before, and there is a commitfest entry for it:
>
>
> The attached patches are the latest patches taken from the commitfest.

Thanks for your report and patches. 

> This bug seem to affect a lot of PG major versions.

Absolutely. I can reproduced it from 11 to 19. Then, I researched the source code. IMUC, this bug exists since
logical replication intrduced(v9.4). It seems that snapshot builder does nothing except update builder->xmin in
SNAPBUILD_BUILDING_SNAPSHOT state. But, tracking xid which >= builder->next_pahse_at is reasonable
for me. BTW,  this bug can result in unpredictable behavior, ranging from various error reports to core dumps. 
It deserves some extra attention, I think. 

It’s somewhat unfortunate that the issue remains unresolved so far. Hopefully we can address it in the
upcoming discussion or commitfest.

> They seem to need a rebase, but I'd like to see your thoughts on these patches first.

Some comments:
1 ====
@@ -1301,6 +1338,7 @@ DecodeTXNNeedSkip(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
Oid txn_dbid, RepOriginId origin_id)
{
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ SnapBuildCurrentState(ctx->snapshot_builder) < SNAPBUILD_CONSISTENT ||
(txn_dbid != InvalidOid && txn_dbid != ctx->slot->data.database) ||
FilterByOrigin(ctx, origin_id))
return true;
It seems that this change will cause unexpected bahavior.  For expample, it will make the xact commited in SNAPBUILD_FULL_SNAPSHOT
to be skipped in DecodeCommit(). Do you mean "SnapBuildCurrentState(ctx->snapshot_builder) < SNAPBUILD_FULL_SNAPSHOT"?

2 ===
@@ -282,9 +287,11 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
TransactionId xid;
xl_xact_invals *invals;
+ bool has_snapshot;
...
+ else if (!ctx->fast_forward && has_snapshot)
ReorderBufferImmediateInvalidation(ctx->reorder,
invals->nmsgs,
invals->msgs);
May be the following change is more clear for me:
if (TransactionIdIsValid(xid) &&
SnapBuildCurrentState(builder) >= SNAPBUILD_FULL_SNAPSHOT)
{
ReorderBufferXidSetCatalogChanges(ctx->reorder, xid,
buf->origptr);
break;
}

3===
@@ -416,7 +423,22 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
*/
if (SnapBuildCurrentState(builder) < SNAPBUILD_FULL_SNAPSHOT ||
ctx->fast_forward)
+ {
+ /*
+ * Note that during or after BUILDING_SNAPSHOT, we need handle the xlog
+ * that might mark a transaction as catalog modifying because the snapshot
+ * only tracks catalog modifying transactions. The transaction before
+ * BUILDING_SNAPSHOT will not be tracked anyway(see SnapBuildCommitTxn()
+ * for details), so just return.
+ */
+ if (SnapBuildCurrentState(builder) >= SNAPBUILD_BUILDING_SNAPSHOT)
+ {
+ /* Currently only XLOG_HEAP2_NEW_CID means a catalog modifying */
+ if (info == XLOG_HEAP2_NEW_CID && TransactionIdIsValid(xid))
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
return;
+ }
It should be removed from v18 and master because commit 243e9b4 intrduced.

4===
IMUC, v4-0001 patch is to only mark transactions with catalog changes during the buildsnapshot phase, 
without performing any other actions.  I’m wondering if we should make the behavior of snapshot building
the same in both SNAPBUILD_BUILDING_SNAPSHOT and SNAPBUILD_FULL_SNAPSHOT states,
except in cases where a base snapshot is needed. If so, the attached patch is for the master version.

Regards,
Haiyang Li
Attachment

pgsql-bugs by date:

Previous
From: Thomas Munro
Date:
Subject: Re: [EXTERNAL] Re: BUG #19114: ORDER BY ASC is tampering result when calculating distance btw vectors
Next
From: "cca5507"
Date:
Subject: Re: BUG #19109: catalog lookup with the wrong snapshotduringlogical decoding causes coredump