Re: test_decoding assertion failure for the loss of top-sub transaction relationship - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: test_decoding assertion failure for the loss of top-sub transaction relationship
Date
Msg-id CAA4eK1LK1nxOTL32OP=ejhPoBsUP4Bvwb3Ly=fethyJ-KbaXyw@mail.gmail.com
Whole thread Raw
In response to test_decoding assertion failure for the loss of top-sub transaction relationship  ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>)
List pgsql-hackers
On Fri, Sep 2, 2022 at 6:26 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
>
>
> I've met an assertion failure of logical decoding with below scenario on HEAD.
>
> ---
> <preparation>
> create table tab1 (val integer);
> select 'init' from  pg_create_logical_replication_slot('regression_slot', 'test_decoding');
>
> <session1>
> begin;
> savepoint sp1;
> insert into tab1 values (1);
>
> <session2>
> checkpoint; -- for RUNNING_XACT
> select data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts',
'1');
>
> <session1>
> truncate tab1; -- for NEW_CID
> commit;
> begin;
> insert into tab1 values (3);
>

BTW, if I just change the truncate statement to "Analyze tab1" in your
entire test then I am getting a different assertion failure:

postgres.exe!ExceptionalCondition(const char * conditionName, const
char * errorType, const char * fileName, int lineNumber) Line 70 C
postgres.exe!AssertTXNLsnOrder(ReorderBuffer * rb) Line 902 C
postgres.exe!ReorderBufferTXNByXid(ReorderBuffer * rb, unsigned int
xid, bool create, bool * is_new, unsigned __int64 lsn, bool
create_as_top) Line 681 C
postgres.exe!ReorderBufferAddNewTupleCids(ReorderBuffer * rb, unsigned
int xid, unsigned __int64 lsn, RelFileLocator locator, ItemPointerData
tid, unsigned int cmin, unsigned int cmax, unsigned int combocid) Line
3188 C
postgres.exe!SnapBuildProcessNewCid(SnapBuild * builder, unsigned int
xid, unsigned __int64 lsn, xl_heap_new_cid * xlrec) Line 823 C
postgres.exe!heap2_decode(LogicalDecodingContext * ctx,
XLogRecordBuffer * buf) Line 408 C
postgres.exe!LogicalDecodingProcessRecord(LogicalDecodingContext *
ctx, XLogReaderState * record) Line 119 C
postgres.exe!pg_logical_slot_get_changes_guts(FunctionCallInfoBaseData
* fcinfo, bool confirm, bool binary) Line 274 C
postgres.exe!pg_logical_slot_get_changes(FunctionCallInfoBaseData *
fcinfo) Line 339 C

This is matching with call stack we see intermittently in the BF
[1][2]. The difference with your scenario is that the Truncate
statement generates an additional WAL XLOG_STANDBY_LOCK prior to
XLOG_HEAP2_NEW_CID.

I think we can fix this in the below ways:
a. Assert(prev_first_lsn <= cur_txn->first_lsn); -- Explain in
comments that it is possible when subtransaction and transaction are
not previously logged as it happened in this scenario
b. track txn of prev_first_lsn (say as prev_txn) and check if
prev_txn's toptxn is the same as cur_txn or cur_txn's toptxn is the
same as the prev_txn then perform assert mentioned in (a) else, keep
the current Assert.

It seems (b) will be more robust.

Thoughts?

Note: I added Sawada-San as sometime back we had an offlist discussion
on this intermittent BF failure but we were not able to reach the
exact test which can show this failure.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-08-20%2002%3A45%3A34
[2] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-02-25%2018%3A50%3A09

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: postgres_fdw hint messages
Next
From: Amit Kapila
Date:
Subject: Re: win_flex.exe (and likely win_bison.exe) isn't concurrency safe