RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder() - Mailing list pgsql-bugs

From Hayato Kuroda (Fujitsu)
Subject RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
Date
Msg-id TYCPR01MB12077369E4B9B34979378F435F52B2@TYCPR01MB12077.jpnprd01.prod.outlook.com
In response to Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()  (ocean_li_996 <ocean_li_996@163.com>)
Responses Re:RE: Re:RE: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
List pgsql-bugs
Dear Haiyang,

Thanks for checking! This reply still focuses only on "Issue 2" in your notation.

>## Issue 2
>Inspired by your spec case, I've reorganized the spec case provided in [2]. The new test in attachment
>is  able to reproduce the issue mentioned in [1] even before commit 6b77048e5.

Good finding. I also confirmed that the workload fails after reverting 6b77048e5,
and that the patch [1] fixes the workload as well.

permutation "s0_init" "s0_begin" "s0_savepoint" "s0_create_part1" "s0_savepoint_release"
                     "s2_init" "s1_checkpoint" "s1_get_changes" "s0_commit" "s2_get_changes"

## Analysis

The key point is that a snapshot serialized by another replication slot can be
reused. When the first get_changes is called, a consistent snapshot can be
serialized because of the XLOG_RUNNING_XACTS record (see below).
The get_changes call for the second slot reuses that snapshot, so it can read
the WAL records properly.
(If the first slot did not exist, the snapshot state would be
SNAPBUILD_BUILDING_SNAPSHOT, so no records would be read.)

In the second get_changes, the records below are read. The first
(LOCK, RUNNING_XACTS) pair is generated by the slot creation, and the second
pair comes from the CHECKPOINT. I.e., it reads all records since the slot was
created.

```
...lsn: 0/01906DB8, prev 0/01906D58, desc: LOCK ...
...lsn: 0/01906DF0, prev 0/01906DB8, desc: RUNNING_XACTS ...
...lsn: 0/01906E30, prev 0/01906DF0, desc: LOCK ...
...lsn: 0/01906E68, prev 0/01906E30, desc: RUNNING_XACTS ...
...lsn: 0/01906EA8, prev 0/01906E68, desc: CHECKPOINT_ONLINE ...
...lsn: 0/01906F20, prev 0/01906EA8, desc: COMMIT ... subxacts: 728; ... inval msgs: ...
```
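
As a quick sanity check on the listing, here is a small Python sketch (not
PostgreSQL code; the helper names are only for illustration). It converts the
textual LSNs to integers, checks that each record's prev pointer matches the
preceding record, and picks out the two (LOCK, RUNNING_XACTS) pairs.

```python
def parse_lsn(text: str) -> int:
    """Convert an 'X/Y' LSN string (two hex halves) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# (lsn, prev, desc) triples copied from the listing above.
records = [
    ("0/01906DB8", "0/01906D58", "LOCK"),
    ("0/01906DF0", "0/01906DB8", "RUNNING_XACTS"),
    ("0/01906E30", "0/01906DF0", "LOCK"),
    ("0/01906E68", "0/01906E30", "RUNNING_XACTS"),
    ("0/01906EA8", "0/01906E68", "CHECKPOINT_ONLINE"),
    ("0/01906F20", "0/01906EA8", "COMMIT"),
]


def chain_is_contiguous(recs):
    """True if every record's prev pointer equals the preceding record's LSN."""
    return all(
        parse_lsn(cur[1]) == parse_lsn(before[0])
        for before, cur in zip(recs, recs[1:])
    )


def lock_running_xacts_pairs(recs):
    """Return the LSNs of each LOCK record directly followed by RUNNING_XACTS."""
    return [
        (a[0], b[0])
        for a, b in zip(recs, recs[1:])
        if a[2] == "LOCK" and b[2] == "RUNNING_XACTS"
    ]
```

With the records above, the chain is contiguous and exactly two pairs are
found, which matches the "slot creation" and "CHECKPOINT" pairs described.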

Also, the final COMMIT record contains the info for a subtransaction and has the
XACT_XINFO_HAS_INVALS flag, so DecodeCommit()->SnapBuildXidSetCatalogChanges()
is called for the transactions.

As a result, two ReorderBufferTXNs are created with the same LSN, which fails
the assertion in AssertTXNLsnOrder().
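
To make the failure condition concrete, below is a simplified Python model of
the ordering check. It is only a sketch, not the PostgreSQL source: the class
and function names are hypothetical, the top-level xid 727 is made up (the
listing elides it), and I assume for illustration that both entries get the LSN
of the COMMIT record. The only rule modeled is that the start LSNs of the
entries in the top-level list must be strictly increasing.

```python
from dataclasses import dataclass

INVALID_LSN = 0  # stand-in for InvalidXLogRecPtr


@dataclass
class Txn:
    """Hypothetical, stripped-down stand-in for a ReorderBufferTXN."""
    xid: int
    first_lsn: int


def lsn_order_ok(toplevel_by_lsn):
    """Return True if start LSNs are valid and strictly increasing."""
    prev_first_lsn = INVALID_LSN
    for txn in toplevel_by_lsn:
        # Every entry must have a start LSN.
        if txn.first_lsn == INVALID_LSN:
            return False
        # The start LSN must be strictly higher than the previous one.
        if prev_first_lsn != INVALID_LSN and not prev_first_lsn < txn.first_lsn:
            return False
        prev_first_lsn = txn.first_lsn
    return True


COMMIT_LSN = 0x01906F20   # 0/01906F20 from the listing above
TOP_XID = 727             # hypothetical; the real value is not shown
SUB_XID = 728             # subxact xid from the COMMIT record

# Both entries created at the same LSN: the strict ordering is violated.
same_lsn = [Txn(TOP_XID, COMMIT_LSN), Txn(SUB_XID, COMMIT_LSN)]

# For contrast, strictly increasing start LSNs pass the check.
increasing = [Txn(TOP_XID, 0x01906E30), Txn(SUB_XID, COMMIT_LSN)]
```

Here lsn_order_ok(same_lsn) is False while lsn_order_ok(increasing) is True,
because the comparison is strict and equal LSNs do not satisfy it.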

I will update the patch if the above analysis is correct.

>The approach in [3] is also LGFM.

Thanks. I agree that we should avoid relaxing the condition of the Assert() as
much as possible.

[1]:
https://www.postgresql.org/message-id/TYCPR01MB1207790E98F0A563280CC39FCF5262%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/



