Thread: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

Hi,

I've been running some valgrind tests on rpi4/aarch64, and I get a crash
in test_decoding ddl test in ~50% runs. I don't see the same failure
without valgrind or on 32-bit system (hundreds of runs, no crashes), so
I suspect this is a race condition, and with valgrind the timing changes
in a way to make it more likely.

The crash always happens in the "ddl" test. The backtrace always looks
like this:

    (ExceptionalCondition+0x98)[0x8f6f7c]
    (+0x57a7ec)[0x6827ec]
    (+0x579edc)[0x681edc]
    (ReorderBufferAddNewTupleCids+0x60)[0x686758]
    (SnapBuildProcessNewCid+0x94)[0x68b920]
    (heap2_decode+0x17c)[0x671584]
    (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
    (+0x570f88)[0x678f88]
    (pg_logical_slot_get_changes+0x1c)[0x6790fc]
    (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
    (+0x3be638)[0x4c6638]
    (+0x3a2c14)[0x4aac14]
    (ExecScan+0x8c)[0x4aaca8]
    (+0x3bea14)[0x4c6a14]
    (+0x39ea60)[0x4a6a60]
    (+0x392378)[0x49a378]
    (+0x39520c)[0x49d20c]
    (standard_ExecutorRun+0x214)[0x49aad8]
    (ExecutorRun+0x64)[0x49a8b8]
    (+0x62f53c)[0x73753c]
    (PortalRun+0x27c)[0x737198]
    (+0x627e78)[0x72fe78]
    (PostgresMain+0x9a0)[0x73512c]
    (+0x547be8)[0x64fbe8]
    (+0x547540)[0x64f540]
    (+0x542d30)[0x64ad30]
    (PostmasterMain+0x1460)[0x64a574]
    (+0x418888)[0x520888]

I'm unable to get a better backtrace from the valgrind-produced core
using gdb, for some reason.

However, I've modified AssertTXNLsnOrder() - which is where the assert
is checked - to also dump toplevel_by_lsn instead of just triggering the
assert, and the result is always like this:

    WARNING:  ==============================================
    WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
    WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
    WARNING:  ==============================================

The LSNs change a bit between the runs, but the failing transactions are
always 848 and 849. Also, both transactions have exactly the same info.
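
For reference, the dump above comes from a throwaway tweak in
AssertTXNLsnOrder(), roughly along these lines (just a sketch from
memory of what the tweak amounts to, not the exact diff):

    dlist_iter  iter;

    elog(WARNING, "==============================================");
    dlist_foreach(iter, &rb->toplevel_by_lsn)
    {
        ReorderBufferTXN *txn = dlist_container(ReorderBufferTXN, node,
                                                iter.cur);

        elog(WARNING, "txn xid %u top %u first " UINT64_FORMAT " %X/%X final " UINT64_FORMAT " %X/%X",
             txn->xid, txn->toplevel_xid,
             (uint64) txn->first_lsn, LSN_FORMAT_ARGS(txn->first_lsn),
             (uint64) txn->final_lsn, LSN_FORMAT_ARGS(txn->final_lsn));
    }
    elog(WARNING, "==============================================");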

But the very first WAL record for 849 is

    ASSIGNMENT xtop 848: subxacts: 849

so it's strange 849 is in the toplevel_by_lsn list at all, because it
clearly is a subxact of 848.

Furthermore, the WAL is almost exactly the same in both cases. Attached
are two dumps from a failed and successful run (only the part related to
these two xids is included). There are very few differences - there is a
PRUNE in the failed case, and a LOCK / RUNNING_XACTS moved a bit.


Any ideas?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
> in test_decoding ddl test in ~50% runs. I don't see the same failure
> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
> I suspect this is a race condition, and with valgrind the timing changes
> in a way to make it more likely.
>
> The crash always happens in the "ddl" test. The backtrace always looks
> like this:
>
>     (ExceptionalCondition+0x98)[0x8f6f7c]
>     (+0x57a7ec)[0x6827ec]
>     (+0x579edc)[0x681edc]
>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>     (heap2_decode+0x17c)[0x671584]
>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>     (+0x570f88)[0x678f88]
>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>     (+0x3be638)[0x4c6638]
>     (+0x3a2c14)[0x4aac14]
>     (ExecScan+0x8c)[0x4aaca8]
>     (+0x3bea14)[0x4c6a14]
>     (+0x39ea60)[0x4a6a60]
>     (+0x392378)[0x49a378]
>     (+0x39520c)[0x49d20c]
>     (standard_ExecutorRun+0x214)[0x49aad8]
>     (ExecutorRun+0x64)[0x49a8b8]
>     (+0x62f53c)[0x73753c]
>     (PortalRun+0x27c)[0x737198]
>     (+0x627e78)[0x72fe78]
>     (PostgresMain+0x9a0)[0x73512c]
>     (+0x547be8)[0x64fbe8]
>     (+0x547540)[0x64f540]
>     (+0x542d30)[0x64ad30]
>     (PostmasterMain+0x1460)[0x64a574]
>     (+0x418888)[0x520888]
>
> I'm unable to get a better backtrace from the valgrind-produced core
> using gdb, for some reason.
>
> However, I've modified AssertTXNLsnOrder() - which is where the assert
> is checked - to also dump toplevel_by_lsn instead of just triggering the
> assert, and the result is always like this:
>
>     WARNING:  ==============================================
>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>     WARNING:  ==============================================
>
> The LSNs change a bit between the runs, but the failing transactions are
> always 848 and 849. Also, both transactions have exactly the same info.
>
> But the very first WAL record for 849 is
>
>     ASSIGNMENT xtop 848: subxacts: 849
>
> so it's strange 849 is in the toplevel_by_lsn list at all, because it
> clearly is a subxact of 848.
>

There is no guarantee that toplevel_by_lsn won't have a subxact. As per
my understanding, the problem I reported in the email [1] is the same
one, and we have seen it in BF failures as well. I posted a way to
reproduce it in that email. It seems this is possible if the decoding
gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
subtransaction) after XLOG_RUNNING_XACTS.

[1] - https://www.postgresql.org/message-id/CAA4eK1LK1nxOTL32OP%3DejhPoBsUP4Bvwb3Ly%3DfethyJ-KbaXyw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




On 9/4/22 13:49, Amit Kapila wrote:
> On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
>> in test_decoding ddl test in ~50% runs. I don't see the same failure
>> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
>> I suspect this is a race condition, and with valgrind the timing changes
>> in a way to make it more likely.
>>
>> The crash always happens in the "ddl" test. The backtrace always looks
>> like this:
>>
>>     (ExceptionalCondition+0x98)[0x8f6f7c]
>>     (+0x57a7ec)[0x6827ec]
>>     (+0x579edc)[0x681edc]
>>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>>     (heap2_decode+0x17c)[0x671584]
>>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>>     (+0x570f88)[0x678f88]
>>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>>     (+0x3be638)[0x4c6638]
>>     (+0x3a2c14)[0x4aac14]
>>     (ExecScan+0x8c)[0x4aaca8]
>>     (+0x3bea14)[0x4c6a14]
>>     (+0x39ea60)[0x4a6a60]
>>     (+0x392378)[0x49a378]
>>     (+0x39520c)[0x49d20c]
>>     (standard_ExecutorRun+0x214)[0x49aad8]
>>     (ExecutorRun+0x64)[0x49a8b8]
>>     (+0x62f53c)[0x73753c]
>>     (PortalRun+0x27c)[0x737198]
>>     (+0x627e78)[0x72fe78]
>>     (PostgresMain+0x9a0)[0x73512c]
>>     (+0x547be8)[0x64fbe8]
>>     (+0x547540)[0x64f540]
>>     (+0x542d30)[0x64ad30]
>>     (PostmasterMain+0x1460)[0x64a574]
>>     (+0x418888)[0x520888]
>>
>> I'm unable to get a better backtrace from the valgrind-produced core
>> using gdb, for some reason.
>>
>> However, I've modified AssertTXNLsnOrder() - which is where the assert
>> is checked - to also dump toplevel_by_lsn instead of just triggering the
>> assert, and the result is always like this:
>>
>>     WARNING:  ==============================================
>>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>     WARNING:  ==============================================
>>
>> The LSNs change a bit between the runs, but the failing transactions are
>> always 848 and 849. Also, both transactions have exactly the same info.
>>
>> But the very first WAL record for 849 is
>>
>>     ASSIGNMENT xtop 848: subxacts: 849
>>
>> so it's strange 849 is in the toplevel_by_lsn list at all, because it
>> clearly is a subxact of 848.
>>
> 
> There is no guarantee that toplevel_by_lsn won't have subxact.

I don't think that's quite true - toplevel_by_lsn should not contain any
*known* subxacts. Yes, we may initially add a subxact to the list, but
as soon as we get the assignment record, it should be removed. See what
ReorderBufferAssignChild does.
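
Roughly, from memory (details differ a bit across branches), the
relevant part of ReorderBufferAssignChild() is:

    txn = ReorderBufferTXNByXid(rb, xid, true, &new_top, lsn, true);
    subtxn = ReorderBufferTXNByXid(rb, subxid, true, &new_sub, lsn, false);

    if (!new_sub && !rbtxn_is_known_subxact(subtxn))
    {
        /*
         * We already saw this transaction, but initially added it to the
         * list of top-level txns.  Now that we know it's a subtransaction,
         * remove it from there.
         */
        dlist_delete(&subtxn->node);
    }

    /* (the real code returns early if the subxact is already known) */
    subtxn->txn_flags |= RBTXN_IS_SUBXACT;
    subtxn->toplevel_xid = xid;
    dlist_push_tail(&txn->subtxns, &subtxn->subtxn_node);

i.e. as soon as the assignment is processed, the subxact gets pulled off
the toplevel_by_lsn list.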

And in this case the ASSIGNMENT is the first WAL record we get for 849
(in fact, isn't that guaranteed since 7259736a6e?), so we know from the
very beginning 849 is a subxact.


> As per
> my understanding, the problem I reported in the email [1] is the same
> and we have seen this in BF failures as well. I posted a way to
> reproduce it in that email. It seems this is possible if the decoding
> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
> subtransaction) after XLOG_RUNNING_XACTS.
> 

Interesting. That's certainly true for WAL in the crashing case:

rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
58/38; cmin: 1, cmax: 14, combo: 6


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 9/4/22 14:24, Tomas Vondra wrote:
> 
> 
> On 9/4/22 13:49, Amit Kapila wrote:
>> On Sun, Sep 4, 2022 at 4:34 PM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> I've been running some valgrind tests on rpi4/aarch64, and I get a crash
>>> in test_decoding ddl test in ~50% runs. I don't see the same failure
>>> without valgrind or on 32-bit system (hundreds of runs, no crashes), so
>>> I suspect this is a race condition, and with valgrind the timing changes
>>> in a way to make it more likely.
>>>
>>> The crash always happens in the "ddl" test. The backtrace always looks
>>> like this:
>>>
>>>     (ExceptionalCondition+0x98)[0x8f6f7c]
>>>     (+0x57a7ec)[0x6827ec]
>>>     (+0x579edc)[0x681edc]
>>>     (ReorderBufferAddNewTupleCids+0x60)[0x686758]
>>>     (SnapBuildProcessNewCid+0x94)[0x68b920]
>>>     (heap2_decode+0x17c)[0x671584]
>>>     (LogicalDecodingProcessRecord+0xbc)[0x670cd0]
>>>     (+0x570f88)[0x678f88]
>>>     (pg_logical_slot_get_changes+0x1c)[0x6790fc]
>>>     (ExecMakeTableFunctionResult+0x29c)[0x4a92c0]
>>>     (+0x3be638)[0x4c6638]
>>>     (+0x3a2c14)[0x4aac14]
>>>     (ExecScan+0x8c)[0x4aaca8]
>>>     (+0x3bea14)[0x4c6a14]
>>>     (+0x39ea60)[0x4a6a60]
>>>     (+0x392378)[0x49a378]
>>>     (+0x39520c)[0x49d20c]
>>>     (standard_ExecutorRun+0x214)[0x49aad8]
>>>     (ExecutorRun+0x64)[0x49a8b8]
>>>     (+0x62f53c)[0x73753c]
>>>     (PortalRun+0x27c)[0x737198]
>>>     (+0x627e78)[0x72fe78]
>>>     (PostgresMain+0x9a0)[0x73512c]
>>>     (+0x547be8)[0x64fbe8]
>>>     (+0x547540)[0x64f540]
>>>     (+0x542d30)[0x64ad30]
>>>     (PostmasterMain+0x1460)[0x64a574]
>>>     (+0x418888)[0x520888]
>>>
>>> I'm unable to get a better backtrace from the valgrind-produced core
>>> using gdb, for some reason.
>>>
>>> However, I've modified AssertTXNLsnOrder() - which is where the assert
>>> is checked - to also dump toplevel_by_lsn instead of just triggering the
>>> assert, and the result is always like this:
>>>
>>>     WARNING:  ==============================================
>>>     WARNING:  txn xid 849 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>>     WARNING:  txn xid 848 top 0 first 30264752 0/1CDCDB0 final 0 0/0
>>>     WARNING:  ==============================================
>>>
>>> The LSNs change a bit between the runs, but the failing transactions are
>>> always 848 and 849. Also, both transactions have exactly the same info.
>>>
>>> But the very first WAL record for 849 is
>>>
>>>     ASSIGNMENT xtop 848: subxacts: 849
>>>
>>> so it's strange 849 is in the toplevel_by_lsn list at all, because it
>>> clearly is a subxact of 848.
>>>
>>
>> There is no guarantee that toplevel_by_lsn won't have subxact.
> 
> I don't think that's quite true - toplevel_by_lsn should not contain any
> *known* subxacts. Yes, we may initially add a subxact to the list, but
> as soon as we get assignment record, it should be removed. See what
> ReorderBufferAssignChild does.
> 
> And in this case the ASSIGNMENT is the first WAL record we get for 849
> (in fact, isn't that guaranteed since 7259736a6e?), so we know from the
> very beginning 849 is a subxact.
> 
> 
>> As per
>> my understanding, the problem I reported in the email [1] is the same
>> and we have seen this in BF failures as well. I posted a way to
>> reproduce it in that email. It seems this is possible if the decoding
>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
>> subtransaction) after XLOG_RUNNING_XACTS.
>>
> 
> Interesting. That's certainly true for WAL in the crashing case:
> 
> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
> 58/38; cmin: 1, cmax: 14, combo: 6
> 

I investigated using the pgdata from the crashed run (can provide, if
you have rpi4 or some other aarch64 machine), and the reason is pretty
simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
after the subxact assignment, so we add both xids as toplevel.

That seems broken - if we skip the assignment like this, doesn't that
break spill-to-disk and/or streaming? IIRC that's exactly why we had to
start logging assignments immediately with wal_level=logical.

Or maybe we're not dealing with the restart_lsn properly, and we should
have ignored those records. Both xacts started long before the restart
LSN, so we're not seeing the whole xact anyway.

However, when processing the NEW_CID record:

tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6

we ultimately do this in SnapBuildProcessNewCid:

#1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
combocid=6) at reorderbuffer.c:3218
#2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818

so in fact we *know* 849 is a subxact of 848, but we don't call
ReorderBufferAssignChild in this case. In fact we can't even do the
assignment easily in this case, because we create the subxact first; the
crash happens right when we attempt to create the toplevel one, so we
never even get a chance to do the assignment:

1) process the NEW_CID record, logged for 849 (subxact)
2) process CIDs in the WAL record, which has toplevel_xid 848


So IMHO we need to figure out what to do for WAL records that create
both the toplevel and subxact - either we need to skip them, or rethink
how we create the ReorderBufferTXN structs.
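
For reference, that ordering comes straight from SnapBuildProcessNewCid()
(roughly, paraphrasing the current code from memory; exact argument lists
may differ by branch):

    /*
     * The record is decoded under the subxact's xid (849 here), so this
     * creates the subxact as an (apparent) top-level ReorderBufferTXN at
     * this LSN ...
     */
    ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

    /*
     * ... and this then looks up / creates the real top-level txn (848,
     * from xlrec->top_xid) at the very same LSN, which is what trips the
     * prev_first_lsn < cur_txn->first_lsn assert.
     */
    ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
                                 xlrec->target_locator, xlrec->target_tid,
                                 xlrec->cmin, xlrec->cmax, xlrec->combocid);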


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 9/4/22 16:08, Tomas Vondra wrote:
> ...
> 
> so in fact we *know* 849 is a subxact of 848, but we don't call
> ReorderBufferAssignChild in this case. In fact we can't even do the
> assignment easily in this case, because we create the subxact first, so
> that the crash happens right when we attempt to create the toplevel one,
> and we never even get a chance to do the assignment:
> 
> 1) process the NEW_CID record, logged for 849 (subxact)
> 2) process CIDs in the WAL record, which has toplevel_xid 848
> 
> 
> So IMHO we need to figure out what to do for WAL records that create
> both the toplevel and subxact - either we need to skip them, or rethink
> how we create the ReorderBufferTXN structs.
> 

This fixes the crash for me, by adding a ReorderBufferAssignChild call
to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
ensure we don't try to create the top xact before updating the subxact
and removing it from the toplevel_by_lsn list.

Essentially, what's happening is this:

1) We read the NEW_CID record, which is logged with XID 849, i.e. the
subxact. But we don't know it's a subxact, so we create it as a
top-level xact with that LSN.

2) We start processing the contents of the NEW_CID, which however has
info that 849 is a subxact of 848, and call ReorderBufferAddNewTupleCids,
which promptly does ReorderBufferTXNByXid() with the top-level XID; that
creates it with the same LSN and crashes because of the assert.

I'm not sure what's the right/proper way to fix this ...

The problem is ReorderBufferAssignChild was coded in a way that did not
expect the subxact to be created first (as a top-level xact). And
indeed, if I add Assert(false) to the (!new_sub) branch that converts
top-level xact to subxact, check-world still passes. So we never test
this case, but the NEW_CID breaks this assumption and creates them in
the opposite order (i.e. subxact first).

So the patch "fixes" this by

(a) tweaking ReorderBufferAssignChild to first remove the subxact from
the list of top-level transactions

(b) call ReorderBufferAssignChild when processing NEW_CID
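
Roughly, (b) amounts to something like this near the top of
SnapBuildProcessNewCid(), before the existing
ReorderBufferXidSetCatalogChanges() / ReorderBufferAddNewTupleCids()
calls (sketch only, the attached patch has the details):

    /*
     * The NEW_CID record is decoded under the subxact's xid but carries
     * the top-level xid, so record the association right away, before the
     * CIDs get attached to the top-level txn.
     */
    if (!TransactionIdEquals(xid, xlrec->top_xid))
        ReorderBufferAssignChild(builder->reorder, xlrec->top_xid, xid, lsn);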

However, I wonder whether we even have to process these records? If the
restart_lsn is half-way through the xact, can we even decode it? Maybe
we can just skip all of this, somehow? We'd still need to remember that
849 is a subxact of 848, at least, so that we know to skip it too.


Thread [1] suggested relaxing the assert to allow the same LSN, provided
it's a xact and its subxact. That goes directly against the expectation
that the toplevel_by_lsn list contains no known subxacts, and I don't
think we should relax that. After all, just tweaking the LSN check does
not really fix the issue, because not remembering the xact/subxact
relationship is part of the issue. In principle, I think the issue is
exactly the opposite, i.e. that we don't realize 849 is a subxact, and
leave it in the list.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/4/22 14:24, Tomas Vondra wrote:
> >
> >> As per
> >> my understanding, the problem I reported in the email [1] is the same
> >> and we have seen this in BF failures as well. I posted a way to
> >> reproduce it in that email. It seems this is possible if the decoding
> >> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
> >> subtransaction) after XLOG_RUNNING_XACTS.
> >>
> >
> > Interesting. That's certainly true for WAL in the crashing case:
> >
> > rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
> > 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
> > latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
> > rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
> > 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
> > 58/38; cmin: 1, cmax: 14, combo: 6
> >
>
> I investigated using the pgdata from the crashed run (can provide, if
> you have rpi4 or some other aarch64 machine), and the reason is pretty
> simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
> after the subxact assignment, so we add both xids as toplevel.
>
> That seems broken - if we skip the assignment like this, doesn't that
> break spill-to-disk and/or streaming? IIRC that's exactly why we had to
> start logging assignments immediately with wal_level=logical.
>

We had started logging assignments immediately in commit 0bead9af48
for streaming transactions in PG14. This issue exists prior to that. I
have tried and reproduced it in PG13, but I think it will be there even
before that. So, I am not sure if the spilling behavior is broken due
to this. I think if we don't get the assignment recorded before
processing changes while decoding the commit, then we could miss sending
the changes, which won't be the case here. Do you see any other
problem?

> Or maybe we're not dealing with the restart_lsn properly, and we should
> have ignored those records. Both xacts started long before the restart
> LSN, so we're not seeing the whole xact anyway.
>

Right, but is that problematic? The restart LSN will be used as the
point to start reading the WAL, and that helps in building a consistent
snapshot. However, for decoding to send the commit, we use the
start_decoding_at point, which will ensure that we send complete
transactions.

> However, when processing the NEW_CID record:
>
> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
>
> we ultimately do this in SnapBuildProcessNewCid:
>
> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
> combocid=6) at reorderbuffer.c:3218
> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
>
> so in fact we *know* 849 is a subxact of 848, but we don't call
> ReorderBufferAssignChild in this case. In fact we can't even do the
> assignment easily in this case, because we create the subxact first, so
> that the crash happens right when we attempt to create the toplevel one,
> and we never even get a chance to do the assignment:
>
> 1) process the NEW_CID record, logged for 849 (subxact)
> 2) process CIDs in the WAL record, which has toplevel_xid 848
>
>
> So IMHO we need to figure out what to do for WAL records that create
> both the toplevel and subxact - either we need to skip them, or rethink
> how we create the ReorderBufferTXN structs.
>

As per my understanding, we can't skip them as they are used to build
the snapshot.

-- 
With Regards,
Amit Kapila.



On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/4/22 16:08, Tomas Vondra wrote:
> > ...
> >
> > so in fact we *know* 849 is a subxact of 848, but we don't call
> > ReorderBufferAssignChild in this case. In fact we can't even do the
> > assignment easily in this case, because we create the subxact first, so
> > that the crash happens right when we attempt to create the toplevel one,
> > and we never even get a chance to do the assignment:
> >
> > 1) process the NEW_CID record, logged for 849 (subxact)
> > 2) process CIDs in the WAL record, which has toplevel_xid 848
> >
> >
> > So IMHO we need to figure out what to do for WAL records that create
> > both the toplevel and subxact - either we need to skip them, or rethink
> > how we create the ReorderBufferTXN structs.
> >
>
> This fixes the crash for me, by adding a ReorderBufferAssignChild call
> to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
> ensure we don't try to create the top xact before updating the subxact
> and removing it from the toplevel_by_lsn list.
>
> Essentially, what's happening is this:
>
> 1) We read the NEW_CID record, which is logged with XID 849, i.e. the
> subxact. But we don't know it's a subxact, so we create it as a
> top-level xact with the LSN.
>
> 2) We start processing contents of the NEW_CID, which however has info
> that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
> promptly does ReorderBufferTXNByXid() with the top-level XID, which
> creates it with the same LSN, and crashes because of the assert.
>
> I'm not sure what's the right/proper way to fix this ...
>
> The problem is ReorderBufferAssignChild was coded in a way that did not
> expect the subxact to be created first (as a top-level xact).
>

I think there was a previously hard-coded way to detect that, and we
have removed it in commit
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
I think it is possible that a subtransaction gets logged without a
previous top-level txn record, as shown in the commit shared above.

-- 
With Regards,
Amit Kapila.




On 9/5/22 06:32, Amit Kapila wrote:
> On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 9/4/22 14:24, Tomas Vondra wrote:
>>>
>>>> As per
>>>> my understanding, the problem I reported in the email [1] is the same
>>>> and we have seen this in BF failures as well. I posted a way to
>>>> reproduce it in that email. It seems this is possible if the decoding
>>>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
>>>> subtransaction) after XLOG_RUNNING_XACTS.
>>>>
>>>
>>> Interesting. That's certainly true for WAL in the crashing case:
>>>
>>> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
>>> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
>>> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
>>> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
>>> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
>>> 58/38; cmin: 1, cmax: 14, combo: 6
>>>
>>
>> I investigated using the pgdata from the crashed run (can provide, if
>> you have rpi4 or some other aarch64 machine), and the reason is pretty
>> simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
>> after the subxact assignment, so we add both xids as toplevel.
>>
>> That seems broken - if we skip the assignment like this, doesn't that
>> break spill-to-disk and/or streaming? IIRC that's exactly why we had to
>> start logging assignments immediately with wal_level=logical.
>>
> 
> We had started logging assignments immediately in commit 0bead9af48
> for streaming transactions in PG14. This issue exists prior to that. I
> have tried and reproduced it in PG13 but I think it will be there even
> before that. So, I am not sure if the spilling behavior is broken due
> to this. I think if we don't get assignment recording before
> processing changes during decoding commit then we could miss sending
> the changes which won't be the case here. Do you see any other
> problem?
> 

I can't, but that's hardly a proof of anything. You're right that
spilling to disk may not be broken by this, though. I forgot it predates
assignments being logged immediately, so it does not rely on that.

>> Or maybe we're not dealing with the restart_lsn properly, and we should
>> have ignored those records. Both xacts started long before the restart
>> LSN, so we're not seeing the whole xact anyway.
>>
> 
> Right, but is that problematic? The restart LSN will be used as a
> point to start reading the WAL and that helps in building a consistent
> snapshot. However, for decoding to send the commit, we use
> start_decoding_at point which will ensure that we send complete
> transactions.
> 

Which part would not be problematic? There's some sort of a bug, that's
for sure.

I think it's mostly clear we won't output this transaction, because the
restart LSN is half-way through. We can either ignore it at commit time,
and then we have to make everything work in case we miss assignments (or
any other part of the transaction).

Or we can ignore stuff early, and not even process some of the changes.
For example in this case do we need to process the NEW_CID contents for
transaction 848? If we can skip that bit, the problem will disappear.

But maybe this is futile and there are other similar issues ...

>> However, when processing the NEW_CID record:
>>
>> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
>> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
>>
>> we ultimately do this in SnapBuildProcessNewCid:
>>
>> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
>> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
>> combocid=6) at reorderbuffer.c:3218
>> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
>> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
>>
>> so in fact we *know* 849 is a subxact of 848, but we don't call
>> ReorderBufferAssignChild in this case. In fact we can't even do the
>> assignment easily in this case, because we create the subxact first, so
>> that the crash happens right when we attempt to create the toplevel one,
>> and we never even get a chance to do the assignment:
>>
>> 1) process the NEW_CID record, logged for 849 (subxact)
>> 2) process CIDs in the WAL record, which has toplevel_xid 848
>>
>>
>> So IMHO we need to figure out what to do for WAL records that create
>> both the toplevel and subxact - either we need to skip them, or rethink
>> how we create the ReorderBufferTXN structs.
>>
> 
> As per my understanding, we can't skip them as they are used to build
> the snapshot.
> 

Don't we know 848 (the top-level xact) won't be decoded? In that case we
won't need the snapshot, so why build it? Of course, the NEW_CID is
logged with xid 849 and we don't know it's a subxact of 848, but when
processing the NEW_CID content we can realize that (xl_heap_new_cid does
include top_xid).
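
For completeness, xl_heap_new_cid looks roughly like this in
heapam_xlog.h (on master; older branches have target_node / RelFileNode
instead of target_locator):

    typedef struct xl_heap_new_cid
    {
        /*
         * store the toplevel xid so we don't have to merge cids from
         * different transactions
         */
        TransactionId top_xid;
        CommandId     cmin;
        CommandId     cmax;
        CommandId     combocid;

        /* the relfilelocator/ctid pair, to facilitate lookups */
        RelFileLocator  target_locator;
        ItemPointerData target_tid;
    } xl_heap_new_cid;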


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/5/22 06:32, Amit Kapila wrote:
> > On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> On 9/4/22 14:24, Tomas Vondra wrote:
> >>>
> >>>> As per
> >>>> my understanding, the problem I reported in the email [1] is the same
> >>>> and we have seen this in BF failures as well. I posted a way to
> >>>> reproduce it in that email. It seems this is possible if the decoding
> >>>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
> >>>> subtransaction) after XLOG_RUNNING_XACTS.
> >>>>
> >>>
> >>> Interesting. That's certainly true for WAL in the crashing case:
> >>>
> >>> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
> >>> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
> >>> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
> >>> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
> >>> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
> >>> 58/38; cmin: 1, cmax: 14, combo: 6
> >>>
> >>
> >> I investigated using the pgdata from the crashed run (can provide, if
> >> you have rpi4 or some other aarch64 machine), and the reason is pretty
> >> simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
> >> after the subxact assignment, so we add both xids as toplevel.
> >>
> >> That seems broken - if we skip the assignment like this, doesn't that
> >> break spill-to-disk and/or streaming? IIRC that's exactly why we had to
> >> start logging assignments immediately with wal_level=logical.
> >>
> >
> > We had started logging assignments immediately in commit 0bead9af48
> > for streaming transactions in PG14. This issue exists prior to that. I
> > have tried and reproduced it in PG13 but I think it will be there even
> > before that. So, I am not sure if the spilling behavior is broken due
> > to this. I think if we don't get assignment recording before
> > processing changes during decoding commit then we could miss sending
> > the changes which won't be the case here. Do you see any other
> > problem?
> >
>
> I can't, but that's hardly a proof of anything. You're right spilling to
> disk may not be broken by this, though. I forgot it precedes assignments
> being logged immediately, so it does not rely on that.
>
> >> Or maybe we're not dealing with the restart_lsn properly, and we should
> >> have ignored those records. Both xacts started long before the restart
> >> LSN, so we're not seeing the whole xact anyway.
> >>
> >
> > Right, but is that problematic? The restart LSN will be used as a
> > point to start reading the WAL and that helps in building a consistent
> > snapshot. However, for decoding to send the commit, we use
> > start_decoding_at point which will ensure that we send complete
> > transactions.
> >
>
> Which part would not be problematic? There's some sort of a bug, that's
> for sure.
>

It is possible that there is some other problem here that I am
missing. But at this stage, I don't see anything wrong other than the
assertion you have reported.

> I think it's mostly clear we won't output this transaction, because the
> restart LSN is half-way through. We can either ignore it at commit time,
> and then we have to make everything work in case we miss assignments (or
> any other part of the transaction).
>

Note, traditionally, we only form these assignments at commit time
after deciding whether to skip such commits. So, ideally, there
shouldn't be any fundamental problem with not making these
associations before deciding whether we need to replay (send
downstream) any particular transaction.

> Or we can ignore stuff early, and not even process some of the changes.
> For example in this case do we need to process the NEW_CID contents for
> transaction 848? If we can skip that bit, the problem will disappear.
>
> But maybe this is futile and there are other similar issues ...
>
> >> However, when processing the NEW_CID record:
> >>
> >> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
> >> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
> >>
> >> we ultimately do this in SnapBuildProcessNewCid:
> >>
> >> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
> >> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
> >> combocid=6) at reorderbuffer.c:3218
> >> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
> >> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
> >>
> >> so in fact we *know* 849 is a subxact of 848, but we don't call
> >> ReorderBufferAssignChild in this case. In fact we can't even do the
> >> assignment easily in this case, because we create the subxact first, so
> >> that the crash happens right when we attempt to create the toplevel one,
> >> and we never even get a chance to do the assignment:
> >>
> >> 1) process the NEW_CID record, logged for 849 (subxact)
> >> 2) process CIDs in the WAL record, which has toplevel_xid 848
> >>
> >>
> >> So IMHO we need to figure out what to do for WAL records that create
> >> both the toplevel and subxact - either we need to skip them, or rethink
> >> how we create the ReorderBufferTXN structs.
> >>
> >
> > As per my understanding, we can't skip them as they are used to build
> > the snapshot.
> >
>
> Don't we know 848 (the top-level xact) won't be decoded? In that case we
> won't need the snapshot, so why build it?
>

But this transaction id can be part of the committed.xip array if it has
made any catalog changes. We add the transaction/subtransaction to
this array before deciding whether to skip decoding/replay of its
commit.

-- 
With Regards,
Amit Kapila.




On 9/5/22 08:35, Amit Kapila wrote:
> On Sun, Sep 4, 2022 at 11:10 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 9/4/22 16:08, Tomas Vondra wrote:
>>> ...
>>>
>>> so in fact we *know* 849 is a subxact of 848, but we don't call
>>> ReorderBufferAssignChild in this case. In fact we can't even do the
>>> assignment easily in this case, because we create the subxact first, so
>>> that the crash happens right when we attempt to create the toplevel one,
>>> and we never even get a chance to do the assignment:
>>>
>>> 1) process the NEW_CID record, logged for 849 (subxact)
>>> 2) process CIDs in the WAL record, which has toplevel_xid 848
>>>
>>>
>>> So IMHO we need to figure out what to do for WAL records that create
>>> both the toplevel and subxact - either we need to skip them, or rethink
>>> how we create the ReorderBufferTXN structs.
>>>
>>
>> This fixes the crash for me, by adding a ReorderBufferAssignChild call
>> to SnapBuildProcessNewCid, and tweaking ReorderBufferAssignChild to
>> ensure we don't try to create the top xact before updating the subxact
>> and removing it from the toplevel_by_lsn list.
>>
>> Essentially, what's happening is this:
>>
>> 1) We read the NEW_CID record, which is logged with XID 849, i.e. the
>> subxact. But we don't know it's a subxact, so we create it as a
>> top-level xact with the LSN.
>>
>> 2) We start processing contents of the NEW_CID, which however has info
>> that 849 is subxact of 848, calls ReorderBufferAddNewTupleCids which
>> promptly does ReorderBufferTXNByXid() with the top-level XID, which
>> creates it with the same LSN, and crashes because of the assert.
>>
>> I'm not sure what's the right/proper way to fix this ...
>>
>> The problem is ReorderBufferAssignChild was coded in a way that did not
>> expect the subxact to be created first (as a top-level xact).
>>
> 
> I think there was a previously hard-coded way to detect that and we
> have removed it in commit
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=e3ff789acfb2754cd7b5e87f6f4463fd08e35996).
> I think it is possible that subtransaction gets logged without
> previous top-level txn record as shown in the commit shared.
> 

Well, yes and no.

This wouldn't detect the issue, because the assert happens in the first
ReorderBufferTXNByXid(), so it still crashes (in an assert-enabled
build, at least).

Maybe removing the assumption was the wrong thing, and we should have
changed the code so that we don't violate it? That's kinda what my "fix"
does, in a way.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




On 9/5/22 12:12, Amit Kapila wrote:
> On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 9/5/22 06:32, Amit Kapila wrote:
>>> On Sun, Sep 4, 2022 at 7:38 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>> On 9/4/22 14:24, Tomas Vondra wrote:
>>>>>
>>>>>> As per
>>>>>> my understanding, the problem I reported in the email [1] is the same
>>>>>> and we have seen this in BF failures as well. I posted a way to
>>>>>> reproduce it in that email. It seems this is possible if the decoding
>>>>>> gets XLOG_HEAP2_NEW_CID as the first record (belonging to a
>>>>>> subtransaction) after XLOG_RUNNING_XACTS.
>>>>>>
>>>>>
>>>>> Interesting. That's certainly true for WAL in the crashing case:
>>>>>
>>>>> rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn:
>>>>> 0/01CDCD70, prev 0/01CDCD10, desc: RUNNING_XACTS nextXid 850
>>>>> latestCompletedXid 847 oldestRunningXid 848; 1 xacts: 848
>>>>> rmgr: Heap2       len (rec/tot):     60/    60, tx:        849, lsn:
>>>>> 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel 1663/16384/1249; tid
>>>>> 58/38; cmin: 1, cmax: 14, combo: 6
>>>>>
>>>>
>>>> I investigated using the pgdata from the crashed run (can provide, if
>>>> you have rpi4 or some other aarch64 machine), and the reason is pretty
>>>> simple - the restart_lsn for the slot is 0/1CDCD70, which is looong
>>>> after the subxact assignment, so we add both xids as toplevel.
>>>>
>>>> That seems broken - if we skip the assignment like this, doesn't that
>>>> break spill-to-disk and/or streaming? IIRC that's exactly why we had to
>>>> start logging assignments immediately with wal_level=logical.
>>>>
>>>
>>> We had started logging assignments immediately in commit 0bead9af48
>>> for streaming transactions in PG14. This issue exists prior to that. I
>>> have tried and reproduced it in PG13 but I think it will be there even
>>> before that. So, I am not sure if the spilling behavior is broken due
>>> to this. I think if we don't get assignment recording before
>>> processing changes during decoding commit then we could miss sending
>>> the changes which won't be the case here. Do you see any other
>>> problem?
>>>
>>
>> I can't, but that's hardly a proof of anything. You're right spilling to
>> disk may not be broken by this, though. I forgot it precedes assignments
>> being logged immediately, so it does not rely on that.
>>
>>>> Or maybe we're not dealing with the restart_lsn properly, and we should
>>>> have ignored those records. Both xacts started long before the restart
>>>> LSN, so we're not seeing the whole xact anyway.
>>>>
>>>
>>> Right, but is that problematic? The restart LSN will be used as a
>>> point to start reading the WAL and that helps in building a consistent
>>> snapshot. However, for decoding to send the commit, we use
>>> start_decoding_at point which will ensure that we send complete
>>> transactions.
>>>
>>
>> Which part would not be problematic? There's some sort of a bug, that's
>> for sure.
>>
> 
> It is possible that there is some other problem here that I am
> missing. But at this stage, I don't see anything wrong other than the
> assertion you have reported.
> 

I'm not sure I agree with that. I'm not convinced the assert is at
fault; it might easily be that it hints at a logic bug somewhere.

>> I think it's mostly clear we won't output this transaction, because the
>> restart LSN is half-way through. We can either ignore it at commit time,
>> and then we have to make everything work in case we miss assignments (or
>> any other part of the transaction).
>>
> 
> Note, traditionally, we only form these assignments at commit time
> after deciding whether to skip such commits. So, ideally, there
> shouldn't be any fundamental problem with not making these
> associations before deciding whether we need to replay (send
> downstream) any particular transaction.
> 

Isn't that self-contradictory? Either we form these assignments at
commit time, or we support streaming (in which case it clearly can't
happen at commit time). AFAICS that's exactly why we started logging
(and processing) assignments immediately, no?

>> Or we can ignore stuff early, and not even process some of the changes.
>> For example in this case do we need to process the NEW_CID contents for
>> transaction 848? If we can skip that bit, the problem will disappear.
>>
>> But maybe this is futile and there are other similar issues ...
>>
>>>> However, when processing the NEW_CID record:
>>>>
>>>> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
>>>> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
>>>>
>>>> we ultimately do this in SnapBuildProcessNewCid:
>>>>
>>>> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
>>>> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
>>>> combocid=6) at reorderbuffer.c:3218
>>>> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
>>>> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
>>>>
>>>> so in fact we *know* 849 is a subxact of 848, but we don't call
>>>> ReorderBufferAssignChild in this case. In fact we can't even do the
>>>> assignment easily in this case, because we create the subxact first, so
>>>> that the crash happens right when we attempt to create the toplevel one,
>>>> and we never even get a chance to do the assignment:
>>>>
>>>> 1) process the NEW_CID record, logged for 849 (subxact)
>>>> 2) process CIDs in the WAL record, which has toplevel_xid 848
>>>>
>>>>
>>>> So IMHO we need to figure out what to do for WAL records that create
>>>> both the toplevel and subxact - either we need to skip them, or rethink
>>>> how we create the ReorderBufferTXN structs.
>>>>
>>>
>>> As per my understanding, we can't skip them as they are used to build
>>> the snapshot.
>>>
>>
>> Don't we know 848 (the top-level xact) won't be decoded? In that case we
>> won't need the snapshot, so why build it?
>>
> 
> But this transaction id can be part of committed.xip array if it has
> made any catalog changes. We add the transaction/subtransaction to
> this array before deciding whether to skip decoding/replay of its
> commit.
> 

Hmm, yeah. It's been a while since I last looked into how we build
snapshots and how we share them between the transactions :-( If we share
the snapshots between transactions, you're probably right we can't just
skip these changes.

However, doesn't that pretty much mean we *have* to do something about
the assignment? I mean, suppose we miss the assignment (like now), so
that we end up with two TXNs that we think are top-level. And then we
get the commit for the actual top-level transaction. AFAICS that won't
clean-up the subxact, and we end up with a lingering TXN.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 9/5/22 12:12, Amit Kapila wrote:
> > On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >
> > It is possible that there is some other problem here that I am
> > missing. But at this stage, I don't see anything wrong other than the
> > assertion you have reported.
> >
>
> I'm not sure I agree with that. I'm not convinced the assert is at
> fault, it might easily be that it hints there's a logic bug somewhere.
>

It is possible but let's try to prove it. I am also keen to know if
this hints at a logic bug somewhere.

> >> I think it's mostly clear we won't output this transaction, because the
> >> restart LSN is half-way through. We can either ignore it at commit time,
> >> and then we have to make everything work in case we miss assignments (or
> >> any other part of the transaction).
> >>
> >
> > Note, traditionally, we only form these assignments at commit time
> > after deciding whether to skip such commits. So, ideally, there
> > shouldn't be any fundamental problem with not making these
> > associations before deciding whether we need to replay (send
> > downstream) any particular transaction.
> >
>
> Isn't that self-contradictory? Either we form these assignments at
> commit time, or we support streaming (in which case it clearly can't
> happen at commit time).
>

I was talking about non-streaming cases, which also have this assert
problem, as seen in this thread. I am intentionally keeping streaming
cases out of this discussion, because the problem happens without them,
and including streaming in the discussion would add another angle to
this problem which may not be required.

> AFAICS that's exactly why we started logging
> (and processing) assignments immediately, no?
>
> >> Or we can ignore stuff early, and not even process some of the changes.
> >> For example in this case do we need to process the NEW_CID contents for
> >> transaction 848? If we can skip that bit, the problem will disappear.
> >>
> >> But maybe this is futile and there are other similar issues ...
> >>
> >>>> However, when processing the NEW_CID record:
> >>>>
> >>>> tx:        849, lsn: 0/01CDCDB0, prev 0/01CDCD70, desc: NEW_CID rel
> >>>> 1663/16384/1249; tid 58/38; cmin: 1, cmax: 14, combo: 6
> >>>>
> >>>> we ultimately do this in SnapBuildProcessNewCid:
> >>>>
> >>>> #1  0x0000005566cccdb4 in ReorderBufferAddNewTupleCids (rb=0x559dd64218,
> >>>> xid=848, lsn=30264752, locator=..., tid=..., cmin=1, cmax=14,
> >>>> combocid=6) at reorderbuffer.c:3218
> >>>> #2  0x0000005566cd1f7c in SnapBuildProcessNewCid (builder=0x559dd6a248,
> >>>> xid=849, lsn=30264752, xlrec=0x559dd6e1e0) at snapbuild.c:818
> >>>>
> >>>> so in fact we *know* 849 is a subxact of 848, but we don't call
> >>>> ReorderBufferAssignChild in this case. In fact we can't even do the
> >>>> assignment easily in this case, because we create the subxact first, so
> >>>> that the crash happens right when we attempt to create the toplevel one,
> >>>> and we never even get a chance to do the assignment:
> >>>>
> >>>> 1) process the NEW_CID record, logged for 849 (subxact)
> >>>> 2) process CIDs in the WAL record, which has toplevel_xid 848
> >>>>
> >>>>
> >>>> So IMHO we need to figure out what to do for WAL records that create
> >>>> both the toplevel and subxact - either we need to skip them, or rethink
> >>>> how we create the ReorderBufferTXN structs.
> >>>>
> >>>
> >>> As per my understanding, we can't skip them as they are used to build
> >>> the snapshot.
> >>>
> >>
> >> Don't we know 848 (the top-level xact) won't be decoded? In that case we
> >> won't need the snapshot, so why build it?
> >>
> >
> > But this transaction id can be part of committed.xip array if it has
> > made any catalog changes. We add the transaction/subtransaction to
> > this array before deciding whether to skip decoding/replay of its
> > commit.
> >
>
> Hmm, yeah. It's been a while since I last looked into how we build
> snapshots and how we share them between the transactions :-( If we share
> the snapshots between transactions, you're probably right we can't just
> skip these changes.
>
> However, doesn't that pretty much mean we *have* to do something about
> the assignment? I mean, suppose we miss the assignment (like now), so
> that we end up with two TXNs that we think are top-level. And then we
> get the commit for the actual top-level transaction. AFAICS that won't
> clean-up the subxact, and we end up with a lingering TXN.
>

I think we will clean up such a subxact. Such a xact should be skipped
via DecodeTXNNeedSkip(), and then it will call ReorderBufferForget()
for each of the subxacts, and that will make sure that we clean up each
of the subtxns.
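
For reference, the skip path in DecodeCommit() is roughly this (from
memory, details may vary by branch):

    if (DecodeTXNNeedSkip(ctx, buf, parsed->dbId, origin_id))
    {
        for (i = 0; i < parsed->nsubxacts; i++)
        {
            /* discard the changes queued for each subxact */
            ReorderBufferForget(ctx->reorder, parsed->subxacts[i], buf->origptr);
        }
        /* ... and for the top-level xact itself */
        ReorderBufferForget(ctx->reorder, xid, buf->origptr);

        return;
    }

so the subxacts don't linger even when the commit is skipped.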

-- 
With Regards,
Amit Kapila.



Hi,

On Tue, Sep 6, 2022 at 3:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 5, 2022 at 5:24 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 9/5/22 12:12, Amit Kapila wrote:
> > > On Mon, Sep 5, 2022 at 12:14 PM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > It is possible that there is some other problem here that I am
> > > missing. But at this stage, I don't see anything wrong other than the
> > > assertion you have reported.
> > >
> >
> > I'm not sure I agree with that. I'm not convinced the assert is at
> > fault, it might easily be that it hints there's a logic bug somewhere.
> >
>
> It is possible but let's try to prove it. I am also keen to know if
> this hints at a logic bug somewhere.
>
> > >> I think it's mostly clear we won't output this transaction, because the
> > >> restart LSN is half-way through. We can either ignore it at commit time,
> > >> and then we have to make everything work in case we miss assignments (or
> > >> any other part of the transaction).
> > >>
> > >
> > > Note, traditionally, we only form these assignments at commit time
> > > after deciding whether to skip such commits. So, ideally, there
> > > shouldn't be any fundamental problem with not making these
> > > associations before deciding whether we need to replay (send
> > > downstream) any particular transaction.

Agreed.

Summarizing this issue, the assertion check in AssertTXNLsnOrder()
fails as reported because the current logical decoding cannot properly
handle the case where the decoding restarts from NEW_CID. Since we
don't make the association between the top-level transaction and its
subtransaction while decoding NEW_CID (i.e., in
SnapBuildProcessNewCid()), two transactions are created in
ReorderBuffer as top-level txns and have the same LSN. This failure
happens on all supported versions.

To fix the problem, one idea is to make the association between
top-txn and sub-txn at that point by calling ReorderBufferAssignChild(),
as Tomas proposed. On the other hand, since we don't guarantee to make
the association between the top-level transaction and its
sub-transactions until we try to decode the actual contents of the
transaction, it makes sense to me that instead of trying to solve this
by making the association, we change the code that assumes the
association has already been made.

I've attached the patch for this idea. With the patch, we skip the
assertion checks in AssertTXNLsnOrder() until we reach the LSN at
which we start decoding the contents of transactions, i.e.
start_decoding_at in SnapBuild. The minor concern is that, the other
way around, the assertion check could then miss some genuinely faulty
cases where two unrelated top-level transactions have the same LSN;
with this patch such a case would pass. Therefore, for transactions
whose check we skipped, we perform the check once we reach that LSN.
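
The reorderbuffer side of the idea is essentially a bail-out at the top
of AssertTXNLsnOrder() (hypothetical sketch only, see the attached patch
for the real thing; SnapBuildGetStartDecodingAt() is the accessor the
patch adds):

    LogicalDecodingContext *ctx = rb->private_data;

    /*
     * Skip the verification until decoding has reached the point where we
     * start decoding the contents of transactions; before that, the
     * top-level/subxact association may not have been made yet and two
     * such txns can share the same first_lsn.
     */
    if (ctx->reader->EndRecPtr < SnapBuildGetStartDecodingAt(ctx->snapshot_builder))
        return;

    /* ... existing checks on toplevel_by_lsn follow unchanged ... */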

Please note that to pass the new regression tests, the fix proposed in
a related thread[1] is required. Particularly, we need:

@@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
        else if (sub_needs_timetravel)
        {
                /* track toplevel txn as well, subxact alone isn't meaningful */
+               elog(DEBUG2, "forced transaction %u to do timetravel due to one of its subtransaction",
+                        xid);
+               needs_timetravel = true;
                SnapBuildAddCommittedTxn(builder, xid);
        }
        else if (needs_timetravel)

A side benefit of this approach is that we can also fix another
assertion failure that happens on REL14 and REL15 and was reported
here[2]. In the commits 68dcce247f1a (REL14) and 272248a0c1 (REL15),
the reason why we make the association between sub-txns and top-txn in
SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
in AssertTXNLsnOrder(). However, since the invalidation messages are
not transported from sub-txn to top-txn during the assignment, another
assertion check in ReorderBufferForget() fails when forgetting the
subtransaction. If we apply this idea of skipping the assertion
checks, we no longer need to make such an association in
SnapBuildXidSetCatalogChanges(), and we resolve this issue as well.

> > >>>
> > >>
> > >> Don't we know 848 (the top-level xact) won't be decoded? In that case we
> > >> won't need the snapshot, so why build it?
> > >>
> > >
> > > But this transaction id can be part of committed.xip array if it has
> > > made any catalog changes. We add the transaction/subtransaction to
> > > this array before deciding whether to skip decoding/replay of its
> > > commit.
> > >
> >
> > Hmm, yeah. It's been a while since I last looked into how we build
> > snapshots and how we share them between the transactions :-( If we share
> > the snapshots between transactions, you're probably right we can't just
> > skip these changes.
> >
> > However, doesn't that pretty much mean we *have* to do something about
> > the assignment? I mean, suppose we miss the assignment (like now), so
> > that we end up with two TXNs that we think are top-level. And then we
> > get the commit for the actual top-level transaction. AFAICS that won't
> > clean-up the subxact, and we end up with a lingering TXN.
> >
>
> I think we will clean up such a subxact. Such a xact should be skipped
> via DecodeTXNNeedSkip() and then it will call ReorderBufferForget()
> for each of the subxacts and that will make sure that we clean up each
> of subtxn's.
>

Right.

Regards,

[1]
https://www.postgresql.org/message-id/TYAPR01MB58666BD6BE24853269624282F5419%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB58660803BCAA7849C8584AA4F57E9%40TYAPR01MB5866.jpnprd01.prod.outlook.com


--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> fails as reported because the current logical decoding cannot properly
> handle the case where the decoding restarts from NEW_CID. Since we
> don't make the association between top-level transaction and its
> subtransaction while decoding NEW_CID (ie, in
> SnapBuildProcessNewCid()), two transactions are created in
> ReorderBuffer as top-txn and have the same LSN. This failure happens
> on all supported versions.
>
> To fix the problem, one idea is that we make the association between
> top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> as Tomas proposed. On the other hand, since we don't guarantee to make
> the association between the top-level transaction and its
> sub-transactions until we try to decode the actual contents of the
> transaction, it makes sense to me that instead of trying to solve by
> making association, we need to change the code which are assuming that
> it is associated.
>
> I've attached the patch for this idea. With the patch, we skip the
> assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> which we start decoding the contents of transaction, ie.
> start_decoding_at in SnapBuild. The minor concern is other way that
> the assertion check could miss some faulty cases where two unrelated
> top-transactions could have same LSN. With this patch, it will pass
> for such a case. Therefore, for transactions that we skipped checking,
> we do the check when we reach the LSN.
>

>
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -113,6 +113,15 @@
LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
XLogReaderState *recor
  buf.origptr);
  }

+#ifdef USE_ASSERT_CHECKING
+ /*
+ * Check the order of transaction LSNs when we reached the start decoding
+ * LSN. See the comments in AssertTXNLsnOrder() for details.
+ */
+ if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
+ AssertTXNLsnOrder(ctx->reorder);
+#endif
+
  rmgr = GetRmgr(XLogRecGetRmid(record));
>

I am not able to think how/when this check will be useful. Because we
skipped assert checking only for records that are prior to
start_decoding_at point, I think for those records ordering should
have been checked before the restart. start_decoding_at point will be
either (a) confirmed_flush location, or (b) lsn sent by client, and
any record prior to that must have been processed before restart.

Now, say we have commit records for multiple transactions that are
after start_decoding_at but all of whose changes are before
start_decoding_at; then we won't check their ordering at commit time,
but OTOH, we would have checked their ordering before the restart.
Isn't that sufficient?

-- 
With Regards,
Amit Kapila.



On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> > fails as reported because the current logical decoding cannot properly
> > handle the case where the decoding restarts from NEW_CID. Since we
> > don't make the association between top-level transaction and its
> > subtransaction while decoding NEW_CID (ie, in
> > SnapBuildProcessNewCid()), two transactions are created in
> > ReorderBuffer as top-txn and have the same LSN. This failure happens
> > on all supported versions.
> >
> > To fix the problem, one idea is that we make the association between
> > top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> > as Tomas proposed. On the other hand, since we don't guarantee to make
> > the association between the top-level transaction and its
> > sub-transactions until we try to decode the actual contents of the
> > transaction, it makes sense to me that instead of trying to solve by
> > making association, we need to change the code which are assuming that
> > it is associated.
> >
> > I've attached the patch for this idea. With the patch, we skip the
> > assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> > which we start decoding the contents of transaction, ie.
> > start_decoding_at in SnapBuild. The minor concern is other way that
> > the assertion check could miss some faulty cases where two unrelated
> > top-transactions could have same LSN. With this patch, it will pass
> > for such a case. Therefore, for transactions that we skipped checking,
> > we do the check when we reach the LSN.
> >
>
> >
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -113,6 +113,15 @@
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *recor
>   buf.origptr);
>   }
>
> +#ifdef USE_ASSERT_CHECKING
> + /*
> + * Check the order of transaction LSNs when we reached the start decoding
> + * LSN. See the comments in AssertTXNLsnOrder() for details.
> + */
> + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> + AssertTXNLsnOrder(ctx->reorder);
> +#endif
> +
>   rmgr = GetRmgr(XLogRecGetRmid(record));
> >
>
> I am not able to think how/when this check will be useful. Because we
> skipped assert checking only for records that are prior to
> start_decoding_at point, I think for those records ordering should
> have been checked before the restart. start_decoding_at point will be
> either (a) confirmed_flush location, or (b) lsn sent by client, and
> any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets an LSN far
ahead, but it's not worth considering that case in this context. I've
updated the patch accordingly.
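
For reference, with the extra check in LogicalDecodingProcessRecord()
dropped, what remains is the skip inside AssertTXNLsnOrder(). A rough
sketch of the idea follows (illustration only; the attached patch is
authoritative). It assumes the ReorderBuffer's private_data points back
to the LogicalDecodingContext, so the snapshot builder and the reader
position are reachable from reorderbuffer.c:

    static void
    AssertTXNLsnOrder(ReorderBuffer *rb)
    {
    #ifdef USE_ASSERT_CHECKING
        LogicalDecodingContext *ctx = rb->private_data;
        XLogRecPtr  prev_first_lsn = InvalidXLogRecPtr;
        dlist_iter  iter;

        /*
         * Skip the verification until we reach the LSN at which we start
         * decoding the contents of transactions. Until then, a top-level
         * transaction and its subtransactions may not be associated yet
         * and can therefore share the same first_lsn. The ordering of the
         * records prior to the start_decoding_at LSN should have been
         * checked before the restart.
         */
        if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr))
            return;

        /* ... the existing per-transaction ordering checks follow unchanged ... */
    #endif
    }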

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Please note that to pass the new regression tests, the fix proposed in
> a related thread[1] is required. Particularly, we need:
>
> @@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
> XLogRecPtr lsn, TransactionId xid,
>         else if (sub_needs_timetravel)
>         {
>                 /* track toplevel txn as well, subxact alone isn't meaningful */
> +               elog(DEBUG2, "forced transaction %u to do timetravel
> due to one of its subtransaction",
> +                        xid);
> +               needs_timetravel = true;
>                 SnapBuildAddCommittedTxn(builder, xid);
>         }
>         else if (needs_timetravel)
>
> A side benefit of this approach is that we can fix another assertion
> failure too that happens on REL14 and REL15 and reported here[2]. In
> the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
> we make the association between sub-txns to top-txn in
> SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
> in AssertTXNLsnOrder(). However, since the invalidation messages are
> not transported from sub-txn to top-txn during the assignment, another
> assertion check in ReorderBufferForget() fails when forgetting the
> subtransaction. If we apply this idea of skipping the assertion
> checks, we no longer need to make the such association in
> SnapBuildXidSetCatalogChanges() and resolve this issue as well.
>

IIUC, here you are speaking of three different changes. Change-1: Add
a check in AssertTXNLsnOrder() to skip assert checking till we reach
start_decoding_at. Change-2: Set needs_timetravel to true in one of
the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
in PG-14/15 as that won't be required after Change-1.

AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
required in HEAD/v15/v14 to fix the problem. Now, the second and third
changes are not required in branches prior to v14 because we don't
record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
we want, we can even back-patch Change-2 and Change-3 to keep the code
consistent or maybe just Change-3.

Is my understanding correct?


-- 
With Regards,
Amit Kapila.



On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Please note that to pass the new regression tests, the fix proposed in
> > a related thread[1] is required. Particularly, we need:
> >
> > @@ -1099,6 +1099,9 @@ SnapBuildCommitTxn(SnapBuild *builder,
> > XLogRecPtr lsn, TransactionId xid,
> >         else if (sub_needs_timetravel)
> >         {
> >                 /* track toplevel txn as well, subxact alone isn't meaningful */
> > +               elog(DEBUG2, "forced transaction %u to do timetravel
> > due to one of its subtransaction",
> > +                        xid);
> > +               needs_timetravel = true;
> >                 SnapBuildAddCommittedTxn(builder, xid);
> >         }
> >         else if (needs_timetravel)
> >
> > A side benefit of this approach is that we can fix another assertion
> > failure too that happens on REL14 and REL15 and reported here[2]. In
> > the commits 68dcce247f1a(REL14) and 272248a0c1(REL15), the reason why
> > we make the association between sub-txns to top-txn in
> > SnapBuildXidSetCatalogChanges() is just to avoid the assertion failure
> > in AssertTXNLsnOrder(). However, since the invalidation messages are
> > not transported from sub-txn to top-txn during the assignment, another
> > assertion check in ReorderBufferForget() fails when forgetting the
> > subtransaction. If we apply this idea of skipping the assertion
> > checks, we no longer need to make the such association in
> > SnapBuildXidSetCatalogChanges() and resolve this issue as well.
> >
>
> IIUC, here you are speaking of three different changes. Change-1: Add
> a check in AssertTXNLsnOrder() to skip assert checking till we reach
> start_decoding_at. Change-2: Set needs_timetravel to true in one of
> the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> in PG-14/15 as that won't be required after Change-1.

Yes.

>
> AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> required in HEAD/v15/v14 to fix the problem.

IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
v14. The reason why we need Change-2 is that there is a case where we
mark only the subtransactions as containing catalog changes while not
doing that for their top-level transaction. In v15 and v14, since we
mark both the subtransactions and the top-level transaction in
SnapBuildXidSetCatalogChanges() as containing catalog changes, we
don't get the assertion failure at "Assert(!needs_snapshot ||
needs_timetravel)".

Regarding Change-3, it's required in v15 and v14 but not in HEAD and
v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
HEAD, Change-3 cannot be applied to the two branches.

> Now, the second and third
> changes are not required in branches prior to v14 because we don't
> record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> we want, we can even back-patch Change-2 and Change-3 to keep the code
> consistent or maybe just Change-3.

Right. I don't think it's a good idea to back-patch Change-2 to
branches prior to v14 as it's not a relevant issue there. Regarding
back-patching Change-3 to branches prior to v14, I think it may be okay
down to v11, but I'd be hesitant for v10 as its final release comes in
a month.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Dear Sawada-san, Amit,

> IIUC Change-2 is required in v16 and HEAD but not mandatory in v15 and
> v14. The reason why we need Change-2 is that there is a case where we
> mark only subtransactions as containing catalog change while not doing
> that for its top-level transaction. In v15 and v14, since we mark both
> subtransactions and top-level transaction in
> SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> don't get the assertion failure at "Assert(!needs_snapshot ||
> needs_timetravel)".

I also agree that Change-2 is needed for HEAD (and v16), but not for v15 and v14.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > IIUC, here you are speaking of three different changes. Change-1: Add
> > a check in AssertTXNLsnOrder() to skip assert checking till we reach
> > start_decoding_at. Change-2: Set needs_timetravel to true in one of
> > the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> > call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> > in PG-14/15 as that won't be required after Change-1.
>
> Yes.
>
> >
> > AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> > required in HEAD/v15/v14 to fix the problem.
>
> IIUC Change-2 is required in v16 and HEAD
>

Why are you referring to v16 and HEAD separately?

> but not mandatory in v15 and
> v14. The reason why we need Change-2 is that there is a case where we
> mark only subtransactions as containing catalog change while not doing
> that for its top-level transaction. In v15 and v14, since we mark both
> subtransactions and top-level transaction in
> SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> don't get the assertion failure at "Assert(!needs_snapshot ||
> needs_timetravel)".
>
> Regarding Change-3, it's required in v15 and v14 but not in HEAD and
> v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
> HEAD, Change-3 cannot be applied to the two branches.
>
> > Now, the second and third
> > changes are not required in branches prior to v14 because we don't
> > record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> > we want, we can even back-patch Change-2 and Change-3 to keep the code
> > consistent or maybe just Change-3.
>
> Right. I don't think it's a good idea to back-patch Change-2 in
> branches prior to v14 as it's not a relevant issue.
>

Fair enough, but then why even back-patch it to v15 and v14?

> Regarding
> back-patching Change-3 to branches prior 14, I think it may be okay
> til v11, but I'd be hesitant for v10 as the final release comes in a
> month.
>

So, to fix the issue in all branches, what we need to do is back-patch
Change-1 to all branches down to v10, Change-2 to HEAD, and Change-3 to
v15 and v14. Additionally, we think it is okay to back-patch Change-3
down to v11, as it is mainly done to avoid the problem fixed by
Change-1 and it keeps the code consistent across the back branches.

I think, because the proposed test case needs all three changes, we can
push Change-1 without a test case and then, as a second patch, have
Change-2 for HEAD and Change-3 for the back branches with the test
case. Do you have any other ideas to proceed here?

-- 
With Regards,
Amit Kapila.



On Tue, Oct 18, 2022 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 6:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 17, 2022 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > IIUC, here you are speaking of three different changes. Change-1: Add
> > > a check in AssertTXNLsnOrder() to skip assert checking till we reach
> > > start_decoding_at. Change-2: Set needs_timetravel to true in one of
> > > the else if branches in SnapBuildCommitTxn(). Change-3: Remove the
> > > call to ReorderBufferAssignChild() from SnapBuildXidSetCatalogChanges
> > > in PG-14/15 as that won't be required after Change-1.
> >
> > Yes.
> >
> > >
> > > AFAIU, Change-1 is required till v10; Change-2 and Change-3 are
> > > required in HEAD/v15/v14 to fix the problem.
> >
> > IIUC Change-2 is required in v16 and HEAD
> >
>
> Why are you referring v16 and HEAD separately?

Sorry, my mistake; I was confused.

>
> > but not mandatory in v15 and
> > v14. The reason why we need Change-2 is that there is a case where we
> > mark only subtransactions as containing catalog change while not doing
> > that for its top-level transaction. In v15 and v14, since we mark both
> > subtransactions and top-level transaction in
> > SnapBuildXidSetCatalogChanges() as containing catalog changes, we
> > don't get the assertion failure at "Assert(!needs_snapshot ||
> > needs_timetravel)".
> >
> > Regarding Change-3, it's required in v15 and v14 but not in HEAD and
> > v16. Since we didn't add SnapBuildXidSetCatalogChanges() to v16 and
> > HEAD, Change-3 cannot be applied to the two branches.
> >
> > > Now, the second and third
> > > changes are not required in branches prior to v14 because we don't
> > > record invalidations via XLOG_XACT_INVALIDATIONS record. However, if
> > > we want, we can even back-patch Change-2 and Change-3 to keep the code
> > > consistent or maybe just Change-3.
> >
> > Right. I don't think it's a good idea to back-patch Change-2 in
> > branches prior to v14 as it's not a relevant issue.
> >
>
> Fair enough but then why to even backpatch it to v15 and v14?

Oops, it's a typo. I wanted to say Change-2 should be back-patched only to HEAD.

>
> > Regarding
> > back-patching Change-3 to branches prior 14, I think it may be okay
> > til v11, but I'd be hesitant for v10 as the final release comes in a
> > month.
> >
>
> So to fix the issue in all branches, what we need to do is to
> backpatch change-1: in all branches till v10, change-2: in HEAD, and
> change-3: in V15 and V14. Additionally, we think, it is okay to
> backpatch change-3 till v11 as it is mainly done to avoid the problem
> fixed by change-1 and it makes code consistent in back branches.

Right.

>
> I think because the test case proposed needs all three changes, we can
> push the change-1 without a test case and then as a second patch have
> change-2 for HEAD and change-3 for back branches with the test case.
> Do you have any other ideas to proceed here?

I found another test case that causes the assertion failure at
"Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
attached the patch for the test case. In this test case, I modified a
user-catalog table instead of a system-catalog table. That way, we don't
generate invalidation messages while generating NEW_CID records. As a
result, we mark only the subtransactions as containing catalog changes
and don't make the association between the top-level and sub
transactions. The assertion failure happens on all supported branches.
If we need to fix this (I believe so), Change-2 needs to be back-patched
to all supported branches.

There are three changes, as Amit mentioned, and regarding the test
cases, we have the three I've attached: truncate_testcase.patch,
analyze_testcase.patch, user_catalog_testcase.patch. The relationship
between the assertion failures and the test cases is rather complex. I
could not find any test case that causes only one assertion failure on
all branches. One idea to proceed is:

Patch-1 includes Change-1 and is applied to all branches.

Patch-2 includes Change-2 and the user_catalog test case, and is
applied to all branches.

Patch-3 includes Change-3 and the truncate test case (or the analyze
test case), and is applied to v14 and v15 (also till v11 if we
prefer).

Patch-1 doesn't include any test case, but the user_catalog test case
can test both Change-1 and Change-2 on all branches. In v15 and v14,
the analyze test case triggers both assertions,
"Assert(txn->ninvalidations == 0);" and "Assert(prev_first_lsn <
cur_txn->first_lsn);", whereas the truncate test case triggers only
"Assert(txn->ninvalidations == 0);". Since Patch-2 is applied on top of
Patch-1, there is no difference in terms of testing Change-2.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > I think because the test case proposed needs all three changes, we can
> > push the change-1 without a test case and then as a second patch have
> > change-2 for HEAD and change-3 for back branches with the test case.
> > Do you have any other ideas to proceed here?
>
> I found another test case that causes the assertion failure at
> "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> attached the patch for the test case. In this test case, I modified a
> user-catalog table instead of system-catalog table. That way, we don't
> generate invalidation messages while generating NEW_CID records. As a
> result, we mark only the subtransactions as containing catalog change
> and don't make association between top-level and sub transactions. The
> assertion failure happens on all supported branches. If we need to fix
> this (I believe so), Change-2 needs to be backpatched to all supported
> branches.
>
> There are three changes as Amit mentioned, and regarding the test
> case, we have three test cases I've attached: truncate_testcase.patch,
> analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> between assertion failures and test cases are very complex. I could
> not find any test case to cause only one assertion failure on all
> branches. One idea to proceed is:
>
> Patch-1 includes Change-1 and is applied to all branches.
>
> Patch-2 includes Change-2 and the user_catalog test case, and is
> applied to all branches.
>
> Patch-3 includes Change-3 and the truncate test case (or the analyze
> test case), and is applied to v14 and v15 (also till v11 if we
> prefer).
>
> The patch-1 doesn't include any test case but the user_catalog test
> case can test both Change-1 and Change-2 on all branches.
>

I was wondering if it makes sense to commit both Change-1 and Change-2
together as one patch? Both assertions are caused by a single test
case and are related to the general problem that the association of
top and sub transaction is only guaranteed to be formed before we
decode transaction changes. Also, it would be good to fix the problem
with a test case that can cause it. What do you think?

-- 
With Regards,
Amit Kapila.



On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > --- a/src/backend/replication/logical/decode.c
> > +++ b/src/backend/replication/logical/decode.c
> > @@ -113,6 +113,15 @@
> > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > XLogReaderState *recor
> >   buf.origptr);
> >   }
> >
> > +#ifdef USE_ASSERT_CHECKING
> > + /*
> > + * Check the order of transaction LSNs when we reached the start decoding
> > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > + */
> > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > + AssertTXNLsnOrder(ctx->reorder);
> > +#endif
> > +
> >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > >
> >
> > I am not able to think how/when this check will be useful. Because we
> > skipped assert checking only for records that are prior to
> > start_decoding_at point, I think for those records ordering should
> > have been checked before the restart. start_decoding_at point will be
> > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > any record prior to that must have been processed before restart.
>
> Good point. I was considering the case where the client sets far ahead
> LSN but it's not worth considering this case in this context. I've
> updated the patch accoringly.
>

One minor comment:
Can we slightly change the comment: ". The ordering of the records
prior to the LSN, we should have been checked before the restart." to
". The ordering of the records prior to the start_decoding_at LSN
should have been checked before the restart."?

-- 
With Regards,
Amit Kapila.



On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > >
> > > I think because the test case proposed needs all three changes, we can
> > > push the change-1 without a test case and then as a second patch have
> > > change-2 for HEAD and change-3 for back branches with the test case.
> > > Do you have any other ideas to proceed here?
> >
> > I found another test case that causes the assertion failure at
> > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > attached the patch for the test case. In this test case, I modified a
> > user-catalog table instead of system-catalog table. That way, we don't
> > generate invalidation messages while generating NEW_CID records. As a
> > result, we mark only the subtransactions as containing catalog change
> > and don't make association between top-level and sub transactions. The
> > assertion failure happens on all supported branches. If we need to fix
> > this (I believe so), Change-2 needs to be backpatched to all supported
> > branches.
> >
> > There are three changes as Amit mentioned, and regarding the test
> > case, we have three test cases I've attached: truncate_testcase.patch,
> > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > between assertion failures and test cases are very complex. I could
> > not find any test case to cause only one assertion failure on all
> > branches. One idea to proceed is:
> >
> > Patch-1 includes Change-1 and is applied to all branches.
> >
> > Patch-2 includes Change-2 and the user_catalog test case, and is
> > applied to all branches.
> >
> > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > test case), and is applied to v14 and v15 (also till v11 if we
> > prefer).
> >
> > The patch-1 doesn't include any test case but the user_catalog test
> > case can test both Change-1 and Change-2 on all branches.
> >
>
> I was wondering if it makes sense to commit both Change-1 and Change-2
> together as one patch? Both assertions are caused by a single test
> case and are related to the general problem that the association of
> top and sub transaction is only guaranteed to be formed before we
> decode transaction changes. Also, it would be good to fix the problem
> with a test case that can cause it. What do you think?

Yeah, it makes sense to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Oct 18, 2022 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 17, 2022 at 7:05 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > --- a/src/backend/replication/logical/decode.c
> > > +++ b/src/backend/replication/logical/decode.c
> > > @@ -113,6 +113,15 @@
> > > LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> > > XLogReaderState *recor
> > >   buf.origptr);
> > >   }
> > >
> > > +#ifdef USE_ASSERT_CHECKING
> > > + /*
> > > + * Check the order of transaction LSNs when we reached the start decoding
> > > + * LSN. See the comments in AssertTXNLsnOrder() for details.
> > > + */
> > > + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> > > + AssertTXNLsnOrder(ctx->reorder);
> > > +#endif
> > > +
> > >   rmgr = GetRmgr(XLogRecGetRmid(record));
> > > >
> > >
> > > I am not able to think how/when this check will be useful. Because we
> > > skipped assert checking only for records that are prior to
> > > start_decoding_at point, I think for those records ordering should
> > > have been checked before the restart. start_decoding_at point will be
> > > either (a) confirmed_flush location, or (b) lsn sent by client, and
> > > any record prior to that must have been processed before restart.
> >
> > Good point. I was considering the case where the client sets far ahead
> > LSN but it's not worth considering this case in this context. I've
> > updated the patch accoringly.
> >
>
> One minor comment:
> Can we slightly change the comment: ". The ordering of the records
> prior to the LSN, we should have been checked before the restart." to
> ". The ordering of the records prior to the start_decoding_at LSN
> should have been checked before the restart."?

Agreed. I'll update the patch accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > >
> > > > I think because the test case proposed needs all three changes, we can
> > > > push the change-1 without a test case and then as a second patch have
> > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > Do you have any other ideas to proceed here?
> > >
> > > I found another test case that causes the assertion failure at
> > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > attached the patch for the test case. In this test case, I modified a
> > > user-catalog table instead of system-catalog table. That way, we don't
> > > generate invalidation messages while generating NEW_CID records. As a
> > > result, we mark only the subtransactions as containing catalog change
> > > and don't make association between top-level and sub transactions. The
> > > assertion failure happens on all supported branches. If we need to fix
> > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > branches.
> > >
> > > There are three changes as Amit mentioned, and regarding the test
> > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > between assertion failures and test cases are very complex. I could
> > > not find any test case to cause only one assertion failure on all
> > > branches. One idea to proceed is:
> > >
> > > Patch-1 includes Change-1 and is applied to all branches.
> > >
> > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > applied to all branches.
> > >
> > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > test case), and is applied to v14 and v15 (also till v11 if we
> > > prefer).
> > >
> > > The patch-1 doesn't include any test case but the user_catalog test
> > > case can test both Change-1 and Change-2 on all branches.
> > >
> >
> > I was wondering if it makes sense to commit both Change-1 and Change-2
> > together as one patch? Both assertions are caused by a single test
> > case and are related to the general problem that the association of
> > top and sub transaction is only guaranteed to be formed before we
> > decode transaction changes. Also, it would be good to fix the problem
> > with a test case that can cause it. What do you think?
>
> Yeah, it makes sense to me.
>

I've attached two patches that need to be back-patched to all branches
and include Change-1, Change-2, and a test case for them. FYI, this
patch resolves the assertion failure reported in this thread as well
as the one reported in another thread[2]. So I borrowed some of the
changes from the patch[2] Osumi-san recently proposed.

Regards,

[1]
https://www.postgresql.org/message-id/TYCPR01MB83733C6CEAE47D0280814D5AED7A9%40TYCPR01MB8373.jpnprd01.prod.outlook.com
[2]
https://www.postgresql.org/message-id/TYAPR01MB5866B30A1439043B1FC3F21EF5229%40TYAPR01MB5866.jpnprd01.prod.outlook.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

I've attached patches for Change-3 with a test case. Please review them as well.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Oct 18, 2022 at 9:53 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Oct 18, 2022 at 7:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Oct 18, 2022 at 1:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > >
> > > > > I think because the test case proposed needs all three changes, we can
> > > > > push the change-1 without a test case and then as a second patch have
> > > > > change-2 for HEAD and change-3 for back branches with the test case.
> > > > > Do you have any other ideas to proceed here?
> > > >
> > > > I found another test case that causes the assertion failure at
> > > > "Assert(!needs_snapshot || needs_timetravel);" on all branches. I've
> > > > attached the patch for the test case. In this test case, I modified a
> > > > user-catalog table instead of system-catalog table. That way, we don't
> > > > generate invalidation messages while generating NEW_CID records. As a
> > > > result, we mark only the subtransactions as containing catalog change
> > > > and don't make association between top-level and sub transactions. The
> > > > assertion failure happens on all supported branches. If we need to fix
> > > > this (I believe so), Change-2 needs to be backpatched to all supported
> > > > branches.
> > > >
> > > > There are three changes as Amit mentioned, and regarding the test
> > > > case, we have three test cases I've attached: truncate_testcase.patch,
> > > > analyze_testcase.patch, uesr_catalog_testcase.patch. The relationship
> > > > between assertion failures and test cases are very complex. I could
> > > > not find any test case to cause only one assertion failure on all
> > > > branches. One idea to proceed is:
> > > >
> > > > Patch-1 includes Change-1 and is applied to all branches.
> > > >
> > > > Patch-2 includes Change-2 and the user_catalog test case, and is
> > > > applied to all branches.
> > > >
> > > > Patch-3 includes Change-3 and the truncate test case (or the analyze
> > > > test case), and is applied to v14 and v15 (also till v11 if we
> > > > prefer).
> > > >
> > > > The patch-1 doesn't include any test case but the user_catalog test
> > > > case can test both Change-1 and Change-2 on all branches.
> > > >
> > >
> > > I was wondering if it makes sense to commit both Change-1 and Change-2
> > > together as one patch? Both assertions are caused by a single test
> > > case and are related to the general problem that the association of
> > > top and sub transaction is only guaranteed to be formed before we
> > > decode transaction changes. Also, it would be good to fix the problem
> > > with a test case that can cause it. What do you think?
> >
> > Yeah, it makes sense to me.
> >
>
> I've attached two patches that need to be back-patched to all branches
> and includes Change-1, Change-2, and a test case for them. FYI this
> patch resolves the assertion failure reported in this thread as well
> as one reported in another thread[2]. So I borrowed some of the
> changes from the patch[2] Osumi-san recently proposed.
>

Amit pointed out offlist that the changes in reorderbuffer.c were not
pgindent'ed. I've run pgindent and attached updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > I've attached two patches that need to be back-patched to all branches
> > and includes Change-1, Change-2, and a test case for them. FYI this
> > patch resolves the assertion failure reported in this thread as well
> > as one reported in another thread[2]. So I borrowed some of the
> > changes from the patch[2] Osumi-san recently proposed.
> >
>
> Amit pointed out offlist that the changes in reorderbuffer.c is not
> pgindent'ed. I've run pgindent and attached updated patches.
>

Thanks, I have tested these across all branches down to v10 and they
work as expected. I am planning to push this tomorrow unless I see any
further comments.

-- 
With Regards,
Amit Kapila.



On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've attached patches for Change-3 with a test case. Please review them as well.
>

The patch looks mostly good to me apart from a few minor comments,
which are as follows:
1.
+# The last decoding restarts from the first checkpoint, and add
invalidation messages
+# generated by "s0_truncate" to the subtransaction. When decoding the
commit record of
+# the top-level transaction, we mark both top-level transaction and
its subtransactions
+# as containing catalog changes. However, we check if we don't create
the association
+# between top-level and subtransactions at this time. Otherwise, we
miss executing
+# invalidation messages when forgetting the transaction.
+permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
"s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
"s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
"s1_get_changes"

The second part of this comment seems to say things more than required
which makes it less clear. How about something like: "The last
decoding restarts from the first checkpoint and adds invalidation
messages generated by "s0_truncate" to the subtransaction. While
processing the commit record for the top-level transaction, we decide
to skip this xact but ensure that corresponding invalidation messages
get processed."?

2.
+ /*
+ * We will assign subtransactions to the top transaction before
+ * replaying the contents of the transaction.
+ */

I don't think we need this comment.

-- 
With Regards,
Amit Kapila.



On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > >
> > > I've attached two patches that need to be back-patched to all branches
> > > and includes Change-1, Change-2, and a test case for them. FYI this
> > > patch resolves the assertion failure reported in this thread as well
> > > as one reported in another thread[2]. So I borrowed some of the
> > > changes from the patch[2] Osumi-san recently proposed.
> > >
> >
> > Amit pointed out offlist that the changes in reorderbuffer.c is not
> > pgindent'ed. I've run pgindent and attached updated patches.
> >
>
> Thanks, I have tested these across all branches till v10 and it works
> as expected. I am planning to push this tomorrow unless I see any
> further comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.



On Thu, Oct 20, 2022 at 8:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 19, 2022 at 4:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Oct 19, 2022 at 1:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Oct 19, 2022 at 11:58 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > >
> > > > I've attached two patches that need to be back-patched to all branches
> > > > and includes Change-1, Change-2, and a test case for them. FYI this
> > > > patch resolves the assertion failure reported in this thread as well
> > > > as one reported in another thread[2]. So I borrowed some of the
> > > > changes from the patch[2] Osumi-san recently proposed.
> > > >
> > >
> > > Amit pointed out offlist that the changes in reorderbuffer.c is not
> > > pgindent'ed. I've run pgindent and attached updated patches.
> > >
> >
> > Thanks, I have tested these across all branches till v10 and it works
> > as expected. I am planning to push this tomorrow unless I see any
> > further comments.
> >
>
> Pushed.

Thank you!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



On Thu, Oct 20, 2022 at 6:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 19, 2022 at 9:40 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached patches for Change-3 with a test case. Please review them as well.
> >
>
> The patch looks mostly good to me apart from few minor comments which
> are as follows:
> 1.
> +# The last decoding restarts from the first checkpoint, and add
> invalidation messages
> +# generated by "s0_truncate" to the subtransaction. When decoding the
> commit record of
> +# the top-level transaction, we mark both top-level transaction and
> its subtransactions
> +# as containing catalog changes. However, we check if we don't create
> the association
> +# between top-level and subtransactions at this time. Otherwise, we
> miss executing
> +# invalidation messages when forgetting the transaction.
> +permutation "s0_init" "s0_begin" "s0_savepoint" "s0_insert"
> "s1_checkpoint" "s1_get_changes" "s0_truncate" "s0_commit" "s0_begin"
> "s0_insert" "s1_checkpoint" "s1_get_changes" "s0_commit"
> "s1_get_changes"
>
> The second part of this comment seems to say things more than required
> which makes it less clear. How about something like: "The last
> decoding restarts from the first checkpoint and adds invalidation
> messages generated by "s0_truncate" to the subtransaction. While
> processing the commit record for the top-level transaction, we decide
> to skip this xact but ensure that corresponding invalidation messages
> get processed."?
>
> 2.
> + /*
> + * We will assign subtransactions to the top transaction before
> + * replaying the contents of the transaction.
> + */
>
> I don't think we need this comment.
>

Thank you for the comments! I agree with all of them and have updated
the patches accordingly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment
On Fri, Oct 21, 2022 at 8:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thank you for the comment! I agreed with all comments and I've updated
> patches accordingly.
>

Pushed after removing the test case from v11-13 branches as it is not
relevant to those branches and the test-1 in
catalog_change_snapshot.spec already tests the same case for those
branches.

-- 
With Regards,
Amit Kapila.



Hello,

21.10.2022 08:49, Amit Kapila wrote:
> On Fri, Oct 21, 2022 at 8:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for the comment! I agreed with all comments and I've updated
>> patches accordingly.
>>
> Pushed after removing the test case from v11-13 branches as it is not
> relevant to those branches and the test-1 in
> catalog_change_snapshot.spec already tests the same case for those
> branches.

I've managed to get that assertion failure again (on master) while playing
with the concurrent installcheck. This can be easily reproduced with the
following script:
numclients=5
for ((c=1;c<=numclients;c++)); do
   cp -r contrib/test_decoding contrib/test_decoding_$c
   sed "s/isolation_slot/isolation_slot_$c/" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Use 
independent slots
   sed "$(printf '$p; %.0s' `seq 50`)" -i contrib/test_decoding_$c/specs/catalog_change_snapshot.spec # Repeat the last

permutation 50 times
done
for ((c=1;c<=numclients;c++)); do
   EXTRA_REGRESS_OPTS="--dbname=regress_$c" make -s installcheck-force -C contrib/test_decoding_$c USE_MODULE_DB=1 >"installcheck-$c.log" 2>&1 &
done
wait
grep 'TRAP:' server.log

Produces for me:
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794105
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794104
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794099
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794105
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794104
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn"), File: "reorderbuffer.c", Line: 942, PID: 3794099

With the debug logging added inside AssertTXNLsnOrder() I see:
ctx->snapshot_builder->start_decoding_at: 209807224, ctx->reader->EndRecPtr: 210043072,
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
and inside the loop:
cur_txn->first_lsn: 209792872
cur_txn->first_lsn: 209975744
cur_txn->first_lsn: 210043008
cur_txn->first_lsn: 210043008
and it triggers the Assert.

Best regards,
Alexander



On 6/6/23 11:00, Alexander Lakhin wrote:
> Hello,
> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
> ctx->snapshot_builder->start_decoding_at: 209807224,
> ctx->reader->EndRecPtr: 210043072,
> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
> and inside the loop:
> cur_txn->first_lsn: 209792872
> cur_txn->first_lsn: 209975744
> cur_txn->first_lsn: 210043008
> cur_txn->first_lsn: 210043008
> and it triggers the Assert.
> 

So what's the prev_first_lsn value for these first_lsn values? How does
it change over time? Did you try looking at the pg_waldump for these
positions?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Hello Tomas,

06.06.2023 12:56, Tomas Vondra wrote:
> On 6/6/23 11:00, Alexander Lakhin wrote:
>> Hello,
>> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
>> ctx->snapshot_builder->start_decoding_at: 209807224,
>> ctx->reader->EndRecPtr: 210043072,
>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>> and inside the loop:
>> cur_txn->first_lsn: 209792872
>> cur_txn->first_lsn: 209975744
>> cur_txn->first_lsn: 210043008
>> cur_txn->first_lsn: 210043008
>> and it triggers the Assert.
>>
> So what's the prev_first_lsn value for these first_lsn values? How does
> it change over time? Did you try looking at the pg_waldump for these
> positions?

With more logging I've got (for another run):
ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80

ctx->snapshot_builder->start_decoding_at: c1f2cc0, ctx->reader->EndRecPtr: c1fbcc0, 
SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...

waldump for 00000001000000000000000C shows:
grep c1fbc80:
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn: 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/18763/19987 blk 0

grep '( 3397| 3398)'
rmgr: Transaction len (rec/tot):     43/    43, tx:       3398, lsn: 0/0C1F2B20, prev 0/0C1F2688, desc: ASSIGNMENT xtop 3397: subxacts: 3398
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1F2B50, prev 0/0C1F2B20, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/18763/19981 blk 0
rmgr: Standby     len (rec/tot):     62/    62, tx:          0, lsn: 0/0C1F2BD0, prev 0/0C1F2B90, desc: RUNNING_XACTS nextXid 3400 latestCompletedXid 3396 oldestRunningXid 3397; 2 xacts: 3399 3397; 1 subxacts: 3398
rmgr: Standby     len (rec/tot):     58/    58, tx:          0, lsn: 0/0C1F2C80, prev 0/0C1F2C50, desc: RUNNING_XACTS nextXid 3400 latestCompletedXid 3399 oldestRunningXid 3397; 1 xacts: 3397; 1 subxacts: 3398
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/0C1F2CC0, prev 0/0C1F2C80, desc: CHECKPOINT_ONLINE redo 0/C1F2C10; tli 1; prev tli 1; fpw true; xid 0:3400; oid 24576; multi 13; offset 29; oldest xid 722 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 3397; online
rmgr: Standby     len (rec/tot):     62/    62, tx:          0, lsn: 0/0C1FBAD0, prev 0/0C1FBAA0, desc: RUNNING_XACTS nextXid 3401 latestCompletedXid 3399 oldestRunningXid 3397; 2 xacts: 3400 3397; 1 subxacts: 3398
rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn: 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn: 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08, blkref #0: rel 1663/18763/19987 blk 0
rmgr: Transaction len (rec/tot):     54/    54, tx:       3397, lsn: 0/0C1FBD00, prev 0/0C1FBCC0, desc: COMMIT 2023-06-06 13:55:26.955268 MSK; subxacts: 3398

Best regards,
Alexander




On 6/6/23 14:00, Alexander Lakhin wrote:
> Hello Tomas,
> 
> 06.06.2023 12:56, Tomas Vondra wrote:
>> On 6/6/23 11:00, Alexander Lakhin wrote:
>>> Hello,
>>> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
>>> ctx->snapshot_builder->start_decoding_at: 209807224,
>>> ctx->reader->EndRecPtr: 210043072,
>>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>>> and inside the loop:
>>> cur_txn->first_lsn: 209792872
>>> cur_txn->first_lsn: 209975744
>>> cur_txn->first_lsn: 210043008
>>> cur_txn->first_lsn: 210043008
>>> and it triggers the Assert.
>>>
>> So what's the prev_first_lsn value for these first_lsn values? How does
>> it change over time? Did you try looking at the pg_waldump for these
>> positions?
> 
> With more logging I've got (for another run):
> ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80
> 
> ctx->snapshot_builder->start_decoding_at: c1f2cc0,
> ctx->reader->EndRecPtr: c1fbcc0,
> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
> prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
> prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
> TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...
> 
> waldump for 00000001000000000000000C shows:
> grep c1fbc80:
> rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn:
> 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
> 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
> rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn:
> 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
> blkref #0: rel 1663/18763/19987 blk 0
> 
> grep '( 3397| 3398)'

I've been able to reproduce this, after messing with the script a little
bit (I had to skip the test_decoding regression tests, because those were
complaining about slots already existing, etc.).

Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):

  26662 0/6462E6F0 (first 0/0)
  26661 0/6462E6F0 (first 0/6462E6F0)


where 26661 is the top xact and 26662 is a subxact of 26661. This is
clearly a problem, because we really should not have a subxact in this
list once the assignment gets applied.

And the relevant WAL looks like this:

---------------------------------------------------------------------
26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
subxacts: 26662
26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
...
    0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
26661 26664; 3 subxacts: 26668 26662 26665
...
26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
16:41:24.442870 CEST; subxacts: 26662
---------------------------------------------------------------------

so the assignment is the *first* thing that happens for these xacts.

However, we skip the assignment, because the log for this call of
get_changes says this:

   LOG:  logical decoding found consistent point at 0/6462E5D8

so we fail to realize that 26662 is a subxact.

Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
does this:

  ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

  ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
                 xlrec->target_locator, xlrec->target_tid,
                 xlrec->cmin, xlrec->cmax,
                 xlrec->combocid);

and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
passed XID (which is xlrec->top_xid, 26661), but with the LSN of the WAL
record. But ReorderBufferXidSetCatalogChanges() already did the same
thing for the subxact 26662, as it has no idea it's a subxact (due to
the skipped assignment).
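
In other words, annotating those same two calls from
SnapBuildProcessNewCid() (comments are mine, xids taken from the dump
above):

    /*
     * xid is the record's xid, i.e. the subxact 26662. With the ASSIGNMENT
     * skipped there is no parent/child link, so the reorder buffer treats
     * it as if it were a top-level transaction.
     */
    ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);

    /*
     * xlrec->top_xid is the real top-level xact 26661; this enters an entry
     * for it at the LSN of the same NEW_CID record, which is how we end up
     * with two apparently top-level transactions at the same position.
     */
    ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
                                 xlrec->target_locator, xlrec->target_tid,
                                 xlrec->cmin, xlrec->cmax,
                                 xlrec->combocid);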

I haven't figured out what exactly is happening / what it should be
doing instead. But it seems wrong to skip the assignment - I wonder if
SnapBuildProcessRunningXacts might be doing that too eagerly.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




On 6/6/23 17:42, Tomas Vondra wrote:
> 
> 
> On 6/6/23 14:00, Alexander Lakhin wrote:
>> Hello Tomas,
>>
>> 06.06.2023 12:56, Tomas Vondra wrote:
>>> On 6/6/23 11:00, Alexander Lakhin wrote:
>>>> Hello,
>>>> ...> With the debug logging added inside AssertTXNLsnOrder() I see:
>>>> ctx->snapshot_builder->start_decoding_at: 209807224,
>>>> ctx->reader->EndRecPtr: 210043072,
>>>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>>>> and inside the loop:
>>>> cur_txn->first_lsn: 209792872
>>>> cur_txn->first_lsn: 209975744
>>>> cur_txn->first_lsn: 210043008
>>>> cur_txn->first_lsn: 210043008
>>>> and it triggers the Assert.
>>>>
>>> So what's the prev_first_lsn value for these first_lsn values? How does
>>> it change over time? Did you try looking at the pg_waldump for these
>>> positions?
>>
>> With more logging I've got (for another run):
>> ReorderBufferTXNByXid| xid: 3397, lsn: c1fbc80
>>
>> ctx->snapshot_builder->start_decoding_at: c1f2cc0,
>> ctx->reader->EndRecPtr: c1fbcc0,
>> SnapBuildXactNeedsSkip(ctx->snapshot_builder, ctx->reader->EndRecPtr): 0
>> prev_first_lsn: 0, cur_txn->first_lsn: c1fbc80
>> prev_first_lsn: c1fbc80, cur_txn->first_lsn: c1fbc80
>> TRAP: failed Assert("prev_first_lsn < cur_txn->first_lsn") ...
>>
>> waldump for 00000001000000000000000C shows:
>> grep c1fbc80:
>> rmgr: Heap2       len (rec/tot):     60/    60, tx:       3398, lsn:
>> 0/0C1FBC80, prev 0/0C1FBC50, desc: NEW_CID rel: 1663/18763/19987, tid:
>> 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
>> rmgr: Heap        len (rec/tot):     59/    59, tx:       3398, lsn:
>> 0/0C1FBCC0, prev 0/0C1FBC80, desc: INSERT+INIT off: 1, flags: 0x08,
>> blkref #0: rel 1663/18763/19987 blk 0
>>
>> grep '( 3397| 3398)'
> 
> I've been able to reproduce this, after messing with the script a little
> bit (I had to skip the test_decoding regression tests, because that was
> complaining about slots already existing etc).
> 
> Anyway, AssertTXNLsnOrder sees these two transactions (before aborting):
> 
>   26662 0/6462E6F0 (first 0/0)
>   26661 0/6462E6F0 (first 0/6462E6F0)
> 
> 
> where 26661 is the top xact, 26662 is a subxact of 26661. This is
> clearly a problem, because we really should not have subxact in this
> list once the assignment gets applied.
> 
> And the relevant WAL looks like this:
> 
> ---------------------------------------------------------------------
> 26662, lsn: 0/645EDAA0, prev 0/645EDA60, desc: ASSIGNMENT xtop 26661:
> subxacts: 26662
> 26662, lsn: 0/645EDAD0, prev 0/645EDAA0, desc: INSERT+INIT off: 1,
> flags: 0x08, blkref #0: rel 1663/125447/126835 blk 0
> ...
>     0, lsn: 0/6462E5D8, prev 0/6462E2A0, desc: RUNNING_XACTS nextXid
> 26673 latestCompletedXid 26672 oldestRunningXid 26661; 3 xacts: 26667
> 26661 26664; 3 subxacts: 26668 26662 26665
> ...
> 26662, lsn: 0/6462E6F0, prev 0/6462E678, desc: NEW_CID rel:
> 1663/125447/126841, tid: 0/1, cmin: 1, cmax: 4294967295, combo: 4294967295
> 26662, lsn: 0/6462E730, prev 0/6462E6F0, desc: INSERT+INIT off: 1,
> flags: 0x08, blkref #0: rel 1663/125447/126841 blk 0
> 26661, lsn: 0/6462E770, prev 0/6462E730, desc: COMMIT 2023-06-06
> 16:41:24.442870 CEST; subxacts: 26662
> ---------------------------------------------------------------------
> 
> so the assignment is the *first* thing that happens for these xacts.
> 
> However, we skip the assignment, because the log for this call of
> get_changes says this:
> 
>    LOG:  logical decoding found consistent point at 0/6462E5D8
> 
> so we fail to realize the 26662 is a subxact.
> 
> Then when processing the NEW_CID, SnapBuildProcessNewCid chimes in and
> does this:
> 
>   ReorderBufferXidSetCatalogChanges(builder->reorder, xid, lsn);
> 
>   ReorderBufferAddNewTupleCids(builder->reorder, xlrec->top_xid, lsn,
>                  xlrec->target_locator, xlrec->target_tid,
>                  xlrec->cmin, xlrec->cmax,
>                  xlrec->combocid);
> 
> and ReorderBufferAddNewTupleCids() proceeds to enter an entry for the
> passed XID (which is xlrec->top_xid, 26661), but with LSN of the WAL
> record. But ReorderBufferXidSetCatalogChanges() already did the same
> thing for the subxact 26662, as it has no idea it's a subxact (due to
> the skipped assignment).
> 
> I haven't figured out what exactly is happening / what it should be
> doing instead. But it seems wrong to skip the assignment - I wonder if
> SnapBuildProcessRunningXacts might be doing that too eagerly.
> 

I investigated this a bit more, and the problem actually seems to be
more like this:

1) we create a new logical replication slot

2) while building the initial snapshot, we start with current insert
location, and then process records

3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
SnapBuildFindSnapshot

4) SnapBuildFindSnapshot does this:

    else if (!builder->building_full_snapshot &&
             SnapBuildRestore(builder, lsn))
    {
        /* there won't be any state to cleanup */
        return false;
    }

5) because create_logical_replication_slot and get_changes both call
CreateInitDecodingContext with needs_full_snapshot=false, we end up
calling SnapBuildRestore()

6) once in a while this likely hits a snapshot created by a concurrent
session (for another logical slot) with SNAPBUILD_CONSISTENT state


I don't know what the correct fix for this is. Maybe we should set
needs_full_snapshot=true in create_logical_replication_slot when
creating the initial snapshot? Maybe we should use true even in
pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
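
To be concrete, the first variant is roughly this (a sketch from memory
of the CreateInitDecodingContext() call in
create_logical_replication_slot() in slotfuncs.c, so the surrounding
arguments may not match exactly), with the need_full_snapshot argument
flipped:

    ctx = CreateInitDecodingContext(plugin, NIL,
                                    true,   /* need_full_snapshot, was false */
                                    InvalidXLogRecPtr,
                                    XL_ROUTINE(.page_read = read_local_xlog_page,
                                               .segment_open = wal_segment_open,
                                               .segment_close = wal_segment_close),
                                    NULL, NULL, NULL);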

That'll prevent reading the serialized snapshots like this, but how
could that ever work? It seems pretty much guaranteed to ignore any
assignments that happened right before the snapshot?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/6/23 17:42, Tomas Vondra wrote:
> >
>
> In investigated this a bit more, and the problem actually seems to be
> more like this:
>
> 1) we create a new logical replication slot
>
> 2) while building the initial snapshot, we start with current insert
> location, and then process records
>
> 3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
> SnapBuildFindSnapshot
>
> 4) SnapBuildFindSnapshot does this:
>
>     else if (!builder->building_full_snapshot &&
>              SnapBuildRestore(builder, lsn))
>     {
>         /* there won't be any state to cleanup */
>         return false;
>     }
>
> 5) because create_logical_replication_slot and get_changes both call
> CreateInitDecodingContext with needs_full_snapshot=false, we end up
> calling SnapBuildRestore()
>
> 6) once in a while this likely hits a snapshot created by a concurrent
> session (for another logical slot) with SNAPBUILD_CONSISTENT state
>

I think this analysis doesn't seem to match what you mentioned in the
previous email which is as follows:
> > However, we skip the assignment, because the log for this call of
> > get_changes says this:
> >
> >    LOG:  logical decoding found consistent point at 0/6462E5D8
> >
> > so we fail to realize the 26662 is a subxact.

This is because the above LOG is printed when
"running->oldestRunningXid == running->nextXid" not when we restore
the snapshot. Am, I missing something?

>
> I don't know what's the correct fix for this. Maybe we should set
> needs_full_snapshot=true in create_logical_replication_slot when
> creating the initial snapshot? Maybe we should use true even in
> pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
>

I don't think that is advisable because setting "needs_full_snapshot"
to true for decoding means the snapshot will start tracking
non-catalog committed xacts as well which is costly and is not
required for this case.

> That'll prevent reading the serialized snapshots like this, but how
> could that ever work? It seems pretty much guaranteed to ignore any
> assignments that happened right before the snapshot?
>

This part needs some analysis/thoughts. BTW, do you mean that it skips
the assignment (a) because the assignment record is before we reach a
consistent point, or (b) because we start reading WAL after the
assignment, or (c) something else? If you intend to say (a) then can
you please point me to the code you are referring to for the same?

--
With Regards,
Amit Kapila.




On 6/7/23 07:18, Amit Kapila wrote:
> On Wed, Jun 7, 2023 at 6:18 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 6/6/23 17:42, Tomas Vondra wrote:
>>>
>>
>> In investigated this a bit more, and the problem actually seems to be
>> more like this:
>>
>> 1) we create a new logical replication slot
>>
>> 2) while building the initial snapshot, we start with current insert
>> location, and then process records
>>
>> 3) for RUNNING_XACTS we call SnapBuildProcessRunningXacts, which calls
>> SnapBuildFindSnapshot
>>
>> 4) SnapBuildFindSnapshot does this:
>>
>>     else if (!builder->building_full_snapshot &&
>>              SnapBuildRestore(builder, lsn))
>>     {
>>         /* there won't be any state to cleanup */
>>         return false;
>>     }
>>
>> 5) because create_logical_replication_slot and get_changes both call
>> CreateInitDecodingContext with needs_full_snapshot=false, we end up
>> calling SnapBuildRestore()
>>
>> 6) once in a while this likely hits a snapshot created by a concurrent
>> session (for another logical slot) with SNAPBUILD_CONSISTENT state
>>
> 
> I think this analysis doesn't seem to match what you mentioned in the
> previous email which is as follows:
>>> However, we skip the assignment, because the log for this call of
>>> get_changes says this:
>>>
>>>    LOG:  logical decoding found consistent point at 0/6462E5D8
>>>
>>> so we fail to realize the 26662 is a subxact.
> 
> This is because the above LOG is printed when
> "running->oldestRunningXid == running->nextXid" not when we restore
> the snapshot. Am, I missing something?
> 

There are multiple places in snapbuild.c with the same message. Two in
SnapBuildFindSnapshot (one of them being the one you mentioned) and one
in SnapBuildRestore (which is the one actually triggered).

>>
>> I don't know what's the correct fix for this. Maybe we should set
>> needs_full_snapshot=true in create_logical_replication_slot when
>> creating the initial snapshot? Maybe we should use true even in
>> pg_logical_slot_get_changes_guts? This seems to fix the crashes ...
>>
> 
> I don't think that is advisable because setting "needs_full_snapshot"
> to true for decoding means the snapshot will start tracking
> non-catalog committed xacts as well which is costly and is not
> required for this case.
> 

True. TBH I managed to forget most of these details, so I meant it more
like a data point that it seems to fix the issue for me.

>> That'll prevent reading the serialized snapshots like this, but how
>> could that ever work? It seems pretty much guaranteed to ignore any
>> assignments that happened right before the snapshot?
>>
> 
> This part needs some analysis/thoughts. BTW, do you mean that it skips
> the assignment (a) because the assignment record is before we reach a
> consistent point, or (b) because we start reading WAL after the
> assignment, or (c) something else? If you intend to say (a) then can
> you please point me to the code you are referring to for the same?
> 

Well, I think the issue is pretty clear - we end up with an initial
snapshot that's in between the ASSIGNMENT and NEW_CID, and because
NEW_CID carries both the top-level and subxact XID, it fails because we
add two TXNs with the same LSN, not realizing one of them is a subxact.

That's obviously wrong, although somewhat benign in production, because
it only fails by hitting an assert. Regular builds are likely to just
ignore it, although I haven't checked how the COMMIT cleanup handles it
(I wonder if we remove the subxact from the toplevel list on commit).
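
(For the record, IIRC the assignment path does clean this up -
ReorderBufferAssignChild() has roughly the following, paraphrased from
memory, so once an assignment actually gets applied the subxact should
be removed from the toplevel list:)

    /* ReorderBufferAssignChild(), paraphrased */
    if (!new_sub && !rbtxn_is_known_subxact(subtxn))
    {
        /*
         * We already saw this transaction, but initially added it to
         * the list of top-level txns.  Now that we know it's not
         * top-level, remove it from there.
         */
        dlist_delete(&subtxn->node);
    }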

I think the problem is we just grab an existing snapshot, before all
running xacts complete. Maybe we should fix that, and leave the
needs_full_snapshot alone. Haven't tried that, though.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
> Well, I think the issue is pretty clear - we end up with an initial
> snapshot that's in between the ASSIGNMENT and NEW_CID, and because
> NEW_CID has both xact and subxact XID it fails because we add two TXNs
> with the same LSN, not realizing one of them is subxact.
>
> That's obviously wrong, although somewhat benign in production because
> it only fails because of hitting an assert.
>

Doesn't this indicate that we can end up decoding a partial
transaction when we restore a snapshot? Won't that be a problem even
for production?

> Regular builds are likely to
> just ignore it, although I haven't checked if the COMMIT cleanup (I
> wonder if we remove the subxact from the toplevel list on commit).
>
> I think the problem is we just grab an existing snapshot, before all
> running xacts complete. Maybe we should fix that, and leave the
> needs_full_snapshot alone.
>

It is not clear what exactly you have in mind to fix this, because if
there is no running xact, we don't even need to restore the snapshot,
due to the prior check "if (running->oldestRunningXid ==
running->nextXid)". I think the main problem is that we start decoding
immediately from the point where we restored a snapshot, even though at
that point there could be running xacts for which we would decode only
partial changes.
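
For reference, the relevant part of SnapBuildFindSnapshot() looks
roughly like this (heavily abbreviated, so not the exact code):

    if (running->oldestRunningXid == running->nextXid)
    {
        /* no transactions are running, we are consistent right away */
        builder->state = SNAPBUILD_CONSISTENT;
        return false;
    }
    else if (!builder->building_full_snapshot &&
             SnapBuildRestore(builder, lsn))
    {
        /*
         * Restored a serialized snapshot, but xacts running at this
         * LSN are already in progress and we start decoding here
         * anyway, so we may see only parts of them.
         */
        return false;
    }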

--
With Regards,
Amit Kapila.



On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> >
> > Well, I think the issue is pretty clear - we end up with an initial
> > snapshot that's in between the ASSIGNMENT and NEW_CID, and because
> > NEW_CID has both xact and subxact XID it fails because we add two TXNs
> > with the same LSN, not realizing one of them is subxact.
> >
> > That's obviously wrong, although somewhat benign in production because
> > it only fails because of hitting an assert.
> >
> 
> Doesn't this indicate that we can end up decoding a partial transaction when
> we restore a snapshot? Won't that be a problem even for production?

Yes, I think it can cause a problem where only partial changes of a transaction are streamed.
I tried to reproduce this and here are the steps. Note, to make sure the test
won't be affected by other running_xact WALs, I changed LOG_SNAPSHOT_INTERVAL_MS
to a bigger number.

session 1:
-----
create table test(a int);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1', 'test_decoding');
-----

session 2:
-----
- Start a transaction
BEGIN;
INSERT INTO test VALUES(1);
-----

session 3:
-----
- Create another slot isolation_slot_2; it should choose a restart point which is
- after the changes that happened in session 2. Note that, to let the current slot
- restore another snapshot, we need to use gdb to block the current backend at
- SnapBuildFindSnapshot(); the backend should have logged the running_xacts WAL
- before reaching SnapBuildFindSnapshot().

SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2', 'test_decoding');
-----

session 1:
-----
- Since there is a running_xacts which session 3 logged, the current backend will
- serialize the snapshot when decoding the running_xacts WAL, and the snapshot
- can be used by other slots (e.g. isolation_slot_2)

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
-----

session 2:
-----
- Insert some different data and commit the transaction.

INSERT INTO test VALUES(2);
INSERT INTO test VALUES(3);
INSERT INTO test VALUES(4);
COMMIT;
-----

session 3:
-----
- Release the process and try to stream the changes. Since the restart point is
- in the middle of the transaction, it will stream partial changes of the
- transaction which was committed in session 2:

SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids',
'0');
-----

Results (partial streamed changes):
postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL, 'skip-empty-xacts', '1',
'include-xids','0');
 
                  data
-----------------------------------------
 BEGIN
 table public.test: INSERT: a[integer]:2
 table public.test: INSERT: a[integer]:3
 table public.test: INSERT: a[integer]:4
 COMMIT
(5 rows)

> 
> > Regular builds are likely to
> > just ignore it, although I haven't checked if the COMMIT cleanup (I
> > wonder if we remove the subxact from the toplevel list on commit).
> >
> > I think the problem is we just grab an existing snapshot, before all
> > running xacts complete. Maybe we should fix that, and leave the
> > needs_full_snapshot alone.
> >
> 
> It is not clear what exactly you have in mind to fix this because if there is no
> running xact, we don't even need to restore the snapshot because of a prior
> check "if (running->oldestRunningXid ==
> running->nextXid)". I think the main problem is that we started
> decoding immediately from the point where we restored a snapshot as at that
> point we could have some partial running xacts.


Best Regards,
Hou zj

On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> 
> On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> > >
> > >
> > > Well, I think the issue is pretty clear - we end up with an initial
> > > snapshot that's in between the ASSIGNMENT and NEW_CID, and because
> > > NEW_CID has both xact and subxact XID it fails because we add two
> > > TXNs with the same LSN, not realizing one of them is subxact.
> > >
> > > That's obviously wrong, although somewhat benign in production
> > > because it only fails because of hitting an assert.
> > >
> >
> > Doesn't this indicate that we can end up decoding a partial
> > transaction when we restore a snapshot? Won't that be a problem even for
> production?
> 
> Yes, I think it can cause the problem that only partial changes of a transaction
> are streamed.
> I tried to reproduce this and here are the steps. Note, to make sure the test
> won't be affected by other running_xact WALs, I changed
> LOG_SNAPSHOT_INTERVAL_MS to a bigger number.
> 
> session 1:
> -----
> create table test(a int);
> SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1',
> 'test_decoding');
> -----
> 
> session 2:
> -----
> - Start a transaction
> BEGIN;
> INSERT INTO test VALUES(1);
> -----
> 
> session 3:
> -----
> - Create another slot isolation_slot_2, it should choose a restart_point which is
> - after the changes that happened in session 2. Note, to let the current slot
> - restore another snapshot, we need to use gdb to block the current backend
> at
> - SnapBuildFindSnapshot(), the backend should have logged the running_xacts
> WAL
> - before reaching SnapBuildFindSnapshot.
> 
> SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2',
> 'test_decoding');
> -----
> 
> session 1:
> -----
> - Since there is a running_xacts which session 3 logged, the current backend
> will
> - serialize the snapshot when decoding the running_xacts WAL, and the
> snapshot
> - can be used by other slots (e.g. isolation_slot_2)
> 
> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
> 'skip-empty-xacts', '1', 'include-xids', '0');
> -----
> 
> session 2:
> -----
> - Insert some different data and commit the transaction.
> 
> INSERT INTO test VALUES(2);
> INSERT INTO test VALUES(3);
> INSERT INTO test VALUES(4);
> COMMIT
> -----
> 
> session 3:
> -----
> - Release the process and try to stream the changes, since the restart point is
> - at the middle of the transaction, it will stream partial changes of the
> - transaction which was committed in session 2:
> 
> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL,
> 'skip-empty-xacts', '1', 'include-xids', '0');
> -----
> 
> Results (partial streamed changes):
> postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
> NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
>                   data
> -----------------------------------------
>  BEGIN
>  table public.test: INSERT: a[integer]:2
>  table public.test: INSERT: a[integer]:3
>  table public.test: INSERT: a[integer]:4
>  COMMIT
> (5 rows)
> 

One idea to fix the partial change stream problem would be that we record all
the running transactions' XIDs when restoring the snapshot in
SnapBuildFindSnapshot(), and in the following decoding, we skip decoding
changes for the recorded transactions. Or we can do something similar to
7f13ac8 (serialize the information of running xacts, if any).
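
Something like the following (just a hypothetical sketch to illustrate
the idea - the two SnapBuild* helpers below don't exist and the names
are made up):

    /* in SnapBuildFindSnapshot(), after SnapBuildRestore() succeeds:
     * remember every xact (and subxact) running at this point */
    for (i = 0; i < running->xcnt + running->subxcnt; i++)
        SnapBuildRememberRunningXid(builder, running->xids[i]);

    /* later, while decoding each change: skip xacts that were already
     * running when the snapshot was restored, since we may have missed
     * their beginning */
    if (SnapBuildXidWasRunningAtRestore(ctx->snapshot_builder, xid))
        return;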

But one point I am not very sure about is that we might also restore a
snapshot serialized in SnapBuildSerializationPoint(), where we don't have the
running transactions information. SnapBuildSerializationPoint() is invoked for
XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN records, where it seems no
transaction should be running when the record is logged. But I am not 100%
sure whether the problem can happen in this case as well.

Thoughts ?

Best Regards,
Hou zj


On 6/14/23 05:15, Zhijie Hou (Fujitsu) wrote:
> On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>>
>> On Tuesday, June 13, 2023 12:04 PM Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>>
>>> On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>
>>>>
>>>> Well, I think the issue is pretty clear - we end up with an initial
>>>> snapshot that's in between the ASSIGNMENT and NEW_CID, and because
>>>> NEW_CID has both xact and subxact XID it fails because we add two
>>>> TXNs with the same LSN, not realizing one of them is subxact.
>>>>
>>>> That's obviously wrong, although somewhat benign in production
>>>> because it only fails because of hitting an assert.
>>>>
>>>
>>> Doesn't this indicate that we can end up decoding a partial
>>> transaction when we restore a snapshot? Won't that be a problem even for
>> production?
>>
>> Yes, I think it can cause the problem that only partial changes of a transaction
>> are streamed.
>> I tried to reproduce this and here are the steps. Note, to make sure the test
>> won't be affected by other running_xact WALs, I changed
>> LOG_SNAPSHOT_INTERVAL_MS to a bigger number.
>>
>> session 1:
>> -----
>> create table test(a int);
>> SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_1',
>> 'test_decoding');
>> -----
>>
>> session 2:
>> -----
>> - Start a transaction
>> BEGIN;
>> INSERT INTO test VALUES(1);
>> -----
>>
>> session 3:
>> -----
>> - Create another slot isolation_slot_2, it should choose a restart_point which is
>> - after the changes that happened in session 2. Note, to let the current slot
>> - restore another snapshot, we need to use gdb to block the current backend
>> at
>> - SnapBuildFindSnapshot(), the backend should have logged the running_xacts
>> WAL
>> - before reaching SnapBuildFindSnapshot.
>>
>> SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot_2',
>> 'test_decoding');
>> -----
>>
>> session 1:
>> -----
>> - Since there is a running_xacts which session 3 logged, the current backend
>> will
>> - serialize the snapshot when decoding the running_xacts WAL, and the
>> snapshot
>> - can be used by other slots (e.g. isolation_slot_2)
>>
>> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1', NULL, NULL,
>> 'skip-empty-xacts', '1', 'include-xids', '0');
>> -----
>>
>> session 2:
>> -----
>> - Insert some different data and commit the transaction.
>>
>> INSERT INTO test VALUES(2);
>> INSERT INTO test VALUES(3);
>> INSERT INTO test VALUES(4);
>> COMMIT
>> -----
>>
>> session 3:
>> -----
>> - Release the process and try to stream the changes, since the restart point is
>> - at the middle of the transaction, it will stream partial changes of the
>> - transaction which was committed in session 2:
>>
>> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2', NULL, NULL,
>> 'skip-empty-xacts', '1', 'include-xids', '0');
>> -----
>>
>> Results (partial streamed changes):
>> postgres=# SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
>> NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
>>                   data
>> -----------------------------------------
>>  BEGIN
>>  table public.test: INSERT: a[integer]:2
>>  table public.test: INSERT: a[integer]:3
>>  table public.test: INSERT: a[integer]:4
>>  COMMIT
>> (5 rows)
>>
> 
> One idea to fix the partial change stream problem would be that we record all
> the running transaction's xid when restoring the snapshot in
> SnapBuildFindSnapshot(), and in the following decoding, we skip decoding
> changes for the recorded transaction. Or we can do similar to 7f13ac8(serialize the
> information of running xacts if any)
> 

We need to think about how to fix this in backbranches, and the idea
with serializing running transactions seems rather unbackpatchable (as
it changes on-disk state).

> But one point I am not very sure is that we might retore snapshot in
> SnapBuildSerializationPoint() as well where we don't have running transactions
> information. Although SnapBuildSerializationPoint() is invoked for
> XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN records which seems no
> running transaction will be there when logging. But I am not 100% sure if the
> problem can happen in this case as well.
> 

So, is the problem that we grab an existing snapshot in SnapBuildRestore
when called from SnapBuildFindSnapshot? If so, would it be possible to
just skip this while building the initial snapshot?

I tried that (by commenting out the block in SnapBuildFindSnapshot), but
it causes some output changes in test_decoding regression tests. I
haven't investigated why exactly.

Also, can you try if we still stream the partial transaction with
create_logical_replication_slot building a full snapshot?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wednesday, June 14, 2023 5:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 6/14/23 05:15, Zhijie Hou (Fujitsu) wrote:
> > On Tuesday, June 13, 2023 12:19 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >>
> >> On Tuesday, June 13, 2023 12:04 PM Amit Kapila
> >> <amit.kapila16@gmail.com>
> >> wrote:
> >>>
> >>> On Wed, Jun 7, 2023 at 6:02 PM Tomas Vondra
> >>> <tomas.vondra@enterprisedb.com> wrote:
> >>>>
> >>>>
> >>>> Well, I think the issue is pretty clear - we end up with an initial
> >>>> snapshot that's in between the ASSIGNMENT and NEW_CID, and because
> >>>> NEW_CID has both xact and subxact XID it fails because we add two
> >>>> TXNs with the same LSN, not realizing one of them is subxact.
> >>>>
> >>>> That's obviously wrong, although somewhat benign in production
> >>>> because it only fails because of hitting an assert.
> >>>>
> >>>
> >>> Doesn't this indicate that we can end up decoding a partial
> >>> transaction when we restore a snapshot? Won't that be a problem even
> >>> for
> >> production?
> >>
> >> Yes, I think it can cause the problem that only partial changes of a
> >> transaction are streamed.
> >> I tried to reproduce this and here are the steps. Note, to make sure
> >> the test won't be affected by other running_xact WALs, I changed
> >> LOG_SNAPSHOT_INTERVAL_MS to a bigger number.
> >>
> >> session 1:
> >> -----
> >> create table test(a int);
> >> SELECT 'init' FROM
> >> pg_create_logical_replication_slot('isolation_slot_1',
> >> 'test_decoding');
> >> -----
> >>
> >> session 2:
> >> -----
> >> - Start a transaction
> >> BEGIN;
> >> INSERT INTO test VALUES(1);
> >> -----
> >>
> >> session 3:
> >> -----
> >> - Create another slot isolation_slot_2, it should choose a
> >> restart_point which is
> >> - after the changes that happened in session 2. Note, to let the
> >> current slot
> >> - restore another snapshot, we need to use gdb to block the current
> >> backend at
> >> - SnapBuildFindSnapshot(), the backend should have logged the
> >> running_xacts WAL
> >> - before reaching SnapBuildFindSnapshot.
> >>
> >> SELECT 'init' FROM
> >> pg_create_logical_replication_slot('isolation_slot_2',
> >> 'test_decoding');
> >> -----
> >>
> >> session 1:
> >> -----
> >> - Since there is a running_xacts which session 3 logged, the current
> >> backend will
> >> - serialize the snapshot when decoding the running_xacts WAL, and the
> >> snapshot
> >> - can be used by other slots (e.g. isolation_slot_2)
> >>
> >> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_1',
> >> NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
> >> -----
> >>
> >> session 2:
> >> -----
> >> - Insert some different data and commit the transaction.
> >>
> >> INSERT INTO test VALUES(2);
> >> INSERT INTO test VALUES(3);
> >> INSERT INTO test VALUES(4);
> >> COMMIT
> >> -----
> >>
> >> session 3:
> >> -----
> >> - Release the process and try to stream the changes, since the
> >> restart point is
> >> - at the middle of the transaction, it will stream partial changes of
> >> the
> >> - transaction which was committed in session 2:
> >>
> >> SELECT data FROM pg_logical_slot_get_changes('isolation_slot_2',
> >> NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
> >> -----
> >>
> >> Results (partial streamed changes):
> >> postgres=# SELECT data FROM
> >> pg_logical_slot_get_changes('isolation_slot_2',
> >> NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
> >>                   data
> >> -----------------------------------------
> >>  BEGIN
> >>  table public.test: INSERT: a[integer]:2
> >>  table public.test: INSERT: a[integer]:3
> >>  table public.test: INSERT: a[integer]:4
> >>  COMMIT
> >> (5 rows)
> >>
> >
> > One idea to fix the partial change stream problem would be that we
> > record all the running transaction's xid when restoring the snapshot
> > in SnapBuildFindSnapshot(), and in the following decoding, we skip
> > decoding changes for the recorded transaction. Or we can do similar to
> > 7f13ac8(serialize the information of running xacts if any)
> >
> 
> We need to think about how to fix this in backbranches, and the idea with
> serializing running transactions seems rather unbackpatchable (as it changes
> on-disk state).
> 
> > But one point I am not very sure is that we might retore snapshot in
> > SnapBuildSerializationPoint() as well where we don't have running
> > transactions information. Although SnapBuildSerializationPoint() is
> > invoked for XLOG_END_OF_RECOVERY and XLOG_CHECKPOINT_SHUTDOWN
> records
> > which seems no running transaction will be there when logging. But I
> > am not 100% sure if the problem can happen in this case as well.
> >
...
> 
> Also, can you try if we still stream the partial transaction with
> create_logical_replication_slot building a full snapshot?

Yes, it can fix this problem, because forcing create_logical_replication_slot
to build a full snapshot avoids restoring the snapshot. But I am not sure if
this is the best fix, for the same reason (it's costly) mentioned by
Amit [1].

[1] https://www.postgresql.org/message-id/CAA4eK1Kro0rej%3DZXMhcdjs%2BaYsZvNywu3-cqdRUtyAp4zqpVWw%40mail.gmail.com

Best Regards,
Hou zj


On 6/14/23 15:39, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, June 14, 2023 5:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>> ...
>>
>> Also, can you try if we still stream the partial transaction with
>> create_logical_replication_slot building a full snapshot?
> 
> Yes, It can fix this problem because force create_logical_replication_slot
> build a full snapshot can avoid restoring the snapshot. But I am not sure if
> this is the best fix for this for the same reason(it's costly) mentioned by
> Amit[1].
> 

Costly compared to the current behavior? Sure, but considering the
current behavior is incorrect/broken, that seems rather irrelevant.
Incorrect behavior can be infinitely faster.

I doubt it's significantly costlier than just setting the "full
snapshot" flag when building the initial snapshot - sure, it will take
more time than now, but that's kinda the whole point. It seems to me the
problem is exactly that it *doesn't* wait long enough.

I may be misunderstanding the solution you proposed, but this:

    One idea to fix the partial change stream problem would be that we
    record all the running transaction's xid when restoring the snapshot
    in SnapBuildFindSnapshot(), and in the following decoding, we skip
    decoding changes for the recorded transaction

sounds pretty close to what building a correct snapshot actually does.

But maybe I'm wrong - ultimately, the way to compare those approaches
seems to be to prototype this proposal, and do some tests.

There's also the question of back branches, and it seems way simpler to
just flip a flag and disable the broken optimization than to do fairly
complex stuff to save it.

I'd also point out that (a) this only affects the initial snapshot, not
every time we start the decoding context, and (b) the slots created from
walsender already do that (unless copy_data=false).

So if needs_full_snapshot=true fixes the issue, I'd just do that as the
first step - in master and back branches. And then if we want to salvage
the optimization, we can try fixing it in master.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company