Re: ERROR: subtransaction logged without previous top-level txn record - Mailing list pgsql-bugs

From Arseny Sher
Subject Re: ERROR: subtransaction logged without previous top-level txn record
Date
Msg-id 87eez1fh48.fsf@ars-thinkpad
Whole thread Raw
In response to Re: ERROR: subtransaction logged without previous top-level txnrecord  (Andres Freund <andres@anarazel.de>)
Responses Re: ERROR: subtransaction logged without previous top-level txn record
Re: ERROR: subtransaction logged without previous top-level txn record
List pgsql-bugs
Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2019-10-24 12:59:30 +0300, Arseny Sher wrote:
>> Our customer also encountered this issue and I've looked into it. The problem is
>> reproduced well enough using the instructions in the previous message.
>
> Is this with
>
> commit bac2fae05c7737530a6fe8276cd27d210d25c6ac
> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Date:   2019-09-13 16:36:28 -0300
>
>     logical decoding: process ASSIGNMENT during snapshot build
>
>     Most WAL records are ignored in early SnapBuild snapshot build phases.
>     But it's critical to process some of them, so that later messages have
>     the correct transaction state after the snapshot is completely built; in
>     particular, XLOG_XACT_ASSIGNMENT messages are critical in order for
>     sub-transactions to be correctly assigned to their parent transactions,
>     or at least one assert misbehaves, as reported by Ildar Musin.
>
>     Diagnosed-by: Masahiko Sawada
>     Author: Masahiko Sawada
>     Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com
>
> applied?

Yeah, I tried fresh master. See below: skipped xl_xact_assignment is
beyond restart_lsn at all (and thus not read), so this doesn't matter.


>> The check leading to this ERROR is too strict, it forbids legit behaviours. Say
>> we have in WAL
>>
>> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ]
>>
>> - First xl_xact_assignment record is beyond reading, i.e. earlier
>>   restart_lsn, where ready snapshot will be taken from disk.
>> - After restart_lsn there is some change of a subxact.
>> - After that, there is second xl_xact_assignment (for another subxact)
>>   revealing relationship between top and first subxact, where this ERROR fires.
>>
>> Such transaction won't be streamed because we hadn't seen it in full. It must be
>> finished before streaming will start, i.e. before confirmed_flush_lsn.
>>
>> Of course, the easiest fix is to just throw the check out.
>
> I don't think that'd actually be a fix, and just hiding a bug.

I don't see a bug here. At least in reproduced scenario I see false
alert, as explained above: transaction with skipped xl_xact_assignment
won't be streamed as it finishes before confirmed_flush_lsn. And I am
pretty sure people encountered in the field the same issue.

In the end of my mail I have offered a way to relax this check instead
of removing it to avoid false triggers: serialize/deserialize a snapshot
only at xl_running_xacts to know nextXid, add function to snapbuilder to
check whether xact can be streamed or not by looking at its xid, etc,
somehow deal with existing serialized snaps which may be logged at
END_OF_RECOVERY without nextXid. I don't believe this check is worth
these complexities. If you think it does, I can do that though.

>
>> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the
>> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not
>> just complete snapshot (snapshot by itself in FULL state is just good as in
>> CONSISTENT), but also reorderbuffer filled with all currently running
>> xacts. This is painless for decoding sessions with existing slots because they
>> won't stream anything before confirmed_flush_lsn is reached anyway, at which
>> point all transactions which hadn't got into reorderbuffer would definitely
>> finish. However, new slots might be created too early, thus losing (not
>> decoding) parts of transactions committed after freshly created
>> confirmed_flush_lsn. This can happen under the following extremely unlucky
>> circumstances:
>>   - New slot creation reserves point in WAL since which it would read it
>>     (GetXLogInsertRecPtr);
>>   - It logs xl_running_xacts to start assembling a snapshot;
>>   - Running decoding session with another slot quickly reads this
>>     xl_running_xacts and serializes its snapshot;
>>   - New slot reads xl_running_xacts and picks this snapshot up, saying that it
>>     is ready to stream henceforth, though its reorderbuffer is empty.
>
> Yea, that's a problem :(
>
>
>> Exact reproducing steps:
>>
>> -- session 1
>> create table t (i int);
>> select pg_create_logical_replication_slot('slot_1', 'test_decoding');
>>
>> -- session 2
>> begin;
>> insert into t values (1);
>>
>> -- session 3, start slot creation
>> select pg_create_logical_replication_slot('slot_2', 'test_decoding');
>> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx);
>>
>> -- session 1
>> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no
>> -- need to sleep; our snap will be immediately serialized there
>> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>>
>> -- continue slot_2 creation
>>
>> -- session 2: insert some more and commit
>> insert into t values (1);
>> commit;
>>
>> -- now this would find second insert, but not the first one
>> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0');
>
> It'd be good to make this into an isolation test.

Yeah, but to get real chance of firing this requires kinda sleep/break
in the middle of pg_create_logical_replication_slot execution, so I have
no idea how to do that =(

>
>> What we can do here?
>
> I think the easiest fix might actually be to just ignore serialized
> snapshots when building the initial snapshot.

That's an option. However, that anyway requires the distinction between
new slot creation and streaming from existing slot at snapbuilder level,
which currently doesn't exist and which constitutes most part of my
first patch. If we add that, changing between using and not using
serialized snapshots in new slot creation is easy (my patch uses it), I
think this is not principal.


--
Arseny Sher



pgsql-bugs by date:

Previous
From: Yuri Astrakhan
Date:
Subject: Re: BUG #16076: JIT causes huge delays in a complex query. jit=offsolves it.
Next
From: PG Bug reporting form
Date:
Subject: BUG #16078: Problems starting and running the app stack builder