Re: ERROR: subtransaction logged without previous top-level txn record - Mailing list pgsql-bugs
From | Arseny Sher |
---|---|
Subject | Re: ERROR: subtransaction logged without previous top-level txn record |
Date | |
Msg-id | 87eez1fh48.fsf@ars-thinkpad Whole thread Raw |
In response to | Re: ERROR: subtransaction logged without previous top-level txnrecord (Andres Freund <andres@anarazel.de>) |
Responses |
Re: ERROR: subtransaction logged without previous top-level txn record
Re: ERROR: subtransaction logged without previous top-level txn record |
List | pgsql-bugs |
Andres Freund <andres@anarazel.de> writes: > Hi, > > On 2019-10-24 12:59:30 +0300, Arseny Sher wrote: >> Our customer also encountered this issue and I've looked into it. The problem is >> reproduced well enough using the instructions in the previous message. > > Is this with > > commit bac2fae05c7737530a6fe8276cd27d210d25c6ac > Author: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: 2019-09-13 16:36:28 -0300 > > logical decoding: process ASSIGNMENT during snapshot build > > Most WAL records are ignored in early SnapBuild snapshot build phases. > But it's critical to process some of them, so that later messages have > the correct transaction state after the snapshot is completely built; in > particular, XLOG_XACT_ASSIGNMENT messages are critical in order for > sub-transactions to be correctly assigned to their parent transactions, > or at least one assert misbehaves, as reported by Ildar Musin. > > Diagnosed-by: Masahiko Sawada > Author: Masahiko Sawada > Discussion: https://postgr.es/m/CAONYFtOv+Er1p3WAuwUsy1zsCFrSYvpHLhapC_fMD-zNaRWxYg@mail.gmail.com > > applied? Yeah, I tried fresh master. See below: skipped xl_xact_assignment is beyond restart_lsn at all (and thus not read), so this doesn't matter. >> The check leading to this ERROR is too strict, it forbids legit behaviours. Say >> we have in WAL >> >> [ <xl_xact_assignment_1> <restart_lsn> <subxact_change> <xl_xact_assignment_1> <commit> confirmed_flush_lsn> ] >> >> - First xl_xact_assignment record is beyond reading, i.e. earlier >> restart_lsn, where ready snapshot will be taken from disk. >> - After restart_lsn there is some change of a subxact. >> - After that, there is second xl_xact_assignment (for another subxact) >> revealing relationship between top and first subxact, where this ERROR fires. >> >> Such transaction won't be streamed because we hadn't seen it in full. It must be >> finished before streaming will start, i.e. before confirmed_flush_lsn. >> >> Of course, the easiest fix is to just throw the check out. > > I don't think that'd actually be a fix, and just hiding a bug. I don't see a bug here. At least in reproduced scenario I see false alert, as explained above: transaction with skipped xl_xact_assignment won't be streamed as it finishes before confirmed_flush_lsn. And I am pretty sure people encountered in the field the same issue. In the end of my mail I have offered a way to relax this check instead of removing it to avoid false triggers: serialize/deserialize a snapshot only at xl_running_xacts to know nextXid, add function to snapbuilder to check whether xact can be streamed or not by looking at its xid, etc, somehow deal with existing serialized snaps which may be logged at END_OF_RECOVERY without nextXid. I don't believe this check is worth these complexities. If you think it does, I can do that though. > >> Snapbuilder enters into SNAPBUILD_CONSISTENT immediately after deserializing the >> snapshot. Generally this is incorrect because SNAPBUILD_CONSISTENT means not >> just complete snapshot (snapshot by itself in FULL state is just good as in >> CONSISTENT), but also reorderbuffer filled with all currently running >> xacts. This is painless for decoding sessions with existing slots because they >> won't stream anything before confirmed_flush_lsn is reached anyway, at which >> point all transactions which hadn't got into reorderbuffer would definitely >> finish. However, new slots might be created too early, thus losing (not >> decoding) parts of transactions committed after freshly created >> confirmed_flush_lsn. This can happen under the following extremely unlucky >> circumstances: >> - New slot creation reserves point in WAL since which it would read it >> (GetXLogInsertRecPtr); >> - It logs xl_running_xacts to start assembling a snapshot; >> - Running decoding session with another slot quickly reads this >> xl_running_xacts and serializes its snapshot; >> - New slot reads xl_running_xacts and picks this snapshot up, saying that it >> is ready to stream henceforth, though its reorderbuffer is empty. > > Yea, that's a problem :( > > >> Exact reproducing steps: >> >> -- session 1 >> create table t (i int); >> select pg_create_logical_replication_slot('slot_1', 'test_decoding'); >> >> -- session 2 >> begin; >> insert into t values (1); >> >> -- session 3, start slot creation >> select pg_create_logical_replication_slot('slot_2', 'test_decoding'); >> -- stop (with gdb or something) it at DecodingContextFindStartpoint(ctx); >> >> -- session 1 >> -- xl_running_xacts is dumped by ReplicationSlotReserveWal in previous command, no >> -- need to sleep; our snap will be immediately serialized there >> SELECT data FROM pg_logical_slot_get_changes('slot_1', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); >> >> -- continue slot_2 creation >> >> -- session 2: insert some more and commit >> insert into t values (1); >> commit; >> >> -- now this would find second insert, but not the first one >> SELECT data FROM pg_logical_slot_get_changes('slot_2', NULL, NULL, 'include-xids', '1', 'skip-empty-xacts', '0'); > > It'd be good to make this into an isolation test. Yeah, but to get real chance of firing this requires kinda sleep/break in the middle of pg_create_logical_replication_slot execution, so I have no idea how to do that =( > >> What we can do here? > > I think the easiest fix might actually be to just ignore serialized > snapshots when building the initial snapshot. That's an option. However, that anyway requires the distinction between new slot creation and streaming from existing slot at snapbuilder level, which currently doesn't exist and which constitutes most part of my first patch. If we add that, changing between using and not using serialized snapshots in new slot creation is easy (my patch uses it), I think this is not principal. -- Arseny Sher
pgsql-bugs by date: