Re: [HACKERS] logical decoding of two-phase transactions - Mailing list pgsql-hackers
From | Arseny Sher |
---|---|
Subject | Re: [HACKERS] logical decoding of two-phase transactions |
Date | |
Msg-id | 87zhxrwgvh.fsf@ars-thinkpad Whole thread Raw |
In response to | Re: [HACKERS] logical decoding of two-phase transactions (Nikhil Sontakke <nikhils@2ndquadrant.com>) |
Responses |
Re: [HACKERS] logical decoding of two-phase transactions
|
List | pgsql-hackers |
Nikhil Sontakke <nikhils@2ndquadrant.com> writes: >> - Decoding transactions at PREPARE record changes rules of the "we ship >> all commits after lsn 'x'" game. Namely, it will break initial >> tablesync: what if consistent snapshot was formed *after* PREPARE, but >> before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead >> of getting inital contents + continious stream of changes the receiver >> will miss the prepared xact contents and raise 'prepared xact doesn't >> exist' error. I think the starting point to address this is to forbid >> two-phase decoding of xacts with lsn of PREPARE less than >> snapbuilder's start_decoding_at. >> > > It will be the job of the plugin to return a consistent answer for > every GID that is encountered. In this case, the plugin will decode > the transaction at COMMIT PREPARED time and not at PREPARE time. I can't imagine a scenario in which plugin would want to send COMMIT PREPARED instead of replaying xact fully on CP record given it had never seen PREPARE record. On the other hand, tracking such situations on plugins side would make plugins life unneccessary complicated: either it has to dig into snapbuilder/slot internals to learn when the snapshot became consistent (which currently is impossible as this lsn is not saved anywhere btw), or it must fsync each its decision to do or not to do 2PC. Technically, my concern covers not only tablesync, but just plain decoding start: we don't want to ship COMMIT PREPARED if the downstream had never had chance to see PREPARE. As for tablesync, looking at current implementation I contemplate that we would need to do something along the following lines: - Tablesync worker performs COPY. - It then speaks with main apply worker to arrange (origin) lsn of sync point, as it does now. - Tablesync worker applies changes up to arranged lsn; it never uses two-phase decoding, all xacts are replayed on COMMIT PREPARED. Moreover, instead of going into SYNCDONE state immediately after reaching needed lsn, it stops replaying usual commits but continues to receive changes to finish all transactions which were prepared before sync point (we would need some additional support from reorderbuffer to learn when this happens). Only then it goes into SYNCDONE. - Behaviour of the main apply worker doesn't change: it ignores changes of the table in question before sync point and applies them after sync point. It also can use 2PC decoding of any transaction or not, as it desires. I believe this approach would implement tablesync correctly (all changes are applied, but only once) with minimal fuss. >> - Currently we will call abort_prepared cb even if we failed to actually >> prepare xact due to concurrent abort. I think it is confusing for >> users. We should either handle this by remembering not to invoke >> abort_prepared in these cases or at least document this behaviour, >> leaving this problem to the receiver side. > > The point is, when we reach the "ROLLBACK PREPARED", we have no idea > if the "PREPARE" was aborted by this rollback happening concurrently. > So it's possible that the 2PC has been successfully decoded and we > would have to send the rollback to the other side which would need to > check if it needs to rollback locally. I understand this. But I find this confusing for the users, so I propose to - Either document that "you might get abort_prepared cb called even after abort cb was invoked for the same transaction"; - Or consider adding some infrastructure to reorderbuffer to remember not to call abort_prepared in these cases. Due to possible reboots, I think this means that we need not to ReorderBufferCleanupTXN immediately after failed attempt to replay xact on PREPARE, but mark it as 'aborted' and keep it until we see ABORT PREPARED record. If we see that xact is marked as aborted, we don't call abort_prepared_cb. That way even if the decoder restarts in between, we will see PREPARE in WAL, inquire xact status (even if we skip it as already replayed) and mark it as aborted again. >> - I find it suspicious that DecodePrepare completely ignores actions of >> SnapBuildCommitTxn. For example, to execute invalidations, the latter >> sets base snapshot if our xact (or subxacts) did DDL and the snapshot >> not set yet. My fantasy doesn't hint me the concrete example >> where this would burn at the moment, but it should be considered. >> > > I had discussed this area with Petr and we didn't see any issues as well, then. > Ok, simplifying, what SnapBuildCommitTxn practically does is * Decide whether we are interested in tracking this xact effects, and if we are, mark it as committed. * Build and distribute snapshot to all RBTXNs, if it is important. * Set base snap of our xact if it did DDL, to execute invalidations during replay. I see that we don't need to do first two bullets during DecodePrepare: xact effects are still invisible for everyone but itself after PREPARE. As for seeing xacts own own changes, it is implemented via logging cmin/cmax and we don't need to mark xact as committed for that (c.f ReorderBufferCopySnap). Regarding the third point... I think in 2PC decoding we might need to execute invalidations twice: 1) After replaying xact on PREPARE to forget about catalog changes xact did -- it is not yet committed and must be invisible to other xacts until CP. In the latest patchset invalidations are executed only if there is at least one change in xact (it has base snap). It looks fine: we can't spoil catalogs if there were nothing to decode. Better to explain that somewhere. 2) After decoding COMMIT PREPARED to make changes visible. In current patchset it is always done. Actually, *this* is the reason RBTXN might already exist when we enter ReorderBufferFinishPrepared, not "(during restarts for example)" as comment says there: if there were inval messages, RBTXN will be created in DecodeCommit during their addition. BTW, "that we might need to execute invalidations, add snapshot" in SnapBuildCommitTxn looks like a cludge to me: I suppose it is better to do that at ReorderBufferXidSetCatalogChanges. Now, another issue is registering xact as committed in SnapBuildCommitTxn during COMMIT PREPARED processing. Since RBTXN is always purged after xact replay on PREPARE, the only medium we have for noticing catalog changes during COMMIT PREPARED is invalidation messages attached to the CP record. This raises the following question. * If there is a guarantee that whenever xact makes catalog changes it generates invalidation messages, then this code is fine. However, currently ReorderBufferXidSetCatalogChanges is also called on XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which is useless if such guarantee exists. * If, on the other hand, there is no such guarantee, this code is broken. >> - I am one of those people upthread who don't think that converting >> flags to bitmask is beneficial -- especially given that many of them >> are mutually exclusive, e.g. xact can't be committed and aborted at >> the same time. Apparently you have left this to the committer though. >> > > Hmm, there seems to be divided opinion on this. I am willing to go > back to using the booleans if there's opposition and if the committer > so wishes. Note that this patch will end up adding 4/5 more booleans > in that case (we add new ones for prepare, commit prepare, abort, > rollback prepare etc). Well, you can unite mutually exclusive fields into one enum or char with macros defining possible values. Transaction can't be committed and aborted at the same time, etc. >> + row. The <function>change_cb</function> callback may access system or >> + user catalog tables to aid in the process of outputting the row >> + modification details. In case of decoding a prepared (but yet >> + uncommitted) transaction or decoding of an uncommitted transaction, this >> + change callback is ensured sane access to catalog tables regardless of >> + simultaneous rollback by another backend of this very same transaction. >> >> I don't think we should explain this, at least in such words. As >> mentioned upthread, we should warn about allowed systable_* accesses >> instead. Same for message_cb. >> > > Looks like you are looking at an earlier patchset. The latest patchset > has removed the above. I see, sorry. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
pgsql-hackers by date: