Post-mortem: final 2PC patch - Mailing list pgsql-patches
From | Tom Lane |
---|---|
Subject | Post-mortem: final 2PC patch |
Date | |
Msg-id | 13905.1119109281@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: Post-mortem: final 2PC patch
Re: Post-mortem: final 2PC patch |
List | pgsql-patches |
I've attached the 2PC patch as applied in case you want to have a look. I did some fairly significant hacking on it, and I thought it'd be a good idea to enumerate some of the changes: I modified the in-memory data structure so that there is a separate dummy PGPROC for each prepared transaction (GXACT), and we control the visibility of the transaction to TransactionIdIsInProgress by inserting or removing the PGPROC in the global ProcArray. This costs a little space but it has a lot of benefits: * The 2PC feature demonstrably costs *nothing* when not used (other than whatever distributed cost you might incur from a slightly larger backend executable), because we simply did not add any code in the main paths. In particular TransactionIdIsInProgress is no different from before. * Subtransactions of a prepared transaction can usually be shown in the subxids field of its PGPROC, thus usually saving a trip to pg_subtrans for TransactionIdIsInProgress. In the original approach I think that TransactionIdIsInProgress would have had to examine pg_subtrans whenever there were any prepared transactions, because it couldn't tell much of anything about whether they had subtransactions. * It's possible to distinguish which locks belong to which prepared transaction, whereas the original approach of hooking all the locks to one dummy PGPROC lost that information. (Speaking of which, we need to extend the pg_locks view again, but I'll bring that up in a separate message to -hackers.) * It's possible to do commit correctly ;-). The commit sequence has to look like 1. Write COMMIT record to WAL, and fsync it. 2. Write transaction commit status to pg_clog. 3. Mark transaction as no longer running for TransactionIdIsInProgress. 4. Release locks. In the original coding, with TransactionIdIsInProgress looking directly at the GXACT array, there wasn't any very good way to make the step-3 transition happen at the right point --- and in fact the patch as submitted caused a gxact to drop out of TransactionIdIsInProgress before it became committed, leaving a window where inquirers would incorrectly conclude it had crashed. Another area that I changed was fsyncing of prepared transaction status files. I really didn't like expecting checkpoint to do this --- for one thing there are race conditions associated with fsyncing a status file that someone else is busy deleting. (You could possibly avoid that by having the checkpointer hold the TwoPhaseStateLock while it's fsyncing, but the performance implications of that are unpleasant.) The checkpoint coding "dealt with" the race condition by ignoring *all* open() failures, which is hardly what I'd call a robust solution. The way it's done in the committed patch, the prepare sequence is: 1. Write the status file, append a deliberately bad CRC, and fsync it. 2. Write the WAL record and fsync it. 3. Overwrite the bad CRC with a good one and fsync. It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety perspective I don't think there's much choice. Steps 2 and 3 form a critical section: if we complete 2 but fail at 3 we have to PANIC because on-disk state is not consistent with what it says in WAL. So I felt we had to do the full pushup in step 1 to be as certain as we can possibly be that step 3 won't fail. I'm not totally satisfied with this --- it's OK from a correctness standpoint but the performance leaves something to be desired. I also fooled around a bit with locking of GXACTs. In the original patch you prevent two backends from trying to commit/rollback the same GXACT by removing it from the pool at the start of the process. That's not workable anymore, so what I did was to add a field "locking_xid" showing the XID of the transaction currently working on the GXACT. As long as this XID is active according to ProcArray (ignoring the prepared xact itself of course) the GXACT cannot be touched by anyone else. This provides a convenient way to interlock creation of a GXACT as well as its eventual destruction. I also changed the handling of smgr pending deletions. The original patch made smgr a 2PC resource manager, but that does not work because XLOG replay doesn't (and shouldn't I think) run the resource managers; thus replay of a PREPARE/COMMIT sequence would fail to delete whatever files should have been deleted. The correct way to do this is that COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly like a normal COMMIT or ABORT WAL record, including listing any files to be deleted. So I pulled the files-to-delete info into the 2PC file header rather than keeping it in resource records. Lots of lesser changes, but those are the biggies. regards, tom lane
Attachment
pgsql-patches by date: