Post-mortem: final 2PC patch - Mailing list pgsql-patches

From Tom Lane
Subject Post-mortem: final 2PC patch
Date
Msg-id 13905.1119109281@sss.pgh.pa.us
Whole thread Raw
Responses Re: Post-mortem: final 2PC patch
Re: Post-mortem: final 2PC patch
List pgsql-patches
I've attached the 2PC patch as applied in case you want to have a look.
I did some fairly significant hacking on it, and I thought it'd be a
good idea to enumerate some of the changes:


I modified the in-memory data structure so that there is a separate
dummy PGPROC for each prepared transaction (GXACT), and we control
the visibility of the transaction to TransactionIdIsInProgress by
inserting or removing the PGPROC in the global ProcArray.  This costs
a little space but it has a lot of benefits:

* The 2PC feature demonstrably costs *nothing* when not used (other than
whatever distributed cost you might incur from a slightly larger backend
executable), because we simply did not add any code in the main paths.
In particular TransactionIdIsInProgress is no different from before.

* Subtransactions of a prepared transaction can usually be shown in the
subxids field of its PGPROC, thus usually saving a trip to pg_subtrans
for TransactionIdIsInProgress.  In the original approach I think that
TransactionIdIsInProgress would have had to examine pg_subtrans whenever
there were any prepared transactions, because it couldn't tell much of
anything about whether they had subtransactions.

* It's possible to distinguish which locks belong to which prepared
transaction, whereas the original approach of hooking all the locks
to one dummy PGPROC lost that information.  (Speaking of which, we need
to extend the pg_locks view again, but I'll bring that up in a separate
message to -hackers.)

* It's possible to do commit correctly ;-).  The commit sequence has
to look like
    1. Write COMMIT record to WAL, and fsync it.
    2. Write transaction commit status to pg_clog.
    3. Mark transaction as no longer running for
       TransactionIdIsInProgress.
    4. Release locks.
In the original coding, with TransactionIdIsInProgress looking directly
at the GXACT array, there wasn't any very good way to make the step-3
transition happen at the right point --- and in fact the patch as
submitted caused a gxact to drop out of TransactionIdIsInProgress before
it became committed, leaving a window where inquirers would incorrectly
conclude it had crashed.


Another area that I changed was fsyncing of prepared transaction status
files.  I really didn't like expecting checkpoint to do this --- for one
thing there are race conditions associated with fsyncing a status file
that someone else is busy deleting.  (You could possibly avoid that by
having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
but the performance implications of that are unpleasant.)  The
checkpoint coding "dealt with" the race condition by ignoring *all*
open() failures, which is hardly what I'd call a robust solution.
The way it's done in the committed patch, the prepare sequence is:
    1. Write the status file, append a deliberately bad CRC,
       and fsync it.
    2. Write the WAL record and fsync it.
    3. Overwrite the bad CRC with a good one and fsync.
It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
perspective I don't think there's much choice.  Steps 2 and 3 form a
critical section: if we complete 2 but fail at 3 we have to PANIC
because on-disk state is not consistent with what it says in WAL.  So
I felt we had to do the full pushup in step 1 to be as certain as we
can possibly be that step 3 won't fail.

I'm not totally satisfied with this --- it's OK from a correctness
standpoint but the performance leaves something to be desired.


I also fooled around a bit with locking of GXACTs.  In the original
patch you prevent two backends from trying to commit/rollback the same
GXACT by removing it from the pool at the start of the process.
That's not workable anymore, so what I did was to add a field
"locking_xid" showing the XID of the transaction currently working on
the GXACT.  As long as this XID is active according to ProcArray
(ignoring the prepared xact itself of course) the GXACT cannot be
touched by anyone else.  This provides a convenient way to interlock
creation of a GXACT as well as its eventual destruction.


I also changed the handling of smgr pending deletions.  The original
patch made smgr a 2PC resource manager, but that does not work because
XLOG replay doesn't (and shouldn't I think) run the resource managers;
thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
files should have been deleted.  The correct way to do this is that
COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
like a normal COMMIT or ABORT WAL record, including listing any files
to be deleted.  So I pulled the files-to-delete info into the 2PC file
header rather than keeping it in resource records.


Lots of lesser changes, but those are the biggies.

            regards, tom lane



Attachment

pgsql-patches by date:

Previous
From: Tom Lane
Date:
Subject: Re: TODO Item - Return compressed length of TOAST datatypes (WIP)
Next
From: Stefan Kaltenbrunner
Date:
Subject: Re: Post-mortem: final 2PC patch