Re: Post-mortem: final 2PC patch - Mailing list pgsql-patches

From Heikki Linnakangas
Subject Re: Post-mortem: final 2PC patch
Date
Msg-id Pine.OSF.4.61.0506182346190.262699@kosh.hut.fi
Whole thread Raw
In response to Post-mortem: final 2PC patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Post-mortem: final 2PC patch
List pgsql-patches
On Sat, 18 Jun 2005, Tom Lane wrote:

> I've attached the 2PC patch as applied in case you want to have a look.
> I did some fairly significant hacking on it, and I thought it'd be a
> good idea to enumerate some of the changes:
>
>
> I modified the in-memory data structure so that there is a separate
> dummy PGPROC for each prepared transaction (GXACT), and we control
> the visibility of the transaction to TransactionIdIsInProgress by
> inserting or removing the PGPROC in the global ProcArray.  This costs
> a little space but it has a lot of benefits:
>
> (..list of benefits..)

Yep, that looks much better. In general, a prepared transaction now looks
and behaves more like normal active transactions. That leaves less room
for error.

> Another area that I changed was fsyncing of prepared transaction status
> files.  I really didn't like expecting checkpoint to do this --- for one
> thing there are race conditions associated with fsyncing a status file
> that someone else is busy deleting.  (You could possibly avoid that by
> having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
> but the performance implications of that are unpleasant.)  The
> checkpoint coding "dealt with" the race condition by ignoring *all*
> open() failures, which is hardly what I'd call a robust solution.
>
> The way it's done in the committed patch, the prepare sequence is:
>     1. Write the status file, append a deliberately bad CRC,
>        and fsync it.
>     2. Write the WAL record and fsync it.
>     3. Overwrite the bad CRC with a good one and fsync.
> It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
> perspective I don't think there's much choice.  Steps 2 and 3 form a
> critical section: if we complete 2 but fail at 3 we have to PANIC
> because on-disk state is not consistent with what it says in WAL.  So
> I felt we had to do the full pushup in step 1 to be as certain as we
> can possibly be that step 3 won't fail.
>
> I'm not totally satisfied with this --- it's OK from a correctness
> standpoint but the performance leaves something to be desired.

Ouch, that really hurts performance.

In typical 2PC use, the state files live for a very short period of time,
just long enough for the transaction manager to prepare all the resource
managers participating in the global transaction, and then commit them.
We're talking < 1 s. If we let checkpoint to fsync the state files, we
would only have to fsync those state files that happen to be alive when
the checkpoint comes. And if we fsync the state files at the end of the
checkpoint, after all the usual heap pages etc, it's very likely that
even those rare state files that were alive when the checkpoint began,
have already been deleted.

So we're really talking about 1 fsync vs. 3 fsyncs.

Can we figure out another way to solve the race condition? Would it
in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
considering that it usually wouldn't be held for long, since usually the
checkpoint would have very little work to do?

Was there other reasons beside the race condition why you did this way?

> I also fooled around a bit with locking of GXACTs.  In the original
> patch you prevent two backends from trying to commit/rollback the same
> GXACT by removing it from the pool at the start of the process.
> That's not workable anymore, so what I did was to add a field
> "locking_xid" showing the XID of the transaction currently working on
> the GXACT.  As long as this XID is active according to ProcArray
> (ignoring the prepared xact itself of course) the GXACT cannot be
> touched by anyone else.  This provides a convenient way to interlock
> creation of a GXACT as well as its eventual destruction.

Ok.

> I also changed the handling of smgr pending deletions.  The original
> patch made smgr a 2PC resource manager, but that does not work because
> XLOG replay doesn't (and shouldn't I think) run the resource managers;
> thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
> files should have been deleted.  The correct way to do this is that
> COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
> like a normal COMMIT or ABORT WAL record, including listing any files
> to be deleted.  So I pulled the files-to-delete info into the 2PC file
> header rather than keeping it in resource records.

Ok.

- Heikki

pgsql-patches by date:

Previous
From: Michael Fuhr
Date:
Subject: Typo in pl_funcs.c comment
Next
From: Tom Lane
Date:
Subject: Re: Post-mortem: final 2PC patch