Re: Post-mortem: final 2PC patch - Mailing list pgsql-patches

From: Tom Lane
Subject: Re: Post-mortem: final 2PC patch
Date:
Msg-id: 8641.1119133122@sss.pgh.pa.us
In response to: Re: Post-mortem: final 2PC patch (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-patches
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sat, 18 Jun 2005, Tom Lane wrote:
>> I'm not totally satisfied with this --- it's OK from a correctness
>> standpoint but the performance leaves something to be desired.

> Ouch, that really hurts performance.

> In typical 2PC use, the state files live for a very short period of time,
> just long enough for the transaction manager to prepare all the resource
> managers participating in the global transaction, and then commit them.
> We're talking < 1 s. If we let the checkpoint fsync the state files, we
> would only have to fsync those state files that happen to be alive when
> the checkpoint comes.

That's a good point --- I was thinking this was basically 4 fsyncs per xact
(counting the additional WAL fsync needed for COMMIT PREPARED) versus 3,
but if the average lifetime of a state file is short then it's 4 vs 2,
and what's more the 2 are on WAL, which should be way cheaper than
fsyncing random files.

> And if we fsync the state files at the end of the
> checkpoint, after all the usual heap pages etc, it's very likely that
> even those rare state files that were alive when the checkpoint began
> have already been deleted.

That argument is bogus, at least with respect to the way you were doing
it in the original patch, because what you were fsyncing was whatever
existed when CheckPointTwoPhase() started.  It could however be
interesting if we actually implemented something that checked the age of
the prepared xact.

> Can we figure out another way to solve the race condition? Would it
> in fact be ok for the checkpointer to hold the TwoPhaseStateLock,
> considering that it usually wouldn't be held for long, since usually the
> checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on
the TwoPhaseStateLock while doing I/O; that will block both preparation
and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE
record, which we will know before we are ready to mark the gxact
"valid".

2. When CheckPointTwoPhase runs (which we'll put near the end of the
checkpoint sequence), the only gxacts that need to be fsync'd are those
that are marked valid and have a PREPARE WAL location older than the
checkpoint's redo horizon (anything newer will be replayed from WAL on
crash, so it doesn't need fsync to complete the checkpoint).  If you're
right that the lifespan of a state file is often shorter than the time
needed for a checkpoint, this wins big.  In any case we'll never have to
fsync state files that disappear before the next checkpoint.
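Steps 1 and 2 can be sketched as follows. This is a minimal mock, not the real PostgreSQL declarations: the struct name, field names, and the modeling of a WAL location as a plain integer offset are all invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL location, modeled as a plain offset */

/* Hypothetical mock of a shared-memory gxact entry (step 1 stores the
 * PREPARE record's WAL offset here before the gxact is marked valid). */
typedef struct GXactSketch
{
    bool        valid;              /* marked valid once PREPARE completes */
    XLogRecPtr  prepare_lsn;        /* WAL offset of the PREPARE record */
} GXactSketch;

/* Step 2 rule: a state file needs fsync at checkpoint only if the gxact
 * is valid and its PREPARE record falls before the redo horizon; anything
 * newer would be replayed from WAL after a crash anyway. */
static bool
needs_fsync_at_checkpoint(const GXactSketch *gxact, XLogRecPtr redo_horizon)
{
    return gxact->valid && gxact->prepare_lsn < redo_horizon;
}
```

A short-lived gxact whose PREPARE lands after the redo horizon is simply skipped, which is where the big win comes from when state files rarely outlive a checkpoint interval.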

3. One way to handle CheckPointTwoPhase is:

* At start, take TwoPhaseStateLock (can be in shared mode) for just long
enough to scan the gxact list and make a list of the XID of things that
need fsync per above rule.

* Without the lock, try to open and fsync each item in the list.
    Success: remove from list
    ENOENT failure on open: add to list of not-there failures
    Any other failure: ereport(ERROR)

* If the failure list is not empty, again take TwoPhaseStateLock in
shared mode, and check that each of the failures is now gone (or at
least marked invalid); if so it's OK, otherwise ereport the ENOENT
error.
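The three passes above can be sketched like this. The callback struct, the stand-in "file system" and "shared memory" arrays, and all function names are invented for illustration; in particular, the I/O is abstracted behind function pointers so the control flow (fsync without the lock, then re-check ENOENT victims under the lock) is visible on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TransactionId;

/* Hypothetical abstraction of the real file I/O and shared-memory lookup. */
typedef struct TwoPhaseCheckpointOps
{
    int  (*fsync_state_file)(TransactionId xid);    /* 0 or an errno value */
    bool (*gxact_still_valid)(TransactionId xid);   /* checked under the lock */
} TwoPhaseCheckpointOps;

/* Returns true on success; false where the real code would ereport(ERROR).
 * The caller has already built `xids` under TwoPhaseStateLock (pass 1). */
static bool
checkpoint_two_phase(const TransactionId *xids, size_t nxids,
                     const TwoPhaseCheckpointOps *ops)
{
    TransactionId missing[64];      /* bounded scratch, enough for the sketch */
    size_t      nmissing = 0;

    /* Pass 2: fsync each candidate without holding the lock */
    for (size_t i = 0; i < nxids; i++)
    {
        int rc = ops->fsync_state_file(xids[i]);

        if (rc == 0)
            continue;                       /* success: drop from the list */
        if (rc == ENOENT)
            missing[nmissing++] = xids[i];  /* maybe committed concurrently */
        else
            return false;                   /* any other failure is fatal */
    }

    /* Pass 3: retake the lock in shared mode; ENOENT is tolerable only if
     * the gxact has since been removed or marked invalid */
    for (size_t i = 0; i < nmissing; i++)
    {
        if (ops->gxact_still_valid(missing[i]))
            return false;       /* file vanished but gxact lives: real error */
    }
    return true;
}

/* Illustrative stand-ins simulating the file system and shared memory;
 * index is the (tiny) demo xid. */
static bool file_exists[8] = { true, true, false, true };
static bool gxact_valid[8] = { true, true, false, true };

static int  demo_fsync(TransactionId xid) { return file_exists[xid] ? 0 : ENOENT; }
static bool demo_valid(TransactionId xid) { return gxact_valid[xid]; }
```

The point of the structure is that an ENOENT during pass 2 is only a tentative failure: the lock is reacquired afterwards precisely to distinguish "committed while we weren't looking" from a genuinely missing file.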

Another possibility is to further extend the locking protocol for gxacts
so that the checkpointer can lock just the item it is fsyncing (which is
not possible at the moment because the checkpointer hasn't got an XID,
but probably we could think of another approach).  But that would
certainly delay attempts to commit the item being fsync'd, whereas the
above approach might not have to do so, depending on the filesystem
implementation.

Now there's a small problem with this approach, which is that we cannot
store the PREPARE WAL record location in the state files, since the
state file has to be completely computed before writing the WAL record.
However, we don't really need to do that: during recovery of a prepared
xact we know the thing has been fsynced (either originally, or when we
rewrote it during the WAL recovery sequence --- we can force an
immediate fsync in that one case).  So we can just put zero, or maybe
better the current end-of-WAL location, into the reconstructed gxact in
memory.
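That recovery-time choice amounts to a one-line stamp on the reconstructed entry. A hedged sketch, reusing the same illustrative mock struct as above (neither name is the real declaration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;        /* WAL location, modeled as a plain offset */

/* Illustrative mock of the in-memory gxact entry (not the real struct). */
typedef struct GXactSketch
{
    bool        valid;
    XLogRecPtr  prepare_lsn;
} GXactSketch;

/* When rebuilding a prepared xact during WAL recovery, the state file is
 * known to be safely on disk already (fsync'd originally, or forced when
 * it was rewritten during replay), so the original PREPARE location need
 * not be recovered: stamping the current end of WAL (or zero) is safe,
 * at worst costing one redundant fsync at a later checkpoint. */
static void
recover_gxact_lsn(GXactSketch *gxact, XLogRecPtr end_of_wal)
{
    gxact->valid = true;
    gxact->prepare_lsn = end_of_wal;
}
```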

Thoughts?

            regards, tom lane
