Thread: Post-mortem: final 2PC patch
I've attached the 2PC patch as applied in case you want to have a look. I did some fairly significant hacking on it, and I thought it'd be a good idea to enumerate some of the changes:

I modified the in-memory data structure so that there is a separate dummy PGPROC for each prepared transaction (GXACT), and we control the visibility of the transaction to TransactionIdIsInProgress by inserting or removing the PGPROC in the global ProcArray. This costs a little space but it has a lot of benefits:

* The 2PC feature demonstrably costs *nothing* when not used (other than whatever distributed cost you might incur from a slightly larger backend executable), because we simply did not add any code in the main paths. In particular, TransactionIdIsInProgress is no different from before.

* Subtransactions of a prepared transaction can usually be shown in the subxids field of its PGPROC, thus usually saving a trip to pg_subtrans for TransactionIdIsInProgress. In the original approach, I think TransactionIdIsInProgress would have had to examine pg_subtrans whenever there were any prepared transactions, because it couldn't tell much of anything about whether they had subtransactions.

* It's possible to distinguish which locks belong to which prepared transaction, whereas the original approach of hooking all the locks to one dummy PGPROC lost that information. (Speaking of which, we need to extend the pg_locks view again, but I'll bring that up in a separate message to -hackers.)

* It's possible to do commit correctly ;-). The commit sequence has to look like

	1. Write COMMIT record to WAL, and fsync it.
	2. Write transaction commit status to pg_clog.
	3. Mark transaction as no longer running for TransactionIdIsInProgress.
	4. Release locks.

  In the original coding, with TransactionIdIsInProgress looking directly at the GXACT array, there wasn't any very good way to make the step-3 transition happen at the right point --- and in fact the patch as submitted caused a gxact to drop out of TransactionIdIsInProgress before it became committed, leaving a window where inquirers would incorrectly conclude it had crashed.

Another area that I changed was fsyncing of prepared transaction status files. I really didn't like expecting checkpoint to do this --- for one thing, there are race conditions associated with fsyncing a status file that someone else is busy deleting. (You could possibly avoid that by having the checkpointer hold the TwoPhaseStateLock while it's fsyncing, but the performance implications of that are unpleasant.) The checkpoint coding "dealt with" the race condition by ignoring *all* open() failures, which is hardly what I'd call a robust solution.

The way it's done in the committed patch, the prepare sequence is:

	1. Write the status file, append a deliberately bad CRC, and fsync it.
	2. Write the WAL record and fsync it.
	3. Overwrite the bad CRC with a good one and fsync.

It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety perspective I don't think there's much choice. Steps 2 and 3 form a critical section: if we complete 2 but fail at 3, we have to PANIC because on-disk state is not consistent with what it says in WAL. So I felt we had to do the full pushup in step 1 to be as certain as we can possibly be that step 3 won't fail.

I'm not totally satisfied with this --- it's OK from a correctness standpoint but the performance leaves something to be desired.
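To make the three-fsync prepare sequence concrete, here is a minimal, self-contained C sketch of it. The helper names and the file layout are invented for illustration --- this is not the committed code --- and a trivial placeholder stands in for the real CRC:

    /* prepare_sketch.c -- illustrative only; all names are hypothetical */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BAD_CRC 0xFFFFFFFFu

    /* Placeholder checksum; the real code uses a proper CRC. */
    static uint32_t checksum(const char *data, size_t len)
    {
        uint32_t c = 0;
        for (size_t i = 0; i < len; i++)
            c = (c << 5) + c + (unsigned char) data[i];
        return c;
    }

    /* Stand-in for writing and fsyncing the PREPARE record to WAL. */
    static void xlog_prepare_record(const char *data, size_t len)
    {
        printf("WAL: PREPARE record (%zu bytes) written and fsync'd\n", len);
    }

    static void write_or_die(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t) len)
        {
            perror("write");
            exit(1);
        }
    }

    static void prepare_transaction(const char *statefile,
                                    const char *body, size_t len)
    {
        uint32_t crc = BAD_CRC;
        int fd = open(statefile, O_CREAT | O_WRONLY | O_TRUNC, 0600);

        if (fd < 0) { perror("open"); exit(1); }

        /* 1. Write the status file with a deliberately bad CRC, and
         *    fsync it now, so we can be as sure as possible that the
         *    fsync in step 3 won't fail. */
        write_or_die(fd, body, len);
        write_or_die(fd, &crc, sizeof(crc));
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }

        /* 2. Write the PREPARE record to WAL and fsync it. */
        xlog_prepare_record(body, len);

        /* Steps 2 and 3 form a critical section: any failure from here
         * on must PANIC, because disk and WAL would disagree. */

        /* 3. Overwrite the bad CRC with a good one and fsync again. */
        crc = checksum(body, len);
        if (lseek(fd, (off_t) len, SEEK_SET) < 0)
            abort();                    /* PANIC in the real thing */
        write_or_die(fd, &crc, sizeof(crc));
        if (fsync(fd) != 0)
            abort();                    /* PANIC in the real thing */

        close(fd);
    }

    int main(void)
    {
        const char body[] = "prepared-transaction state goes here";
        prepare_transaction("gxact_0001.state", body, sizeof(body));
        return 0;
    }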
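And the commit-sequence ordering from the list further above, in the same hedged style (stub routines, invented names --- only the ordering is the point): a transaction must not disappear from TransactionIdIsInProgress before pg_clog says it committed, or inquirers will conclude it crashed.

    /* commit_sketch.c -- illustrative ordering only; stub routines */
    #include <stdio.h>

    typedef unsigned int TransactionId;

    static void xlog_commit_record(TransactionId xid)
    { printf("1. WAL: COMMIT record for %u written and fsync'd\n", xid); }

    static void clog_set_committed(TransactionId xid)
    { printf("2. pg_clog: %u marked committed\n", xid); }

    static void proc_array_remove(TransactionId xid)
    { printf("3. ProcArray: %u no longer 'in progress'\n", xid); }

    static void release_locks(TransactionId xid)
    { printf("4. locks of %u released\n", xid); }

    static void commit_prepared(TransactionId xid)
    {
        xlog_commit_record(xid);    /* 1. commit is durable in WAL      */
        clog_set_committed(xid);    /* 2. ...and recorded in pg_clog    */
        proc_array_remove(xid);     /* 3. only now stop appearing       */
                                    /*    "in progress" to onlookers    */
        release_locks(xid);         /* 4. finally let others in         */
    }

    int main(void)
    {
        commit_prepared(1234);
        return 0;
    }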
I also fooled around a bit with locking of GXACTs. In the original patch you prevent two backends from trying to commit/rollback the same GXACT by removing it from the pool at the start of the process. That's not workable anymore, so what I did was to add a field "locking_xid" showing the XID of the transaction currently working on the GXACT. As long as this XID is active according to ProcArray (ignoring the prepared xact itself, of course), the GXACT cannot be touched by anyone else. This provides a convenient way to interlock creation of a GXACT as well as its eventual destruction.

I also changed the handling of smgr pending deletions. The original patch made smgr a 2PC resource manager, but that does not work because XLOG replay doesn't (and shouldn't, I think) run the resource managers; thus replay of a PREPARE/COMMIT sequence would fail to delete whatever files should have been deleted. The correct way to do this is that COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly like a normal COMMIT or ABORT WAL record, including listing any files to be deleted. So I pulled the files-to-delete info into the 2PC file header rather than keeping it in resource records.

Lots of lesser changes, but those are the biggies.

			regards, tom lane
Tom Lane wrote:
> I've attached the 2PC patch as applied in case you want to have a look.
> I did some fairly significant hacking on it, and I thought it'd be a
> good idea to enumerate some of the changes:

FYI: this commit seems to cause problems/crashes during make check on my OpenBSD/Sparc64 buildfarm client:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04

I just checked manually with a fresh checkout - and I got similar failures.

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FYI: this commit seems to cause problems/crashes during make check on my
> OpenBSD/Sparc64 buildfarm client:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04
> I just checked manually with a fresh checkout - and I got similar failures.

Can you get a stack trace from the core dump?

As best I can tell from the buildfarm report, one of the regression tests is causing a sig11 --- but it's *not* the prepared_xacts test, so I'm not sure the failure is related to this patch.

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> FYI: this commit seems to cause problems/crashes during make check on my
>> OpenBSD/Sparc64 buildfarm client:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04
>> I just checked manually with a fresh checkout - and I got similar failures.
>
> Can you get a stack trace from the core dump?

(gdb) bt
#0  0x0000000050377ba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x0000000000326efc in hash_search ()
#2  0x0000000000343430 in pg_tzset ()
#3  0x00000000001fbcf0 in assign_timezone ()
#4  0x0000000000330e88 in set_config_option ()
#5  0x000000000029b5b0 in PostgresMain ()
#6  0x000000000026b140 in BackendRun ()
#7  0x000000000026a9f0 in BackendStartup ()
#8  0x00000000002685bc in ServerLoop ()
#9  0x0000000000267b88 in PostmasterMain ()
#10 0x0000000000220cc8 in main ()

rebuilding with --enable-debug right now ...

> As best I can tell from the buildfarm report, one of the regression
> tests is causing a sig11 --- but it's *not* the prepared_xacts test,
> so I'm not sure the failure is related to this patch.

hmm - maybe one of the timezone patches?

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Can you get a stack trace from the core dump?

> (gdb) bt
> #0  0x0000000050377ba4 in memcpy () from /usr/lib/libc.so.34.2
> #1  0x0000000000326efc in hash_search ()
> #2  0x0000000000343430 in pg_tzset ()
> #3  0x00000000001fbcf0 in assign_timezone ()
> #4  0x0000000000330e88 in set_config_option ()
> #5  0x000000000029b5b0 in PostgresMain ()
> #6  0x000000000026b140 in BackendRun ()
> #7  0x000000000026a9f0 in BackendStartup ()
> #8  0x00000000002685bc in ServerLoop ()
> #9  0x0000000000267b88 in PostmasterMain ()
> #10 0x0000000000220cc8 in main ()

> hmm - maybe one of the timezone patches?

Looks suspicious, doesn't it. How long since you last tested on that machine?

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm - maybe one of the timezone patches?
>
> Looks suspicious, doesn't it. How long since you last tested on that
> machine?

*argl* - it's not 2PC ... the machine had some issues a week ago or so - but it looks like the problem occurred first here:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-15%2023:50:04

and in that changeset we have some timezone patches ...

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> the machine had some issues a week ago or so - but it looks like the
> problem occurred first here:

Hmm, what kind of issues, and are you sure they are fixed?

The stack trace looks to me like it is trying to apply the PGTZ setting that's coming in from the client during startup. Now I could believe a platform-specific breakage there from Magnus' recent hacking, but the problem is that *every* backend launched during the regression tests is going to be doing this, and doing it in the same way with the same value. It's pretty tough to see why some backends would fail at this spot and some not ... unless what it is is an intermittent hardware problem. Is there a version of memtest86 that works on your machine?

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> the machine had some issues a week ago or so - but it looks like the
>> problem occurred first here:
>
> Hmm, what kind of issues, and are you sure they are fixed?

admin error :-). I had a second postgresql instance running, causing the buildfarm runs to report a failure because of shared-memory limits. Had nothing to do with hardware problems though ...

> The stack trace looks to me like it is trying to apply the PGTZ setting
> that's coming in from the client during startup. Now I could believe a
> platform-specific breakage there from Magnus' recent hacking, but the
> problem is that *every* backend launched during the regression tests
> is going to be doing this, and doing it in the same way with the same
> value. It's pretty tough to see why some backends would fail at this
> spot and some not ... unless what it is is an intermittent hardware
> problem. Is there a version of memtest86 that works on your machine?

memtest86 works on x86 CPUs only afaik - no?

anyway - here is the promised backtrace:

#0  0x000000004489fba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x0000000000326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
    action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
#2  0x00000000003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
#3  0x00000000001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
    doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
#4  0x0000000000330f28 in set_config_option (name=0xa52830 "timezone",
    value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
    isLocal=0 '\0', changeVal=1 '\001') at guc.c:3748
#5  0x000000000029b5f0 in PostgresMain (argc=4, argv=0xa52928,
    username=0xa52740 "mastermind") at postgres.c:2759
#6  0x000000000026b180 in BackendRun (port=0xa77800) at postmaster.c:2800
#7  0x000000000026aa30 in BackendStartup (port=0xa77800) at postmaster.c:2440
#8  0x00000000002685fc in ServerLoop () at postmaster.c:1221
#9  0x0000000000267bc8 in PostmasterMain (argc=0, argv=0xffffffffffff57d8)
    at postmaster.c:930
#10 0x0000000000220cc8 in main (argc=6, argv=0xffffffffffff57d8) at main.c:268

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> anyway - here is the promised backtrace:
> #0  0x000000004489fba4 in memcpy () from /usr/lib/libc.so.34.2
> #1  0x0000000000326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
>     action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
> #2  0x00000000003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
> #3  0x00000000001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
>     doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
> #4  0x0000000000330f28 in set_config_option (name=0xa52830 "timezone",
>     value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
>     isLocal=0 '\0', changeVal=1 '\001') at guc.c:3748
> #5  0x000000000029b5f0 in PostgresMain (argc=4, argv=0xa52928,
>     username=0xa52740 "mastermind") at postgres.c:2759

Yeah, that's exactly what I thought it was doing: setting timezone PST8PDT due to the client PGTZ environment variable. So the question now is why this is an intermittent failure. Why doesn't every single regression test crash in the same place?

			regards, tom lane
On Sat, 18 Jun 2005, Tom Lane wrote:

> I've attached the 2PC patch as applied in case you want to have a look.
> I did some fairly significant hacking on it, and I thought it'd be a
> good idea to enumerate some of the changes:
>
> I modified the in-memory data structure so that there is a separate
> dummy PGPROC for each prepared transaction (GXACT), and we control
> the visibility of the transaction to TransactionIdIsInProgress by
> inserting or removing the PGPROC in the global ProcArray. This costs
> a little space but it has a lot of benefits:
>
> (..list of benefits..)

Yep, that looks much better. In general, a prepared transaction now looks and behaves more like normal active transactions. That leaves less room for error.

> Another area that I changed was fsyncing of prepared transaction status
> files. I really didn't like expecting checkpoint to do this --- for one
> thing, there are race conditions associated with fsyncing a status file
> that someone else is busy deleting. (You could possibly avoid that by
> having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
> but the performance implications of that are unpleasant.) The
> checkpoint coding "dealt with" the race condition by ignoring *all*
> open() failures, which is hardly what I'd call a robust solution.
>
> The way it's done in the committed patch, the prepare sequence is:
>	1. Write the status file, append a deliberately bad CRC,
>	   and fsync it.
>	2. Write the WAL record and fsync it.
>	3. Overwrite the bad CRC with a good one and fsync.
> It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
> perspective I don't think there's much choice. Steps 2 and 3 form a
> critical section: if we complete 2 but fail at 3, we have to PANIC
> because on-disk state is not consistent with what it says in WAL. So
> I felt we had to do the full pushup in step 1 to be as certain as we
> can possibly be that step 3 won't fail.
>
> I'm not totally satisfied with this --- it's OK from a correctness
> standpoint but the performance leaves something to be desired.

Ouch, that really hurts performance.

In typical 2PC use, the state files live for a very short period of time, just long enough for the transaction manager to prepare all the resource managers participating in the global transaction and then commit them. We're talking < 1 s. If we let checkpoint fsync the state files, we would only have to fsync those state files that happen to be alive when the checkpoint comes. And if we fsync the state files at the end of the checkpoint, after all the usual heap pages etc., it's very likely that even those rare state files that were alive when the checkpoint began have already been deleted. So we're really talking about 1 fsync vs. 3 fsyncs.

Can we figure out another way to solve the race condition? Would it in fact be OK for the checkpointer to hold the TwoPhaseStateLock, considering that it usually wouldn't be held for long, since usually the checkpoint would have very little work to do? Were there other reasons besides the race condition why you did it this way?

> I also fooled around a bit with locking of GXACTs. In the original
> patch you prevent two backends from trying to commit/rollback the same
> GXACT by removing it from the pool at the start of the process.
> That's not workable anymore, so what I did was to add a field
> "locking_xid" showing the XID of the transaction currently working on
> the GXACT.
> As long as this XID is active according to ProcArray
> (ignoring the prepared xact itself of course) the GXACT cannot be
> touched by anyone else. This provides a convenient way to interlock
> creation of a GXACT as well as its eventual destruction.

Ok.

> I also changed the handling of smgr pending deletions. The original
> patch made smgr a 2PC resource manager, but that does not work because
> XLOG replay doesn't (and shouldn't I think) run the resource managers;
> thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
> files should have been deleted. The correct way to do this is that
> COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
> like a normal COMMIT or ABORT WAL record, including listing any files
> to be deleted. So I pulled the files-to-delete info into the 2PC file
> header rather than keeping it in resource records.

Ok.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sat, 18 Jun 2005, Tom Lane wrote:
>> I'm not totally satisfied with this --- it's OK from a correctness
>> standpoint but the performance leaves something to be desired.

> Ouch, that really hurts performance.

> In typical 2PC use, the state files live for a very short period of time,
> just long enough for the transaction manager to prepare all the resource
> managers participating in the global transaction and then commit them.
> We're talking < 1 s. If we let checkpoint fsync the state files, we
> would only have to fsync those state files that happen to be alive when
> the checkpoint comes.

That's a good point --- I was thinking this was basically 4 fsyncs per xact (counting the additional WAL fsync needed for COMMIT PREPARED) versus 3, but if the average lifetime of a state file is short then it's 4 vs 2, and what's more the 2 are on WAL, which should be way cheaper than fsyncing random files.

> And if we fsync the state files at the end of the
> checkpoint, after all the usual heap pages etc., it's very likely that
> even those rare state files that were alive when the checkpoint began
> have already been deleted.

That argument is bogus, at least with respect to the way you were doing it in the original patch, because what you were fsyncing was whatever existed when CheckPointTwoPhase() started. It could however be interesting if we actually implemented something that checked the age of the prepared xact.

> Can we figure out another way to solve the race condition? Would it
> in fact be OK for the checkpointer to hold the TwoPhaseStateLock,
> considering that it usually wouldn't be held for long, since usually the
> checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on the TwoPhaseStateLock while doing I/O; that will block both preparation and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE record, which we will know before we are ready to mark the gxact "valid".

2. When CheckPointTwoPhase runs (which we'll put near the end of the checkpoint sequence), the only gxacts that need to be fsync'd are those that are marked valid and have a PREPARE WAL location older than the checkpoint's redo horizon (anything newer will be replayed from WAL on crash, so it doesn't need fsync to complete the checkpoint). If you're right that the lifespan of a state file is often shorter than the time needed for a checkpoint, this wins big. In any case we'll never have to fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

   * At start, take TwoPhaseStateLock (can be in shared mode) for just long enough to scan the gxact list and make a list of the XIDs of things that need fsync per the above rule.

   * Without the lock, try to open and fsync each item in the list.
     Success: remove from list.
     ENOENT failure on open: add to list of not-there failures.
     Any other failure: ereport(ERROR).

   * If the failure list is not empty, again take TwoPhaseStateLock in shared mode, and check that each of the failures is now gone (or at least marked invalid); if so it's OK, otherwise ereport the ENOENT error.
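In rough, self-contained C, that could look something like the sketch below. All names and data structures here are simplified stand-ins for illustration, not proposed patch code:

    /* checkpoint_sketch.c -- hypothetical illustration, not patch code */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef unsigned int  TransactionId;
    typedef unsigned long XLogRecPtr;      /* WAL location, simplified */

    #define MAX_GXACTS 64

    typedef struct GlobalTransaction
    {
        TransactionId xid;
        bool          valid;               /* set once PREPARE is in WAL */
        XLogRecPtr    prepare_lsn;         /* WAL offset of PREPARE record */
    } GlobalTransaction;

    /* Stand-ins for the shared-memory state and its lock. */
    static GlobalTransaction gxacts[MAX_GXACTS];
    static int numGxacts;
    static void TwoPhaseStateLock_shared(void)  {}
    static void TwoPhaseStateLock_release(void) {}

    static void state_file_path(char *buf, size_t len, TransactionId xid)
    {
        snprintf(buf, len, "pg_twophase/%08X", xid);
    }

    static bool gxact_still_valid(TransactionId xid)
    {
        for (int i = 0; i < numGxacts; i++)
            if (gxacts[i].xid == xid && gxacts[i].valid)
                return true;
        return false;
    }

    static void CheckPointTwoPhase(XLogRecPtr redo_horizon)
    {
        TransactionId tofsync[MAX_GXACTS], missing[MAX_GXACTS];
        int nfsync = 0, nmissing = 0;

        /* Step 3.1: under the shared lock, just collect the XIDs of
         * valid gxacts whose PREPARE record predates the redo horizon;
         * anything newer will be replayed from WAL after a crash. */
        TwoPhaseStateLock_shared();
        for (int i = 0; i < numGxacts; i++)
            if (gxacts[i].valid && gxacts[i].prepare_lsn < redo_horizon)
                tofsync[nfsync++] = gxacts[i].xid;
        TwoPhaseStateLock_release();

        /* Step 3.2: without the lock, open and fsync each state file;
         * remember ENOENTs, treat anything else as a hard error. */
        for (int i = 0; i < nfsync; i++)
        {
            char path[64];
            int  fd;

            state_file_path(path, sizeof(path), tofsync[i]);
            fd = open(path, O_RDWR);
            if (fd < 0)
            {
                if (errno == ENOENT)
                    missing[nmissing++] = tofsync[i];
                else
                    perror(path);          /* ereport(ERROR) for real */
                continue;
            }
            if (fsync(fd) != 0)
                perror(path);              /* ereport(ERROR) for real */
            close(fd);
        }

        /* Step 3.3: an ENOENT is harmless only if the gxact has
         * meanwhile committed or aborted; re-check under the lock. */
        TwoPhaseStateLock_shared();
        for (int i = 0; i < nmissing; i++)
            if (gxact_still_valid(missing[i]))
                fprintf(stderr, "two-phase state file for %u is missing\n",
                        missing[i]);       /* ereport the ENOENT error */
        TwoPhaseStateLock_release();
    }

    int main(void)
    {
        /* Demo: one valid gxact prepared before the redo horizon. No
         * state file actually exists here, so the sketch exercises the
         * ENOENT-recheck path and complains. */
        gxacts[0] = (GlobalTransaction){ .xid = 42, .valid = true,
                                         .prepare_lsn = 100 };
        numGxacts = 1;
        CheckPointTwoPhase(200);
        return 0;
    }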
Another possibility is to further extend the locking protocol for gxacts so that the checkpointer can lock just the item it is fsyncing (which is not possible at the moment because the checkpointer hasn't got an XID, but probably we could think of another approach). But that would certainly delay attempts to commit the item being fsync'd, whereas the above approach might not have to do so, depending on the filesystem implementation.

Now there's a small problem with this approach, which is that we cannot store the PREPARE WAL record location in the state files, since the state file has to be completely computed before writing the WAL record. However, we don't really need to do that: during recovery of a prepared xact we know the thing has been fsynced (either originally, or when we rewrote it during the WAL recovery sequence --- we can force an immediate fsync in that one case). So we can just put zero, or maybe better the current end-of-WAL location, into the reconstructed gxact in memory.

Thoughts?

			regards, tom lane
On Sat, 18 Jun 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Can we figure out another way to solve the race condition? Would it
>> in fact be OK for the checkpointer to hold the TwoPhaseStateLock,
>> considering that it usually wouldn't be held for long, since usually the
>> checkpoint would have very little work to do?
>
> If you're concerned about throughput of 2PC xacts then we can't sit on
> the TwoPhaseStateLock while doing I/O; that will block both preparation
> and committal of all 2PC xacts for a pretty long period in CPU terms.
>
> Here's a sketch of an idea inspired by your comment above:
>
> 1. In each gxact in shared memory, store the WAL offset of the PREPARE
> record, which we will know before we are ready to mark the gxact
> "valid".
>
> 2. When CheckPointTwoPhase runs (which we'll put near the end of the
> checkpoint sequence), the only gxacts that need to be fsync'd are those
> that are marked valid and have a PREPARE WAL location older than the
> checkpoint's redo horizon (anything newer will be replayed from WAL on
> crash, so it doesn't need fsync to complete the checkpoint). If you're
> right that the lifespan of a state file is often shorter than the time
> needed for a checkpoint, this wins big. In any case we'll never have to
> fsync state files that disappear before the next checkpoint.
>
> 3. One way to handle CheckPointTwoPhase is:
>
> * At start, take TwoPhaseStateLock (can be in shared mode) for just long
> enough to scan the gxact list and make a list of the XIDs of things that
> need fsync per the above rule.
>
> * Without the lock, try to open and fsync each item in the list.
> Success: remove from list.
> ENOENT failure on open: add to list of not-there failures.
> Any other failure: ereport(ERROR).
>
> * If the failure list is not empty, again take TwoPhaseStateLock in
> shared mode, and check that each of the failures is now gone (or at
> least marked invalid); if so it's OK, otherwise ereport the ENOENT
> error.

In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is marked as valid after the prepare record is written to WAL. If a checkpoint runs after the WAL record is written but before the gxact is marked as valid, it doesn't get fsynced. Right?

Otherwise, looks good to me.

> Another possibility is to further extend the locking protocol for gxacts
> so that the checkpointer can lock just the item it is fsyncing (which is
> not possible at the moment because the checkpointer hasn't got an XID,
> but probably we could think of another approach). But that would
> certainly delay attempts to commit the item being fsync'd, whereas the
> above approach might not have to do so, depending on the filesystem
> implementation.

The above sketch is much better.

> Now there's a small problem with this approach, which is that we cannot
> store the PREPARE WAL record location in the state files, since the
> state file has to be completely computed before writing the WAL record.
> However, we don't really need to do that: during recovery of a prepared
> xact we know the thing has been fsynced (either originally, or when we
> rewrote it during the WAL recovery sequence --- we can force an
> immediate fsync in that one case). So we can just put zero, or maybe
> better the current end-of-WAL location, into the reconstructed gxact in
> memory.

This reminds me of something. What should we do about XID wraparounds and prepared transactions? Should we have some mechanism to freeze prepared transactions, like heap tuples?
At the minimum, I think we should issue a warning if the xid counter approaches the oldest prepared transaction. A transaction shouldn't live that long in normal use, but I can imagine an orphaned transaction sitting there for years if it doesn't hold any locks etc. that bother other applications.

I don't think we should implement heuristic commit/rollback, though. That creates a whole new class of problems.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is
> marked as valid after the prepare record is written to WAL. If a checkpoint
> runs after the WAL record is written but before the gxact is marked as
> valid, it doesn't get fsynced. Right?

It's safe because we have the CheckpointStartLock: no checkpoint occurring in that interval can be trying to set its redo pointer after the PREPARE record. This is essentially the same race condition previously detected for COMMIT vs. writing clog, and we need the interlock with either approach to fsyncing 2PC.

> This reminds me of something. What should we do about XID wraparounds and
> prepared transactions?

I don't think we need any special protection. The prepared xacts will be blocking advancement of GlobalXmin, and the (new in 8.1) XID wrap detection code will shut us down safely.

> A transaction shouldn't live that long in normal use, but I can imagine an
> orphaned transaction sitting there for years if it doesn't hold any locks
> etc. that bother other applications.

No, because people will notice the effects on VACUUM if nothing else.

> I don't think we should implement heuristic commit/rollback, though.
> That creates a whole new class of problems.

Agreed. I did put in the time-of-prepare display in pg_prepared_xacts, so anyone who really wants this can program the behavior they want from outside the database.

			regards, tom lane