Thread: Post-mortem: final 2PC patch
I've attached the 2PC patch as applied in case you want to have a look. I did some fairly significant hacking on it, and I thought it'd be a good idea to enumerate some of the changes:

I modified the in-memory data structure so that there is a separate dummy PGPROC for each prepared transaction (GXACT), and we control the visibility of the transaction to TransactionIdIsInProgress by inserting or removing the PGPROC in the global ProcArray. This costs a little space but it has a lot of benefits:

* The 2PC feature demonstrably costs *nothing* when not used (other than whatever distributed cost you might incur from a slightly larger backend executable), because we simply did not add any code in the main paths. In particular, TransactionIdIsInProgress is no different from before.

* Subtransactions of a prepared transaction can usually be shown in the subxids field of its PGPROC, thus usually saving a trip to pg_subtrans for TransactionIdIsInProgress. In the original approach, I think TransactionIdIsInProgress would have had to examine pg_subtrans whenever there were any prepared transactions, because it couldn't tell much of anything about whether they had subtransactions.

* It's possible to distinguish which locks belong to which prepared transaction, whereas the original approach of hooking all the locks to one dummy PGPROC lost that information. (Speaking of which, we need to extend the pg_locks view again, but I'll bring that up in a separate message to -hackers.)

* It's possible to do commit correctly ;-). The commit sequence has to look like

	1. Write COMMIT record to WAL, and fsync it.
	2. Write transaction commit status to pg_clog.
	3. Mark transaction as no longer running for TransactionIdIsInProgress.
	4. Release locks.

  In the original coding, with TransactionIdIsInProgress looking directly at the GXACT array, there wasn't any very good way to make the step-3 transition happen at the right point --- and in fact the patch as submitted caused a gxact to drop out of TransactionIdIsInProgress before it became committed, leaving a window where inquirers would incorrectly conclude it had crashed.

Another area that I changed was fsyncing of prepared transaction status files. I really didn't like expecting checkpoint to do this --- for one thing, there are race conditions associated with fsyncing a status file that someone else is busy deleting. (You could possibly avoid that by having the checkpointer hold the TwoPhaseStateLock while it's fsyncing, but the performance implications of that are unpleasant.) The checkpoint coding "dealt with" the race condition by ignoring *all* open() failures, which is hardly what I'd call a robust solution.

The way it's done in the committed patch, the prepare sequence is:

	1. Write the status file, append a deliberately bad CRC, and fsync it.
	2. Write the WAL record and fsync it.
	3. Overwrite the bad CRC with a good one and fsync.

It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety perspective I don't think there's much choice. Steps 2 and 3 form a critical section: if we complete 2 but fail at 3, we have to PANIC because on-disk state is not consistent with what it says in WAL. So I felt we had to do the full pushup in step 1 to be as certain as we can possibly be that step 3 won't fail.

I'm not totally satisfied with this --- it's OK from a correctness standpoint but the performance leaves something to be desired.
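To make the three-fsync prepare sequence concrete, here is a minimal, self-contained C sketch of it. The helper names and the file layout are invented for illustration --- this is not the committed code --- and a trivial placeholder stands in for the real CRC:

    /* prepare_sketch.c -- illustrative only; all names are hypothetical */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BAD_CRC 0xFFFFFFFFu

    /* Placeholder checksum; the real code uses a proper CRC. */
    static uint32_t checksum(const char *data, size_t len)
    {
        uint32_t c = 0;
        for (size_t i = 0; i < len; i++)
            c = (c << 5) + c + (unsigned char) data[i];
        return c;
    }

    /* Stand-in for writing and fsyncing the PREPARE record to WAL. */
    static void xlog_prepare_record(const char *data, size_t len)
    {
        printf("WAL: PREPARE record (%zu bytes) written and fsync'd\n", len);
    }

    static void write_or_die(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t) len)
        {
            perror("write");
            exit(1);
        }
    }

    static void prepare_transaction(const char *statefile,
                                    const char *body, size_t len)
    {
        uint32_t crc = BAD_CRC;
        int fd = open(statefile, O_CREAT | O_WRONLY | O_TRUNC, 0600);

        if (fd < 0) { perror("open"); exit(1); }

        /* 1. Write the status file with a deliberately bad CRC, and
         *    fsync it now, so we can be as sure as possible that the
         *    fsync in step 3 won't fail. */
        write_or_die(fd, body, len);
        write_or_die(fd, &crc, sizeof(crc));
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }

        /* 2. Write the PREPARE record to WAL and fsync it. */
        xlog_prepare_record(body, len);

        /* Steps 2 and 3 form a critical section: any failure from here
         * on must PANIC, because disk and WAL would disagree. */

        /* 3. Overwrite the bad CRC with a good one and fsync again. */
        crc = checksum(body, len);
        if (lseek(fd, (off_t) len, SEEK_SET) < 0)
            abort();                    /* PANIC in the real thing */
        write_or_die(fd, &crc, sizeof(crc));
        if (fsync(fd) != 0)
            abort();                    /* PANIC in the real thing */

        close(fd);
    }

    int main(void)
    {
        const char body[] = "prepared-transaction state goes here";
        prepare_transaction("gxact_0001.state", body, sizeof(body));
        return 0;
    }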
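And the commit-sequence ordering from the list further above, in the same hedged style (stub routines, invented names --- only the ordering is the point): a transaction must not disappear from TransactionIdIsInProgress before pg_clog says it committed, or inquirers will conclude it crashed.

    /* commit_sketch.c -- illustrative ordering only; stub routines */
    #include <stdio.h>

    typedef unsigned int TransactionId;

    static void xlog_commit_record(TransactionId xid)
    { printf("1. WAL: COMMIT record for %u written and fsync'd\n", xid); }

    static void clog_set_committed(TransactionId xid)
    { printf("2. pg_clog: %u marked committed\n", xid); }

    static void proc_array_remove(TransactionId xid)
    { printf("3. ProcArray: %u no longer 'in progress'\n", xid); }

    static void release_locks(TransactionId xid)
    { printf("4. locks of %u released\n", xid); }

    static void commit_prepared(TransactionId xid)
    {
        xlog_commit_record(xid);    /* 1. commit is durable in WAL      */
        clog_set_committed(xid);    /* 2. ...and recorded in pg_clog    */
        proc_array_remove(xid);     /* 3. only now stop appearing       */
                                    /*    "in progress" to onlookers    */
        release_locks(xid);         /* 4. finally let others in         */
    }

    int main(void)
    {
        commit_prepared(1234);
        return 0;
    }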
I also fooled around a bit with locking of GXACTs. In the original patch you prevent two backends from trying to commit/rollback the same GXACT by removing it from the pool at the start of the process. That's not workable anymore, so what I did was to add a field "locking_xid" showing the XID of the transaction currently working on the GXACT. As long as this XID is active according to ProcArray (ignoring the prepared xact itself, of course), the GXACT cannot be touched by anyone else. This provides a convenient way to interlock creation of a GXACT as well as its eventual destruction.

I also changed the handling of smgr pending deletions. The original patch made smgr a 2PC resource manager, but that does not work because XLOG replay doesn't (and shouldn't, I think) run the resource managers; thus replay of a PREPARE/COMMIT sequence would fail to delete whatever files should have been deleted. The correct way to do this is that COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly like a normal COMMIT or ABORT WAL record, including listing any files to be deleted. So I pulled the files-to-delete info into the 2PC file header rather than keeping it in resource records.

Lots of lesser changes, but those are the biggies.

			regards, tom lane
Tom Lane wrote:
> I've attached the 2PC patch as applied in case you want to have a look.
> I did some fairly significant hacking on it, and I thought it'd be a
> good idea to enumerate some of the changes:

FYI: this commit seems to cause problems/crashes during make check on my OpenBSD/Sparc64 buildfarm client:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04

I just checked manually with a fresh checkout - and I got similar failures.

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> FYI: this commit seems to cause problems/crashes during make check on my
> OpenBSD/Sparc64 buildfarm client:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04
> I just checked manually with a fresh checkout - and I got similar failures.

Can you get a stack trace from the core dump?

As best I can tell from the buildfarm report, one of the regression tests is causing a sig11 --- but it's *not* the prepared_xacts test, so I'm not sure the failure is related to this patch.

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> FYI: this commit seems to cause problems/crashes during make check on my
>> OpenBSD/Sparc64 buildfarm client:
>> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-17%2023:50:04
>> I just checked manually with a fresh checkout - and I got similar failures.
>
> Can you get a stack trace from the core dump?

(gdb) bt
#0  0x0000000050377ba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x0000000000326efc in hash_search ()
#2  0x0000000000343430 in pg_tzset ()
#3  0x00000000001fbcf0 in assign_timezone ()
#4  0x0000000000330e88 in set_config_option ()
#5  0x000000000029b5b0 in PostgresMain ()
#6  0x000000000026b140 in BackendRun ()
#7  0x000000000026a9f0 in BackendStartup ()
#8  0x00000000002685bc in ServerLoop ()
#9  0x0000000000267b88 in PostmasterMain ()
#10 0x0000000000220cc8 in main ()

rebuilding with --enable-debug right now ...

> As best I can tell from the buildfarm report, one of the regression
> tests is causing a sig11 --- but it's *not* the prepared_xacts test,
> so I'm not sure the failure is related to this patch.

hmm - maybe one of the timezone patches?

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> Can you get a stack trace from the core dump?

> (gdb) bt
> #0  0x0000000050377ba4 in memcpy () from /usr/lib/libc.so.34.2
> #1  0x0000000000326efc in hash_search ()
> #2  0x0000000000343430 in pg_tzset ()
> #3  0x00000000001fbcf0 in assign_timezone ()
> #4  0x0000000000330e88 in set_config_option ()
> #5  0x000000000029b5b0 in PostgresMain ()
> #6  0x000000000026b140 in BackendRun ()
> #7  0x000000000026a9f0 in BackendStartup ()
> #8  0x00000000002685bc in ServerLoop ()
> #9  0x0000000000267b88 in PostmasterMain ()
> #10 0x0000000000220cc8 in main ()

> hmm - maybe one of the timezone patches?

Looks suspicious, doesn't it. How long since you last tested on that machine?

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> hmm - maybe one of the timezone patches?
>
> Looks suspicious, doesn't it. How long since you last tested on that
> machine?

*argl* - it's not 2PC ... the machine had some issues a week ago or so - but it looks like the problem occurred first here:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2005-06-15%2023:50:04

and in that changeset we have some timezone patches ...

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> the machine had some issues a week ago or so - but it looks like the
> problem occurred first here:

Hmm, what kind of issues, and are you sure they are fixed?

The stack trace looks to me like it is trying to apply the PGTZ setting that's coming in from the client during startup. Now I could believe a platform-specific breakage there from Magnus' recent hacking, but the problem is that *every* backend launched during the regression tests is going to be doing this, and doing it in the same way with the same value. It's pretty tough to see why some backends would fail at this spot and some not ... unless what it is is an intermittent hardware problem. Is there a version of memtest86 that works on your machine?

			regards, tom lane
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> the machine had some issues a week ago or so - but it looks like the
>> problem occurred first here:
>
> Hmm, what kind of issues, and are you sure they are fixed?

admin error :-). I had a second postgresql instance running, causing the buildfarm runs to report a failure because of shared-memory limits. Had nothing to do with hardware problems though ...

> The stack trace looks to me like it is trying to apply the PGTZ setting
> that's coming in from the client during startup. Now I could believe a
> platform-specific breakage there from Magnus' recent hacking, but the
> problem is that *every* backend launched during the regression tests
> is going to be doing this, and doing it in the same way with the same
> value. It's pretty tough to see why some backends would fail at this
> spot and some not ... unless what it is is an intermittent hardware
> problem. Is there a version of memtest86 that works on your machine?

memtest86 works on x86 CPUs only afaik - no?

anyway - here is the promised backtrace:

#0  0x000000004489fba4 in memcpy () from /usr/lib/libc.so.34.2
#1  0x0000000000326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
    action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
#2  0x00000000003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
#3  0x00000000001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
    doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
#4  0x0000000000330f28 in set_config_option (name=0xa52830 "timezone",
    value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
    isLocal=0 '\0', changeVal=1 '\001') at guc.c:3748
#5  0x000000000029b5f0 in PostgresMain (argc=4, argv=0xa52928,
    username=0xa52740 "mastermind") at postgres.c:2759
#6  0x000000000026b180 in BackendRun (port=0xa77800) at postmaster.c:2800
#7  0x000000000026aa30 in BackendStartup (port=0xa77800) at postmaster.c:2440
#8  0x00000000002685fc in ServerLoop () at postmaster.c:1221
#9  0x0000000000267bc8 in PostmasterMain (argc=0, argv=0xffffffffffff57d8)
    at postmaster.c:930
#10 0x0000000000220cc8 in main (argc=6, argv=0xffffffffffff57d8) at main.c:268

Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> anyway - here is the promised backtrace:
> #0  0x000000004489fba4 in memcpy () from /usr/lib/libc.so.34.2
> #1  0x0000000000326f9c in hash_search (hashp=0xa6e030, keyPtr=0xa5ff90,
>     action=HASH_ENTER, foundPtr=0x0) at dynahash.c:653
> #2  0x00000000003434f0 in pg_tzset (name=0xa5ff90 "PST8PDT") at pgtz.c:1039
> #3  0x00000000001fbcf0 in assign_timezone (value=0xa5ff90 "PST8PDT",
>     doit=1 '\001', source=PGC_S_CLIENT) at variable.c:351
> #4  0x0000000000330f28 in set_config_option (name=0xa52830 "timezone",
>     value=0xa52870 "PST8PDT", context=10645504, source=PGC_S_CLIENT,
>     isLocal=0 '\0', changeVal=1 '\001') at guc.c:3748
> #5  0x000000000029b5f0 in PostgresMain (argc=4, argv=0xa52928,
>     username=0xa52740 "mastermind") at postgres.c:2759

Yeah, that's exactly what I thought it was doing: setting timezone PST8PDT due to the client PGTZ environment variable. So the question now is why this is an intermittent failure. Why doesn't every single regression test crash in the same place?

			regards, tom lane
On Sat, 18 Jun 2005, Tom Lane wrote:

> I've attached the 2PC patch as applied in case you want to have a look.
> I did some fairly significant hacking on it, and I thought it'd be a
> good idea to enumerate some of the changes:
>
> I modified the in-memory data structure so that there is a separate
> dummy PGPROC for each prepared transaction (GXACT), and we control
> the visibility of the transaction to TransactionIdIsInProgress by
> inserting or removing the PGPROC in the global ProcArray. This costs
> a little space but it has a lot of benefits:
>
> (..list of benefits..)

Yep, that looks much better. In general, a prepared transaction now looks and behaves more like normal active transactions. That leaves less room for error.

> Another area that I changed was fsyncing of prepared transaction status
> files. I really didn't like expecting checkpoint to do this --- for one
> thing, there are race conditions associated with fsyncing a status file
> that someone else is busy deleting. (You could possibly avoid that by
> having the checkpointer hold the TwoPhaseStateLock while it's fsyncing,
> but the performance implications of that are unpleasant.) The
> checkpoint coding "dealt with" the race condition by ignoring *all*
> open() failures, which is hardly what I'd call a robust solution.
>
> The way it's done in the committed patch, the prepare sequence is:
>	1. Write the status file, append a deliberately bad CRC,
>	   and fsync it.
>	2. Write the WAL record and fsync it.
>	3. Overwrite the bad CRC with a good one and fsync.
> It's annoying to have 3 fsyncs here; 2 would be nicer; but from a safety
> perspective I don't think there's much choice. Steps 2 and 3 form a
> critical section: if we complete 2 but fail at 3, we have to PANIC
> because on-disk state is not consistent with what it says in WAL. So
> I felt we had to do the full pushup in step 1 to be as certain as we
> can possibly be that step 3 won't fail.
>
> I'm not totally satisfied with this --- it's OK from a correctness
> standpoint but the performance leaves something to be desired.

Ouch, that really hurts performance.

In typical 2PC use, the state files live for a very short period of time, just long enough for the transaction manager to prepare all the resource managers participating in the global transaction and then commit them. We're talking < 1 s. If we let checkpoint fsync the state files, we would only have to fsync those state files that happen to be alive when the checkpoint comes. And if we fsync the state files at the end of the checkpoint, after all the usual heap pages etc., it's very likely that even those rare state files that were alive when the checkpoint began have already been deleted. So we're really talking about 1 fsync vs. 3 fsyncs.

Can we figure out another way to solve the race condition? Would it in fact be OK for the checkpointer to hold the TwoPhaseStateLock, considering that it usually wouldn't be held for long, since usually the checkpoint would have very little work to do? Were there other reasons besides the race condition why you did it this way?

> I also fooled around a bit with locking of GXACTs. In the original
> patch you prevent two backends from trying to commit/rollback the same
> GXACT by removing it from the pool at the start of the process.
> That's not workable anymore, so what I did was to add a field
> "locking_xid" showing the XID of the transaction currently working on
> the GXACT.
> As long as this XID is active according to ProcArray
> (ignoring the prepared xact itself of course) the GXACT cannot be
> touched by anyone else. This provides a convenient way to interlock
> creation of a GXACT as well as its eventual destruction.

Ok.

> I also changed the handling of smgr pending deletions. The original
> patch made smgr a 2PC resource manager, but that does not work because
> XLOG replay doesn't (and shouldn't I think) run the resource managers;
> thus replay of a PREPARE/COMMIT sequence would fail to delete whatever
> files should have been deleted. The correct way to do this is that
> COMMIT or ROLLBACK PREPARED must emit a WAL record that looks exactly
> like a normal COMMIT or ABORT WAL record, including listing any files
> to be deleted. So I pulled the files-to-delete info into the 2PC file
> header rather than keeping it in resource records.

Ok.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sat, 18 Jun 2005, Tom Lane wrote:
>> I'm not totally satisfied with this --- it's OK from a correctness
>> standpoint but the performance leaves something to be desired.

> Ouch, that really hurts performance.

> In typical 2PC use, the state files live for a very short period of time,
> just long enough for the transaction manager to prepare all the resource
> managers participating in the global transaction and then commit them.
> We're talking < 1 s. If we let checkpoint fsync the state files, we
> would only have to fsync those state files that happen to be alive when
> the checkpoint comes.

That's a good point --- I was thinking this was basically 4 fsyncs per xact (counting the additional WAL fsync needed for COMMIT PREPARED) versus 3, but if the average lifetime of a state file is short then it's 4 vs 2, and what's more the 2 are on WAL, which should be way cheaper than fsyncing random files.

> And if we fsync the state files at the end of the
> checkpoint, after all the usual heap pages etc., it's very likely that
> even those rare state files that were alive when the checkpoint began
> have already been deleted.

That argument is bogus, at least with respect to the way you were doing it in the original patch, because what you were fsyncing was whatever existed when CheckPointTwoPhase() started. It could however be interesting if we actually implemented something that checked the age of the prepared xact.

> Can we figure out another way to solve the race condition? Would it
> in fact be OK for the checkpointer to hold the TwoPhaseStateLock,
> considering that it usually wouldn't be held for long, since usually the
> checkpoint would have very little work to do?

If you're concerned about throughput of 2PC xacts then we can't sit on the TwoPhaseStateLock while doing I/O; that will block both preparation and committal of all 2PC xacts for a pretty long period in CPU terms.

Here's a sketch of an idea inspired by your comment above:

1. In each gxact in shared memory, store the WAL offset of the PREPARE record, which we will know before we are ready to mark the gxact "valid".

2. When CheckPointTwoPhase runs (which we'll put near the end of the checkpoint sequence), the only gxacts that need to be fsync'd are those that are marked valid and have a PREPARE WAL location older than the checkpoint's redo horizon (anything newer will be replayed from WAL on crash, so it doesn't need fsync to complete the checkpoint). If you're right that the lifespan of a state file is often shorter than the time needed for a checkpoint, this wins big. In any case we'll never have to fsync state files that disappear before the next checkpoint.

3. One way to handle CheckPointTwoPhase is:

   * At start, take TwoPhaseStateLock (can be in shared mode) for just long enough to scan the gxact list and make a list of the XIDs of things that need fsync per the above rule.

   * Without the lock, try to open and fsync each item in the list.
     Success: remove from list.
     ENOENT failure on open: add to list of not-there failures.
     Any other failure: ereport(ERROR).

   * If the failure list is not empty, again take TwoPhaseStateLock in shared mode, and check that each of the failures is now gone (or at least marked invalid); if so it's OK, otherwise ereport the ENOENT error.
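In rough, self-contained C, that could look something like the sketch below. All names and data structures here are simplified stand-ins for illustration, not proposed patch code:

    /* checkpoint_sketch.c -- hypothetical illustration, not patch code */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef unsigned int  TransactionId;
    typedef unsigned long XLogRecPtr;      /* WAL location, simplified */

    #define MAX_GXACTS 64

    typedef struct GlobalTransaction
    {
        TransactionId xid;
        bool          valid;               /* set once PREPARE is in WAL */
        XLogRecPtr    prepare_lsn;         /* WAL offset of PREPARE record */
    } GlobalTransaction;

    /* Stand-ins for the shared-memory state and its lock. */
    static GlobalTransaction gxacts[MAX_GXACTS];
    static int numGxacts;
    static void TwoPhaseStateLock_shared(void)  {}
    static void TwoPhaseStateLock_release(void) {}

    static void state_file_path(char *buf, size_t len, TransactionId xid)
    {
        snprintf(buf, len, "pg_twophase/%08X", xid);
    }

    static bool gxact_still_valid(TransactionId xid)
    {
        for (int i = 0; i < numGxacts; i++)
            if (gxacts[i].xid == xid && gxacts[i].valid)
                return true;
        return false;
    }

    static void CheckPointTwoPhase(XLogRecPtr redo_horizon)
    {
        TransactionId tofsync[MAX_GXACTS], missing[MAX_GXACTS];
        int nfsync = 0, nmissing = 0;

        /* Step 3.1: under the shared lock, just collect the XIDs of
         * valid gxacts whose PREPARE record predates the redo horizon;
         * anything newer will be replayed from WAL after a crash. */
        TwoPhaseStateLock_shared();
        for (int i = 0; i < numGxacts; i++)
            if (gxacts[i].valid && gxacts[i].prepare_lsn < redo_horizon)
                tofsync[nfsync++] = gxacts[i].xid;
        TwoPhaseStateLock_release();

        /* Step 3.2: without the lock, open and fsync each state file;
         * remember ENOENTs, treat anything else as a hard error. */
        for (int i = 0; i < nfsync; i++)
        {
            char path[64];
            int  fd;

            state_file_path(path, sizeof(path), tofsync[i]);
            fd = open(path, O_RDWR);
            if (fd < 0)
            {
                if (errno == ENOENT)
                    missing[nmissing++] = tofsync[i];
                else
                    perror(path);          /* ereport(ERROR) for real */
                continue;
            }
            if (fsync(fd) != 0)
                perror(path);              /* ereport(ERROR) for real */
            close(fd);
        }

        /* Step 3.3: an ENOENT is harmless only if the gxact has
         * meanwhile committed or aborted; re-check under the lock. */
        TwoPhaseStateLock_shared();
        for (int i = 0; i < nmissing; i++)
            if (gxact_still_valid(missing[i]))
                fprintf(stderr, "two-phase state file for %u is missing\n",
                        missing[i]);       /* ereport the ENOENT error */
        TwoPhaseStateLock_release();
    }

    int main(void)
    {
        /* Demo: one valid gxact prepared before the redo horizon. No
         * state file actually exists here, so the sketch exercises the
         * ENOENT-recheck path and complains. */
        gxacts[0] = (GlobalTransaction){ .xid = 42, .valid = true,
                                         .prepare_lsn = 100 };
        numGxacts = 1;
        CheckPointTwoPhase(200);
        return 0;
    }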
Another possibility is to further extend the locking protocol for gxacts so that the checkpointer can lock just the item it is fsyncing (which is not possible at the moment because the checkpointer hasn't got an XID, but probably we could think of another approach). But that would certainly delay attempts to commit the item being fsync'd, whereas the above approach might not have to do so, depending on the filesystem implementation.

Now there's a small problem with this approach, which is that we cannot store the PREPARE WAL record location in the state files, since the state file has to be completely computed before writing the WAL record. However, we don't really need to do that: during recovery of a prepared xact we know the thing has been fsynced (either originally, or when we rewrote it during the WAL recovery sequence --- we can force an immediate fsync in that one case). So we can just put zero, or maybe better the current end-of-WAL location, into the reconstructed gxact in memory.

Thoughts?

			regards, tom lane
On Sat, 18 Jun 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Can we figure out another way to solve the race condition? Would it
>> in fact be OK for the checkpointer to hold the TwoPhaseStateLock,
>> considering that it usually wouldn't be held for long, since usually the
>> checkpoint would have very little work to do?
>
> If you're concerned about throughput of 2PC xacts then we can't sit on
> the TwoPhaseStateLock while doing I/O; that will block both preparation
> and committal of all 2PC xacts for a pretty long period in CPU terms.
>
> Here's a sketch of an idea inspired by your comment above:
>
> 1. In each gxact in shared memory, store the WAL offset of the PREPARE
> record, which we will know before we are ready to mark the gxact
> "valid".
>
> 2. When CheckPointTwoPhase runs (which we'll put near the end of the
> checkpoint sequence), the only gxacts that need to be fsync'd are those
> that are marked valid and have a PREPARE WAL location older than the
> checkpoint's redo horizon (anything newer will be replayed from WAL on
> crash, so it doesn't need fsync to complete the checkpoint). If you're
> right that the lifespan of a state file is often shorter than the time
> needed for a checkpoint, this wins big. In any case we'll never have to
> fsync state files that disappear before the next checkpoint.
>
> 3. One way to handle CheckPointTwoPhase is:
>
> * At start, take TwoPhaseStateLock (can be in shared mode) for just long
> enough to scan the gxact list and make a list of the XIDs of things that
> need fsync per the above rule.
>
> * Without the lock, try to open and fsync each item in the list.
> Success: remove from list.
> ENOENT failure on open: add to list of not-there failures.
> Any other failure: ereport(ERROR).
>
> * If the failure list is not empty, again take TwoPhaseStateLock in
> shared mode, and check that each of the failures is now gone (or at
> least marked invalid); if so it's OK, otherwise ereport the ENOENT
> error.

In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is marked as valid after the prepare record is written to WAL. If a checkpoint runs after the WAL record is written but before the gxact is marked as valid, it doesn't get fsynced. Right?

Otherwise, looks good to me.

> Another possibility is to further extend the locking protocol for gxacts
> so that the checkpointer can lock just the item it is fsyncing (which is
> not possible at the moment because the checkpointer hasn't got an XID,
> but probably we could think of another approach). But that would
> certainly delay attempts to commit the item being fsync'd, whereas the
> above approach might not have to do so, depending on the filesystem
> implementation.

The above sketch is much better.

> Now there's a small problem with this approach, which is that we cannot
> store the PREPARE WAL record location in the state files, since the
> state file has to be completely computed before writing the WAL record.
> However, we don't really need to do that: during recovery of a prepared
> xact we know the thing has been fsynced (either originally, or when we
> rewrote it during the WAL recovery sequence --- we can force an
> immediate fsync in that one case). So we can just put zero, or maybe
> better the current end-of-WAL location, into the reconstructed gxact in
> memory.

This reminds me of something. What should we do about XID wraparounds and prepared transactions? Should we have some mechanism to freeze prepared transactions, like heap tuples?
At the minimum, I think we should issue a warning if the xid counter approaches the oldest prepared transaction. A transaction shouldn't live that long in normal use, but I can imagine an orphaned transaction sitting there for years if it doesn't hold any locks etc. that bother other applications.

I don't think we should implement heuristic commit/rollback, though. That creates a whole new class of problems.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> In step 3.1, is it safe to skip gxacts not marked as valid? The gxact is
> marked as valid after the prepare record is written to WAL. If a checkpoint
> runs after the WAL record is written but before the gxact is marked as
> valid, it doesn't get fsynced. Right?

It's safe because we have the CheckpointStartLock: no checkpoint occurring in that interval can be trying to set its redo pointer after the PREPARE record. This is essentially the same race condition previously detected for COMMIT vs. writing clog, and we need the interlock with either approach to fsyncing 2PC.

> This reminds me of something. What should we do about XID wraparounds and
> prepared transactions?

I don't think we need any special protection. The prepared xacts will be blocking advancement of GlobalXmin, and the (new in 8.1) XID wrap detection code will shut us down safely.

> A transaction shouldn't live that long in normal use, but I can imagine an
> orphaned transaction sitting there for years if it doesn't hold any locks
> etc. that bother other applications.

No, because people will notice the effects on VACUUM if nothing else.

> I don't think we should implement heuristic commit/rollback, though.
> That creates a whole new class of problems.

Agreed. I did put in the time-of-prepare display in pg_prepared_xacts, so anyone who really wants this can program the behavior they want from outside the database.

			regards, tom lane