Re: Two-phase commit - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Two-phase commit
Date
Msg-id Pine.OSF.4.61.0410071357420.432862@kosh.hut.fi
In response to Re: Two-phase commit  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, 6 Oct 2004, Tom Lane wrote:

> Quite some time ago, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I haven't received any comments and there hasn't been any discussion on
>> the implementation, I suppose that nobody has given it a try. :(
>
> I finally got around to taking a close look at this.  There's a good bit
> undone, as you well know, but it seems like it can be the basis for a
> workable feature.  I do have a few comments to make.

Great!

> At the API level, I like the PREPARE/COMMIT/ROLLBACK statements, but I
> think you have missed a bet in that it needs to be possible to issue
> "COMMIT PREPARED gid" for the same gid several times without error.
> Consider a scenario where the transaction monitor crashes during the
> commit phase.  When it recovers, it will be aware that it had committed
> to commit, but it won't know which nodes were successfully committed.
> So it will need to resend the COMMIT commands.  It would be bad for the
> nodes to simply say "yes boss" if they are told to COMMIT a gid they
> have no record of.  So I think the gid's have to stick around after
> COMMIT PREPARED or ROLLBACK PREPARED, and there needs to be a fourth
> command (RELEASE PREPARED?) to actually remove the state data when the
> transaction monitor is satisfied that everything's done.  RELEASE of
> an unknown gid is okay to be a no-op.

Hmm. I don't see a problem with the "yes boss" approach. Some kind of a 
warning is appropriate, of course, but I don't see a reason for an 
additional step. After all, you would still fall back to the "yes boss" 
approach on the RELEASE PREPARED command.

The transaction monitor knows whether the 1st phase succeeded or not, so if
the COMMIT PREPARED doesn't find the transaction anymore, the monitor knows
that its previous commit/rollback succeeded.
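
A minimal sketch of that recovery argument, written in Python purely for illustration. The Node class, its methods, and the reply strings are hypothetical stand-ins, not code from the patch. It models a monitor that crashed after committing on one node and then resends COMMIT PREPARED to all of them:

```python
class Node:
    """Toy resource manager: remembers only transactions still prepared."""

    def __init__(self):
        self.prepared = set()
        self.committed = []  # stand-in for durable commit effects

    def prepare(self, gid):
        self.prepared.add(gid)

    def commit_prepared(self, gid):
        if gid in self.prepared:
            self.prepared.remove(gid)
            self.committed.append(gid)
            return "COMMIT"
        # "yes boss": the gid is unknown, so warn instead of failing
        return "WARNING: no prepared transaction " + gid


def resend_commits(nodes, gid):
    """After a monitor crash, resend COMMIT PREPARED to every node."""
    return [node.commit_prepared(gid) for node in nodes]


a, b = Node(), Node()
a.prepare("gid1")
b.prepare("gid1")
a.commit_prepared("gid1")  # monitor crashes after committing on node a only
replies = resend_commits([a, b], "gid1")
```

Node a answers with a warning and node b commits; in both cases the monitor can treat the gid as finished, because it already knows the first phase succeeded.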

> Implementation-wise, I really dislike storing the info in a shared hash
> table, because I don't see any reasonable bound on the size of the hash
> table (your existing code uses 100 which is about as arbitrary as it
> gets).  Plus the actual content of each entry is not fixed-size either.
> This is not very workable given our fixed-size shared memory mechanism.

I fully agree; I'm very dissatisfied with that part.

> The idea that occurs to me instead is to not use WAL or shared memory at
> all for keeping the prepared-transaction state info.  Instead, suppose
> that we store the status information in a file named after the GID,
> "$PGDATA/pg_twophase/gid".  We could write the file with a CRC similarly
> to what's done for pg_control.  Once such a file is written and fsync'd,
> it's equally as reliable as a WAL record would be, so it seems safe
> enough to me to report the PREPARE as done.  COMMIT, ROLLBACK, and the
> pg_prepared_xacts system view would look into the pg_twophase directory
> to find out all about active prepared transactions; RELEASE PREPARED
> would simply delete the appropriate file.  (Note: commit or rollback
> would need to take the transaction XID from the GID file and then look
> in pg_clog to find out if the transaction were already committed.  These
> operations do not change the pg_twophase file, but they do write a
> normal transaction-commit or -abort WAL record and update pg_clog.)

That sounds like a clever idea! I thought about using a single file 
myself, but the multi-file approach is much simpler.
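
The file-per-GID idea can be sketched roughly as follows, in Python for brevity. The function names, the payload format, and the choice of a trailing little-endian CRC-32 are all assumptions made for illustration; the proposal only says the file is written with a CRC and fsync'd:

```python
import os
import struct
import tempfile
import zlib


def write_state_file(directory, name, payload):
    """Write payload followed by its CRC-32, and fsync before returning."""
    path = os.path.join(directory, name)
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    with open(path, "wb") as f:
        f.write(payload)
        f.write(struct.pack("<I", crc))
        f.flush()
        os.fsync(f.fileno())  # only after this may PREPARE be reported done
    return path


def read_state_file(path):
    """Return the payload, or raise ValueError if the CRC does not match."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError("state file too short")
    payload, stored = data[:-4], struct.unpack("<I", data[-4:])[0]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("CRC mismatch")
    return payload


twophase_dir = tempfile.mkdtemp()  # stands in for $PGDATA/pg_twophase
path = write_state_file(twophase_dir, "12345", b"xid=678")
state = read_state_file(path)
os.remove(path)  # removing the state is just deleting the file
```

A real implementation would probably also have to fsync the directory itself so that the new directory entry survives a crash; the sketch leaves that out.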

> I think this would offer better performance as well as being more
> scalable, because the implementation you have looks like it would have
> some contention for the shared GID hashtable.

I guess the performance would depend a lot on how good/bad the filesystem 
is at creating and deleting a lot of small files.

> I would be inclined to require GIDs to be numbers (probably int8's)
> instead of strings, so that we don't have any problems with funny
> characters in the file names.  That's negotiable though, as we could
> certainly uuencode the strings or something to avoid that trap.

I'm afraid we have to support arbitrary strings. I think at least the Java
Transaction API requires that, though I'm not sure whether it could be
worked around in the JDBC driver.
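
If arbitrary strings stay, the encoding Tom mentions could look something like this. Hex is used here instead of uuencode only because it is the simplest reversible encoding that yields safe file names; the function names are hypothetical:

```python
import binascii


def gid_to_filename(gid):
    """Hex-encode the UTF-8 bytes of a GID so the name uses only [0-9a-f]."""
    return binascii.hexlify(gid.encode("utf-8")).decode("ascii")


def filename_to_gid(name):
    """Invert gid_to_filename."""
    return binascii.unhexlify(name).decode("utf-8")


name = gid_to_filename("../evil gid")
```

The encoded name contains no slashes, dots, or spaces, and decoding restores the original string. Hex doubles the length, so some cap on GID length would still be needed to stay within file name limits.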

> You were concerned about how to mark prepared transactions in pg_clog,
> given that Alvaro had already commandeered state '11' for
> subtransactions.  Since only a toplevel transaction can be prepared,
> it might work to allow state '11' with a zero pg_subtrans parent link
> to mean a prepared transaction.  This would imply factoring prepared
> XIDs into GlobalXmin (so that pg_subtrans entries don't get recycled
> too soon) but we probably have to do that anyway.  AFAICS, prepared
> but uncommitted XIDs have to be considered still InProgress, so if
> they are less than GlobalXmin we'd lose.

Yes, they must be considered InProgress. The snapshot code needs to be
modified to handle an arbitrary number of in-progress transactions.
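
A toy illustration of why prepared XIDs have to be factored into the xmin calculation. This is not the real snapshot code: the function and its arguments are hypothetical, and XID wraparound is ignored:

```python
def snapshot_xmin(running_xids, prepared_xids, next_xid):
    """Oldest XID that must still be treated as in progress.

    Prepared-but-uncommitted XIDs are counted exactly like running ones.
    If nothing is in progress, fall back to the next XID to be assigned.
    """
    in_progress = set(running_xids) | set(prepared_xids)
    return min(in_progress) if in_progress else next_xid


xmin = snapshot_xmin([105, 110], [90], 120)
```

Here the prepared transaction 90 holds xmin back even though no backend is running it any longer.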


I've been wondering whether it would be useful to have the COMMIT
PREPARED/ROLLBACK PREPARED commands under transaction control themselves.
You could for example do "BEGIN; COMMIT PREPARED mygid; COMMIT PREPARED
mygid2; COMMIT;" to atomically commit two already-prepared transactions,
and even chain the 2PC transactions like "BEGIN; COMMIT PREPARED mygid;
PREPARE TRANSACTION mygid2". It seems feasible to implement: just postpone
the actual 2nd phase commit until the end of the commit of the enclosing
transaction.
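
A rough sketch of the postponement, again in Python and with entirely hypothetical names. It shows only the deferral of the second phase to the enclosing COMMIT; it does not model how the final step would be made atomic or crash-safe:

```python
class Session:
    """Toy session that queues COMMIT PREPARED inside a transaction block."""

    def __init__(self, prepared):
        self.prepared = set(prepared)  # gids currently prepared
        self.pending = []              # second-phase commits awaiting COMMIT
        self.in_block = False
        self.committed = []

    def begin(self):
        self.in_block = True

    def commit_prepared(self, gid):
        if gid not in self.prepared or gid in self.pending:
            raise KeyError(gid)
        if self.in_block:
            self.pending.append(gid)   # postpone until the enclosing COMMIT
        else:
            self._finish(gid)

    def commit(self):
        for gid in self.pending:
            self._finish(gid)
        self.pending = []
        self.in_block = False

    def rollback(self):
        self.pending = []              # prepared transactions stay prepared
        self.in_block = False

    def _finish(self, gid):
        self.prepared.remove(gid)
        self.committed.append(gid)


s = Session({"mygid", "mygid2"})
s.begin()
s.commit_prepared("mygid")
s.commit_prepared("mygid2")
before_commit = list(s.committed)  # nothing has been committed yet
s.commit()
```

Rolling back the enclosing transaction simply drops the queue, leaving both gids prepared.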

- Heikki

