Thread: Two-phase commit issues

Two-phase commit issues

From
Tom Lane
Date:
I've started to look seriously at Heikki's patch for two-phase commit.
There are a few issues that probably deserve discussion:

* The major missing issue that I've come across so far is that
subtransaction and multixact state isn't preserved across a crash.
Assuming that we want to store only top-level XIDs in the shared-memory
list of prepared XIDs (which I think is important), it is essential that
crash restart rebuild the pg_subtrans status for prepared transactions.
The subxacts of a prepared xact have to be seen as still running, and
they won't be unless the subxact links are there.  Since subtrans.c is
designed to wipe all its state on restart, we need to recreate those
entries.  Fortunately this doesn't seem hard: the state file for a
prepared xact will include all of its subxact XIDs, and we can just
do SubTransSetParent() on them while rereading the state file.  (AFAICS
it's sufficient to make each subxact link directly to the top XID, even
if there was a more complex hierarchy originally.)  Similarly, we've got
to reconstruct MultiXactIds that any prepared xacts are members of, else
row-level locks taken out by prepared xacts won't be enforced correctly.
I think this can be handled if we add to the state files a list of all
MultiXactIds that each prepared xact belongs to, and then during restart
forcibly recreate those MultiXactIds.  (They would only be rebuilt with
prepared XIDs, not any ordinary XIDs that might originally have been
members.)  This seems to require some new code in multixact.c, but not
anything fundamentally difficult --- Alvaro, do you see any likely
problems in this stuff?

* The patch is designed to dump state files into WAL as well as onto
disk.  Why?  Wouldn't it be better just to write and fsync the state
file before reporting successful prepare?  That would get rid of the
need for checkpoint-time fsyncs.

* I'm inclined to think that the "gid" identifiers for prepared
transactions ought to be SQL identifiers (names), not string literals.
Was there a particular reason for making them strings?

* What are we going to do with GUC variables?  My feeling is that
the only sane answer is that PREPARE is the same as COMMIT as far as
local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
on GUC state.  Otherwise it's really unclear what to do.  Consider
    SET myvar = foo;
    BEGIN;
    SET myvar = bar;
    PREPARE gid;
    SHOW myvar;        -- what do you see ... foo or bar?
    SET myvar = baz;    -- is this even legal?
    ROLLBACK PREPARED gid;
    SHOW myvar;        -- now what do you see ... foo or baz?
Since local GUC changes aren't going to be saved/restored across a
crash anyway, I can't see a point in doing anything really complex.

* There are some fairly ugly cases associated with creation and deletion
of temporary tables as well.  I think we might want to just decree that
you can't PREPARE a transaction that included creating or dropping a
temp table.  Does anyone have much of a problem with that?
        regards, tom lane
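
The restart-time rebuild described in the first point above can be sketched as follows. This is a minimal Python model, not PostgreSQL code: `PreparedStateFile`, `rebuild_subxact_links` and `top_level_xid` are hypothetical names, and the dictionary only stands in for the subxact parent links that are wiped on restart. It assumes, as the message does, that each state file lists all of its subxact XIDs and that linking each one directly to the top XID is sufficient.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PreparedStateFile:
    """Hypothetical stand-in for one prepared transaction's state file."""
    top_xid: int
    subxids: List[int] = field(default_factory=list)


def rebuild_subxact_links(state_files: List[PreparedStateFile]) -> Dict[int, int]:
    """Rebuild a child-XID -> parent-XID map while rereading the state files.

    Every subxact is linked straight to its top-level XID; the original
    nesting hierarchy is deliberately not reproduced.
    """
    parent_of: Dict[int, int] = {}
    for sf in state_files:
        for subxid in sf.subxids:
            parent_of[subxid] = sf.top_xid  # plays the role of SubTransSetParent()
    return parent_of


def top_level_xid(xid: int, parent_of: Dict[int, int]) -> int:
    """Follow parent links until reaching an XID with no recorded parent."""
    while xid in parent_of:
        xid = parent_of[xid]
    return xid


state_files = [PreparedStateFile(100, [101, 102, 103]), PreparedStateFile(200, [201])]
links = rebuild_subxact_links(state_files)
prepared = {sf.top_xid for sf in state_files}

# A subxact is treated as still running if its top-level XID is prepared.
print(top_level_xid(102, links) in prepared)  # True
```

Only top-level XIDs appear in `prepared`, which mirrors the wish to keep just those in the shared-memory list; the subxacts are resolved through the rebuilt links.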


Re: Two-phase commit issues

From
"Joe Chang"
Date:
Hi,

One thing I would suggest is to start the global transaction at BEGIN, not at
PREPARE. That is the way to be compliant with XA.

Thanks
Joe


On 5/18/05 2:15 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I've started to look seriously at Heikki's patch for two-phase commit.
> There are a few issues that probably deserve discussion:
> 
> * The major missing issue that I've come across so far is that
> subtransaction and multixact state isn't preserved across a crash.
> Assuming that we want to store only top-level XIDs in the shared-memory
> list of prepared XIDs (which I think is important), it is essential that
> crash restart rebuild the pg_subtrans status for prepared transactions.
> The subxacts of a prepared xact have to be seen as still running, and
> they won't be unless the subxact links are there.  Since subtrans.c is
> designed to wipe all its state on restart, we need to recreate those
> entries.  Fortunately this doesn't seem hard: the state file for a
> prepared xact will include all of its subxact XIDs, and we can just
> do SubTransSetParent() on them while rereading the state file.  (AFAICS
> it's sufficient to make each subxact link directly to the top XID, even
> if there was a more complex hierarchy originally.)  Similarly, we've got
> to reconstruct MultiXactIds that any prepared xacts are members of, else
> row-level locks taken out by prepared xacts won't be enforced correctly.
> I think this can be handled if we add to the state files a list of all
> MultiXactIds that each prepared xact belongs to, and then during restart
> forcibly recreate those MultiXactIds.  (They would only be rebuilt with
> prepared XIDs, not any ordinary XIDs that might originally have been
> members.)  This seems to require some new code in multixact.c, but not
> anything fundamentally difficult --- Alvaro, do you see any likely
> problems in this stuff?
> 
> * The patch is designed to dump state files into WAL as well as onto
> disk.  Why?  Wouldn't it be better just to write and fsync the state
> file before reporting successful prepare?  That would get rid of the
> need for checkpoint-time fsyncs.
> 
> * I'm inclined to think that the "gid" identifiers for prepared
> transactions ought to be SQL identifiers (names), not string literals.
> Was there a particular reason for making them strings?
> 
> * What are we going to do with GUC variables?  My feeling is that
> the only sane answer is that PREPARE is the same as COMMIT as far as
> local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
> on GUC state.  Otherwise it's really unclear what to do.  Consider
> SET myvar = foo;
> BEGIN;
> SET myvar = bar;
> PREPARE gid;
> SHOW myvar;  -- what do you see ... foo or bar?
> SET myvar = baz; -- is this even legal?
> ROLLBACK PREPARED gid;
> SHOW myvar;  -- now what do you see ... foo or baz?
> Since local GUC changes aren't going to be saved/restored across a
> crash anyway, I can't see a point in doing anything really complex.
> 
> * There are some fairly ugly cases associated with creation and deletion
> of temporary tables as well.  I think we might want to just decree that
> you can't PREPARE a transaction that included creating or dropping a
> temp table.  Does anyone have much of a problem with that?
> 
> regards, tom lane


Re: Two-phase commit issues

From
Alvaro Herrera
Date:
On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
> I've started to look seriously at Heikki's patch for two-phase commit.

Hum.  I started a few days ago doing some reviewing, with the intention
of correcting some things here and there in order to present it all to
you later, with a pre-filter to get some bugs out.

> There are a few issues that probably deserve discussion:
> 
> * The major missing issue that I've come across so far is that
> subtransaction and multixact state isn't preserved across a crash.
[...]
> (AFAICS it's sufficient to make each subxact link directly to the top
> XID, even if there was a more complex hierarchy originally.)

Right, we don't care about the hierarchy; we know all those subXids were
committed.

> Similarly, we've got to reconstruct MultiXactIds that any prepared
> xacts are members of, else row-level locks taken out by prepared xacts
> won't be enforced correctly.  I think this can be handled if we add to
> the state files a list of all MultiXactIds that each prepared xact
> belongs to, and then during restart forcibly recreate those
> MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
> ordinary XIDs that might originally have been members.)  This seems to
> require some new code in multixact.c, but not anything fundamentally
> difficult --- Alvaro, do you see any likely problems in this stuff?

I'm not sure if it affects in any way that a Xid=1, which participates
in a MultiXactId is seen as not prepared when Xid=2 prepares, which also
participates in the same MultiXactId; if Xid=1 is prepared later, the
MultiXactId needs to be restored with both Xids as participants.


> * The patch is designed to dump state files into WAL as well as onto
> disk.  Why?  Wouldn't it be better just to write and fsync the state
> file before reporting successful prepare?  That would get rid of the
> need for checkpoint-time fsyncs.

I made the same observation.

> * I'm inclined to think that the "gid" identifiers for prepared
> transactions ought to be SQL identifiers (names), not string literals.
> Was there a particular reason for making them strings?

Ditto.

> * There are some fairly ugly cases associated with creation and deletion
> of temporary tables as well.  I think we might want to just decree that
> you can't PREPARE a transaction that included creating or dropping a
> temp table.  Does anyone have much of a problem with that?

Does this affect any of the other things that use the direct-fsync-no-WAL
path in the smgr?

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)


Re: Two-phase commit issues

From
Tom Lane
Date:
Alvaro Herrera <alvherre@surnet.cl> writes:
> On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
>> Similarly, we've got to reconstruct MultiXactIds that any prepared
>> xacts are members of, else row-level locks taken out by prepared xacts
>> won't be enforced correctly.  I think this can be handled if we add to
>> the state files a list of all MultiXactIds that each prepared xact
>> belongs to, and then during restart forcibly recreate those
>> MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
>> ordinary XIDs that might originally have been members.)

> I'm not sure if it affects in any way that a Xid=1, which participates
> in a MultiXactId is seen as not prepared when Xid=2 prepares, which also
> participates in the same MultiXactId; if Xid=1 is prepared later, the
> MultiXactId needs to be restored with both Xids as participants.

What I had in mind was that each prepared xact's state file would just
list the MultiXactIds it belongs to.  While re-reading the state files
after a crash, we'd construct the opposite lists (ie, which xacts belong
to each MultiXactId) and then write appropriate entries into
pg_multixact at completion of the re-read.  I don't think it matters
what order the xacts got prepared in.

>> * There are some fairly ugly cases associated with creation and deletion
>> of temporary tables as well.  I think we might want to just decree that
>> you can't PREPARE a transaction that included creating or dropping a
>> temp table.  Does anyone have much of a problem with that?

> Does this affect any of the other things that use the direct-fsync-no-WAL
> path in the smgr?

I don't think so.  It's not fsync that is at issue, really --- what I'm
concerned about is operations that occur at commit time.  For instance,
consider
    CREATE TEMP TABLE foo (...) ON COMMIT DELETE ROWS;
    BEGIN;
    DROP TABLE foo;
    PREPARE gid;
foo has an entry in the backend's on-commit-actions list, which the DROP
marks for deletion at commit.  It is unclear what to do with that entry
at prepare.  We can't really leave it active since the table can't be
touched afterwards (the DROP took an exclusive lock, which we no longer
own).  But simply forgetting it is not good either; what if the prepared
xact is later rolled back?  Worse, what if some other backend does that
rollback?  If it was us doing the ROLLBACK PREPARED, we could at least
in theory resurrect the ON COMMIT item, but there's no communication
path to tell us to do so when someone else does the rollback.

More generally, anything like this implies that a transaction that is no
longer ours is holding locks on our temp tables.  This is Really Bad.
(Consider what happens if our backend tries to exit --- it'll want to
delete those temp tables.)  I am more than half tempted to put some kind
of test into LockPersistAll to reject attempts to persist any lock of
any kind on a temp table.

I suppose the ideal solution would be something like what I was just
suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT,
and none of the resources associated with the temp table get assigned
to the prepared xact.  I am not sure how do-able that is, though.
        regards, tom lane
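
The re-read step proposed in this message, where each state file lists the MultiXactIds its transaction belongs to and restart builds the opposite lists, can be sketched like this. It is a hedged Python illustration only: `invert_membership` and the sample data are hypothetical, and it assumes the per-transaction lists are actually available in the state files.

```python
from collections import defaultdict


def invert_membership(multixacts_by_xid):
    """Turn {prepared xid: [multixact ids]} into {multixact id: [member xids]}.

    Only prepared XIDs appear in the result, because only they have state
    files; ordinary XIDs that were once members are not rebuilt.
    """
    members = defaultdict(set)
    for xid, mxids in multixacts_by_xid.items():
        for mxid in mxids:
            members[mxid].add(xid)
    return {mxid: sorted(xids) for mxid, xids in members.items()}


# Hypothetical contents read back from two state files. Xid 2 was prepared
# before Xid 1, but the result does not depend on that order.
from_state_files = {2: [7, 9], 1: [7]}
print(invert_membership(from_state_files))  # {7: [1, 2], 9: [2]}
```

Because the inversion runs once over all state files, the order in which the transactions were prepared does not affect the rebuilt member lists.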


Re: Two-phase commit issues

From
Alvaro Herrera
Date:
On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@surnet.cl> writes:
> > On Wed, May 18, 2005 at 05:15:09PM -0400, Tom Lane wrote:
> >> Similarly, we've got to reconstruct MultiXactIds that any prepared
> >> xacts are members of, else row-level locks taken out by prepared xacts
> >> won't be enforced correctly.  I think this can be handled if we add to
> >> the state files a list of all MultiXactIds that each prepared xact
> >> belongs to, and then during restart forcibly recreate those
> >> MultiXactIds.  (They would only be rebuilt with prepared XIDs, not any
> >> ordinary XIDs that might originally have been members.)
> 
> > I'm not sure if it affects in any way that a Xid=1, which participates
> > in a MultiXactId is seen as not prepared when Xid=2 prepares, which also
> > participates in the same MultiXactId; if Xid=1 is prepared later, the
> > MultiXactId needs to be restored with both Xids as participants.
> 
> What I had in mind was that each prepared xact's state file would just
> list the MultiXactIds it belongs to.

Hm, this assumes the transaction knows what MultiXactIds it belongs to.
This is not true, is it?  I'm not sure how to find that out.

> >> * There are some fairly ugly cases associated with creation and deletion
> >> of temporary tables as well.  I think we might want to just decree that
> >> you can't PREPARE a transaction that included creating or dropping a
> >> temp table.  Does anyone have much of a problem with that?
> 
> > Does this affect any of the other things that use the direct-fsync-no-WAL
> > path in the smgr?
> 
> I don't think so.  It's not fsync that is at issue, really --- what I'm
> concerned about is operations that occur at commit time.

I think I confused the issue with using local buffers instead of shared
buffers, for example where btree creation skips WAL.  But certainly it
doesn't have anything to do with it.


> For instance, consider
>     CREATE TEMP TABLE foo (...) ON COMMIT DELETE ROWS;
>     BEGIN;
>     DROP TABLE foo;
>     PREPARE gid;
> foo has an entry in the backend's on-commit-actions list, which the DROP
> marks for deletion at commit.  It is unclear what to do with that entry
> at prepare.  We can't really leave it active since the table can't be
> touched afterwards (the DROP took an exclusive lock, which we no longer
> own).  But simply forgetting it is not good either; what if the prepared
> xact is later rolled back?  Worse, what if some other backend does that
> rollback?  If it was us doing the ROLLBACK PREPARED, we could at least
> in theory resurrect the ON COMMIT item, but there's no communication
> path to tell us to do so when someone else does the rollback.

Hmm.  I think not being able to use temp tables is an important
restriction.  I can see the implementation issue though.

> More generally, anything like this implies that a transaction that is no
> longer ours is holding locks on our temp tables.  This is Really Bad.
> (Consider what happens if our backend tries to exit --- it'll want to
> delete those temp tables.)  I am more than half tempted to put some kind
> of test into LockPersistAll to reject attempts to persist any lock of
> any kind on a temp table.

Maybe the restriction could be lighter -- what if the prepared
transaction inserts tuples on a temp table, for example.  It's
inconsistent, I think, that a temp table could have tuples on it that
suddenly appear when some other backend commits my prepared transaction.


> I suppose the ideal solution would be something like what I was just
> suggesting for GUC: as far as temp tables go, a PREPARE is a COMMIT,
> and none of the resources associated with the temp table get assigned
> to the prepared xact.  I am not sure how do-able that is, though.

That'd require labelling tuples in temp tables with a different Xid,
than non-temp tables, no?  It'd get strange very quickly.

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"The person who did not want to sin / was obliged to sit
on hard, steep chairs / devoid, certainly, of
soft mitigations" (Patricio Vogel)
 


Re: Two-phase commit issues

From
Tom Lane
Date:
Alvaro Herrera <alvherre@surnet.cl> writes:
> On Wed, May 18, 2005 at 07:29:38PM -0400, Tom Lane wrote:
>> What I had in mind was that each prepared xact's state file would just
>> list the MultiXactIds it belongs to.

> Hm, this assumes the transaction knows what MultiXactIds it belongs to.
> This is not true, is it?  I'm not sure how to find that out.

[ thinks about that for a bit... ]  I had been thinking we could just
track it locally in each backend, but that won't do for the case where
someone adds you to a MultiXactId without your knowledge.  Seems like
we'd have to actually scan the contents of pg_multixact?  Yech.

> Maybe the restriction could be lighter -- what if the prepared
> transaction inserts tuples on a temp table, for example.  It's
> inconsistent, I think, that a temp table could have tuples on it that
> suddenly appear when some other backend commits my prepared transaction.

Yeah, there are all sorts of interesting problems there :-(.  I think
we'd be best off to punt for the moment.  I think we could enforce that a
transaction being PREPAREd hasn't touched any temp tables at all, by
checking that it holds no locks on such tables.
        regards, tom lane
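
The check suggested here, refusing PREPARE when the transaction holds any lock on a temp table, might look roughly like the following. This is a Python sketch under assumptions: `HeldLock`, `PrepareRejected` and `check_can_prepare` are made-up names, and the `is_temp` flag stands in for whatever test the lock manager would use to recognize a temp relation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class HeldLock:
    """Hypothetical record of one lock held by the current transaction."""
    relation: str
    is_temp: bool  # assumed to be known for every locked relation


class PrepareRejected(Exception):
    """Raised when a transaction is not allowed to be prepared."""


def check_can_prepare(held_locks: Iterable[HeldLock]) -> None:
    """Reject PREPARE if the transaction holds any lock on a temp table."""
    temp_relations = sorted({lock.relation for lock in held_locks if lock.is_temp})
    if temp_relations:
        raise PrepareRejected(
            "cannot PREPARE a transaction that touched temp tables: "
            + ", ".join(temp_relations)
        )


check_can_prepare([HeldLock("accounts", False)])  # passes silently

try:
    check_can_prepare([HeldLock("accounts", False), HeldLock("foo", True)])
except PrepareRejected as exc:
    print(exc)
```

Using held locks as the test covers creating, dropping, reading and writing temp tables alike, since each of those takes some lock on the table.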


Re: Two-phase commit issues

From
Heikki Linnakangas
Date:
On Wed, 18 May 2005, Tom Lane wrote:

> * The major missing issue that I've come across so far is that
> subtransaction and multixact state isn't preserved across a crash.
> Assuming that we want to store only top-level XIDs in the shared-memory
> list of prepared XIDs (which I think is important), it is essential that
> crash restart rebuild the pg_subtrans status for prepared transactions.
> The subxacts of a prepared xact have to be seen as still running, and
> they won't be unless the subxact links are there.  Since subtrans.c is
> designed to wipe all its state on restart, we need to recreate those
> entries.  Fortunately this doesn't seem hard: the state file for a
> prepared xact will include all of its subxact XIDs, and we can just
> do SubTransSetParent() on them while rereading the state file.  (AFAICS
> it's sufficient to make each subxact link directly to the top XID, even
> if there was a more complex hierarchy originally.)  Similarly, we've got
> to reconstruct MultiXactIds that any prepared xacts are members of, else
> row-level locks taken out by prepared xacts won't be enforced correctly.
> I think this can be handled if we add to the state files a list of all
> MultiXactIds that each prepared xact belongs to, and then during restart
> forcibly recreate those MultiXactIds.  (They would only be rebuilt with
> prepared XIDs, not any ordinary XIDs that might originally have been
> members.)  This seems to require some new code in multixact.c, but not
> anything fundamentally difficult --- Alvaro, do you see any likely
> problems in this stuff?

The subtransaction part is in fact there already, and it's done just like 
you described. RecoverPreparedTransactions function reads the subxids from 
the state file and calls SubTransSetParent for them.

As Alvaro pointed out elsewhere, the multixacts are harder because a 
backend doesn't know which multixactids it belongs to. AFAICS, the most 
straightforward solution is to xlog every CreateMultixact call, so that 
the multixact slru files can be completely reconstructed on recovery.

> * The patch is designed to dump state files into WAL as well as onto
> disk.  Why?  Wouldn't it be better just to write and fsync the state
> file before reporting successful prepare?  That would get rid of the
> need for checkpoint-time fsyncs.

Performance and correctness. There mustn't be a valid state file on the 
disk before the WAL entries of that transaction are on disk. Otherwise, 
the recovery might recover a transaction that in fact aborted right after 
it wrote the state file.

If we fsync the WAL prepare record first, and state file second, a crash 
in between would make it impossible to recover the transaction though the 
WAL says it's prepared.

WAL logging the state file completely saves us one fsync. The state files 
are usually small, say < 1 kb, so the tradeoff to write it twice and save 
one fsync is probably well worth it.

Third, we have to cater for PITR. I haven't given it much thought, but if 
we want to do log shipping and PITR, I believe we must have everything in 
the WAL.

> * I'm inclined to think that the "gid" identifiers for prepared
> transactions ought to be SQL identifiers (names), not string literals.
> Was there a particular reason for making them strings?

Sure. No Reason. While you're at it, do you think it's possible to make it 
unlimited size? I couldn't think of a simple way.

> * What are we going to do with GUC variables?  My feeling is that
> the only sane answer is that PREPARE is the same as COMMIT as far as
> local GUC variables go, and COMMIT/ROLLBACK PREPARED have no effect
> on GUC state.  Otherwise it's really unclear what to do.  Consider
>     SET myvar = foo;
>     BEGIN;
>     SET myvar = bar;
>     PREPARE gid;
>     SHOW myvar;        -- what do you see ... foo or bar?
>     SET myvar = baz;    -- is this even legal?
>     ROLLBACK PREPARED gid;
>     SHOW myvar;        -- now what do you see ... foo or baz?
> Since local GUC changes aren't going to be saved/restored across a
> crash anyway, I can't see a point in doing anything really complex.
>
> * There are some fairly ugly cases associated with creation and deletion
> of temporary tables as well.  I think we might want to just decree that
> you can't PREPARE a transaction that included creating or dropping a
> temp table.  Does anyone have much of a problem with that?

I think the safest way to handle the GUC case as well is to just refuse to 
prepare a transaction that has changed local GUC variables.

Another possibility is to rethink the contract of PREPARE TRANSACTION and 
COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION would put the backend to 
a state where you can't do anything else than COMMIT/ROLLBACK the prepared 
transaction, we could do more sensible things with GUC and temp tables. 
That would have complications of its own though. What would happen if 
another backend then tries to COMMIT/ROLLBACK the transaction the original 
backend is still tied to?

- Heikki
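
For comparison with the refusal suggested above, the quoted proposal (PREPARE behaves like COMMIT for local GUC variables, and COMMIT/ROLLBACK PREPARED leave GUC state alone) can be modeled in a few lines. This is a toy Python model, not backend code; the `Session` class and its methods are hypothetical and only track variable values.

```python
class Session:
    """Toy model of one backend's GUC state under the quoted proposal."""

    def __init__(self):
        self.guc = {}        # settings that have taken effect for the session
        self.pending = None  # settings changed inside the open transaction

    def begin(self):
        self.pending = {}

    def set(self, name, value):
        target = self.guc if self.pending is None else self.pending
        target[name] = value

    def show(self, name):
        if self.pending is not None and name in self.pending:
            return self.pending[name]
        return self.guc[name]

    def commit(self):
        self.guc.update(self.pending)
        self.pending = None

    def rollback(self):
        self.pending = None  # an ordinary abort discards in-transaction SETs

    def prepare(self, gid):
        self.commit()        # proposal: PREPARE is the same as COMMIT for GUC

    def commit_prepared(self, gid):
        pass                 # proposal: no effect on GUC state

    def rollback_prepared(self, gid):
        pass                 # proposal: no effect on GUC state


s = Session()
s.set("myvar", "foo")
s.begin()
s.set("myvar", "bar")
s.prepare("gid")
after_prepare = s.show("myvar")
s.set("myvar", "baz")
s.rollback_prepared("gid")
after_rollback = s.show("myvar")
print(after_prepare, after_rollback)  # bar baz
```

Under this model the example in the quoted text answers itself: SHOW returns bar after PREPARE, the later SET is legal, and SHOW returns baz after ROLLBACK PREPARED.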


Re: Two-phase commit issues

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> As Alvaro pointed out elsewhere, the multixacts are harder because a 
> backend doesn't know which multixactids it belongs to. AFAICS, the most 
> straightforward solution is to xlog every CreateMultixact call, so that 
> the multixact slru files can be completely reconstructed on recovery.

I realized this morning that in fact it *can't* know that, since even
after a particular transaction commits it's still possible for others to
add it to new multixacts.  In the case of a prepared xact it would
continue to get added to new multixacts indefinitely :-(.  So the idea
of recording info about this in the state files is clearly a loser.
I think we will indeed have to start xlogging multixact operations.

> Third, we have to cater for PITR. I haven't given it much thought, but if 
> we want to do log shipping and PITR, I believe we must have everything in 
> the WAL.

Hmm.  All your other arguments for WAL-logging a prepare are bogus, but
this one seems real.  (It's also a reason why multixact stuff needs to
be xlogged, I guess.)

>> * I'm inclined to think that the "gid" identifiers for prepared
>> transactions ought to be SQL identifiers (names), not string literals.
>> Was there a particular reason for making them strings?

> Sure. No Reason. While you're at it, do you think it's possible to make it 
> unlimited size? I couldn't think of a simple way.

Actually, one reason for wanting them to be identifiers is so that
there's a principled reason for saying what the max length is ;-)

> I think the safest way to handle the GUC case as well is to just refuse to 
> prepare a transaction that has changed local GUC variables.

That seems unnecessarily restrictive.

> Another possibility is to rethink the contract of PREPARE TRANSACTION and 
> COMMIT/ROLLBACK PREPARED. If PREPARE TRANSACTION would put the backend to 
> a state where you can't do anything else than COMMIT/ROLLBACK the prepared 
> transaction, we could do more sensible things with GUC and temp tables. 
> That would have complications of its own though. What would happen if 
> another backend then tries to COMMIT/ROLLBACK the transaction the original 
> backend is still tied to?

Yeah, I do not think this is a useful answer.  Allowing the commit to
happen somewhere else and restricting what a prepared xact can do with
temp tables seems much more useful in practice.
        regards, tom lane
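
The conclusion reached here, that multixact operations themselves have to be logged, can be illustrated with a small replay loop. This is a hedged Python sketch: the record layout, `replay_multixact_records` and `multixacts_containing` are invented for illustration and do not describe the real WAL format.

```python
def replay_multixact_records(wal):
    """Rebuild {multixact id: member xids} purely from logged creations."""
    members = {}
    for record in wal:
        if record["type"] == "multixact_create":
            members[record["mxid"]] = tuple(record["members"])
    return members


def multixacts_containing(xid, members):
    """List the multixact ids that a given xid is a member of."""
    return sorted(mxid for mxid, xids in members.items() if xid in xids)


# Hypothetical log: xid 2 is prepared, then is added to a new multixact
# afterwards, which its own state file could never have recorded.
wal = [
    {"type": "multixact_create", "mxid": 7, "members": [1, 2]},
    {"type": "prepare", "xid": 2},
    {"type": "multixact_create", "mxid": 8, "members": [2, 3]},
]

rebuilt = replay_multixact_records(wal)
print(multixacts_containing(2, rebuilt))  # [7, 8]
```

Membership is recovered from the log rather than from anything the prepared transaction knew about itself, which is why additions made after the prepare are still captured.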


Re: Two-phase commit issues

From
Bruce Momjian
Date:
Tom Lane wrote:
> I've started to look seriously at Heikki's patch for two-phase commit.
> There are a few issues that probably deserve discussion:
> 
> * The major missing issue that I've come across so far is that
> subtransaction and multixact state isn't preserved across a crash.

I am a little confused by this.  How does two-phase commit add extra
requirements on crash recovery?  I understand a crashed server might be
involved in a two-phase commit, but doesn't the transaction just roll
back?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Two-phase commit issues

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am a little confused by this.  How does two-phase commit add extra
> requirements on crash recovery?

Uh, that's more or less the entire *POINT*.  Once an open transaction is
prepared, it's supposed to survive a server crash.
        regards, tom lane
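
The requirement stated here can be made concrete with a toy participant. This is a Python sketch under stated assumptions: the `Participant` class is hypothetical, a plain dictionary stands in for durable storage, and a "crash" is modeled by discarding everything that is not in it.

```python
class Participant:
    """Toy resource manager: prepared work survives a crash, open work does not."""

    def __init__(self):
        self.durable_prepared = {}  # gid -> changes; assumed to be on disk
        self.in_progress = {}       # xid -> changes; held in memory only
        self.committed = []

    def begin(self, xid):
        self.in_progress[xid] = []

    def write(self, xid, change):
        self.in_progress[xid].append(change)

    def prepare(self, xid, gid):
        # After this point the transaction no longer depends on the session.
        self.durable_prepared[gid] = self.in_progress.pop(xid)

    def crash_and_restart(self):
        self.in_progress = {}       # ordinary open transactions just roll back

    def commit_prepared(self, gid):
        self.committed.extend(self.durable_prepared.pop(gid))

    def rollback_prepared(self, gid):
        self.durable_prepared.pop(gid)


p = Participant()
p.begin(1)
p.write(1, "debit A")
p.prepare(1, "gid-1")
p.begin(2)
p.write(2, "credit B")
p.crash_and_restart()
p.commit_prepared("gid-1")  # still possible after the restart
print(p.committed, p.in_progress)  # ['debit A'] {}
```

The unprepared transaction 2 disappears with the crash, while the prepared one can still be committed or rolled back afterwards.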


Re: Two-phase commit issues

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am a little confused by this.  How does two-phase commit add extra
> > requirements on crash recovery?
> 
> Uh, that's more or less the entire *POINT*.  Once an open transaction is
> prepared, it's supposed to survive a server crash.

Wow.  This is much more than I thought we were going to do.  I thought
if something failed after the prepare we were just going to inform the
administrator and give up.  Because you are writing a status file to
disk, it seems you are trying to recover from a crash and roll forward.

What cases would we actually fail to recover from a crash after a
PREPARE?



Re: Two-phase commit issues

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Uh, that's more or less the entire *POINT*.  Once an open transaction is
>> prepared, it's supposed to survive a server crash.

> Wow.  This is much more than I thought we were going to do.

If we tried to claim that anything less was two-phase commit, we'd be
laughed off the face of the planet ...
        regards, tom lane


Re: Two-phase commit issues

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Uh, that's more or less the entire *POINT*.  Once an open transaction is
> >> prepared, it's supposed to survive a server crash.
> 
> > Wow.  This is much more than I thought we were going to do.
> 
> If we tried to claim that anything less was two-phase commit, we'd be
> laughed off the face of the planet ...

Well, based on past discussions, our TODO has:
* Add two-phase commit

  This will involve adding a way to respond to commit failure by either
  taking the server into offline/readonly mode or notifying the
  administrator

As I remember, you said two-phase wasn't 100% reliable and we just
needed a way to report failures.



Re: Two-phase commit issues

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> As I remember, you said two-phase wasn't 100% reliable and we just
> needed a way to report failures.

[ Shrug... ]  I remain of the opinion that 2PC is a solution in search
of a problem, because it does not solve the single point of failure
issue (just moves same from the database to the 2PC controller).
But some people want it anyway, and they aren't going to be satisfied
that we are an "enterprise grade" database until we can check off this
particular bullet point.  As long as the implementation doesn't impose
any significant costs when not being used (which AFAICS Heikki's method
doesn't), I think we gotta hold our noses and do it.
        regards, tom lane


Re: Two-phase commit issues

From
jordan
Date:
Exactly.  A 2PC expects every participant that makes it to the prepare to
commit phase to survive a server restart, controller or otherwise.  Anything
less is not 2PC.

Jordan Henderson

On Fri, 2005-05-20 at 12:07 -0400, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am a little confused by this.  How does two-phase commit add extra
> > requirements on crash recovery?
>
> Uh, that's more or less the entire *POINT*.  Once an open transaction is
> prepared, it's supposed to survive a server crash.
>
>             regards, tom lane


Re: Two-phase commit issues

From
David Garamond
Date:
Tom Lane wrote:
> [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
> of a problem, because it does not solve the single point of failure
> issue (just moves same from the database to the 2PC controller).
> But some people want it anyway, and they aren't going to be satisfied
> that we are an "enterprise grade" database until we can check off this
> particular bullet point.  As long as the implementation doesn't impose
> any significant costs when not being used (which AFAICS Heikki's method
> doesn't), I think we gotta hold our noses and do it.

I thought the primary reason for having 2PC is to be able to participate
in a heterogeneous transaction, e.g. with a non-Postgres database/other
types of resource managers? 2PC is mostly about how to make these
cross-RM transactions [appear] atomic. Redundancy is not covered by the
2PC protocol.
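
As a rough illustration of the cross-RM case, the sketch below runs the two phases over a list of participants. The ResourceManager class and its method names are hypothetical stand-ins for a database, a queue, or any other resource manager; they do not correspond to any real driver API.

```python
class ResourceManager:
    """Hypothetical participant: a database, a queue, or any other RM."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. A "yes" vote is a promise that commit can succeed later.
        if self.can_prepare:
            self.state = "prepared"
        return self.can_prepare

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit everywhere only if every participant votes yes in phase 1."""
    if all(rm.prepare() for rm in participants):
        for rm in participants:
            rm.commit()  # Phase 2: everyone voted yes
        return True
    for rm in participants:
        rm.rollback()  # Phase 2: at least one participant voted no
    return False


# Both participants can prepare, so both commit.
db = ResourceManager("postgres")
queue = ResourceManager("message-queue")
ok = two_phase_commit([db, queue])

# One participant refuses to prepare, so both are rolled back.
db2 = ResourceManager("postgres")
flaky = ResourceManager("other-rm", can_prepare=False)
failed = two_phase_commit([db2, flaky])
```

The sketch leaves out exactly the part Tom objects to: the coordinator itself keeps no durable log here, so if it dies between the phases, the prepared participants are left in doubt.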

--
dave


Re: Two-phase commit issues

From
Josh Berkus
Date:
Tom,

> > [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
> > of a problem, because it does not solve the single point of failure
> > issue (just moves same from the database to the 2PC controller).
> > But some people want it anyway, and they aren't going to be satisfied
> > that we are an "enterprise grade" database until we can check off this
> > particular bullet point.  As long as the implementation doesn't impose
> > any significant costs when not being used (which AFAICS Heikki's method
> > doesn't), I think we gotta hold our noses and do it.

2PC is a key to supporting 3rd-party replication tools, like C-JDBC.   And is
useful for some other use cases, like slow-WAN-based financial transactions.
We know you don't like it, Tom.  ;-)

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Two-phase commit issues

From
José Orlando Pereira
Date:
On Friday 20 May 2005 18:14, Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > As I remember, you said two-phase wasn't 100% reliable and we just
> > needed a way to report failures.
>
> [ Shrug... ]  I remain of the opinion that 2PC is a solution in search
> of a problem, because it does not solve the single point of failure
> issue (just moves same from the database to the 2PC controller).

You're right. 2PC to coordinate replicas of the same data is not that
interesting. It is, however, most interesting when coordinating updates to
different objects such as (i) a central database server and a local staging
area or (ii) a database server and transactional queues in a workflow-style
app.

> But some people want it anyway, and they aren't going to be satisfied
> that we are an "enterprise grade" database until we can check off this
> particular bullet point.  As long as the implementation doesn't impose
> any significant costs when not being used (which AFAICS Heikki's method
> doesn't), I think we gotta hold our noses and do it.

It is definitely on the checklist if you're shopping for a database to go
with your buzzword-compliant J2EE app server. :-)

-- 
Jose Orlando Pereira


Re: Two-phase commit issues

From
Heikki Linnakangas
Date:
On Thu, 19 May 2005, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>
>>> * I'm inclined to think that the "gid" identifiers for prepared
>>> transactions ought to be SQL identifiers (names), not string literals.
>>> Was there a particular reason for making them strings?
>
>> Sure. No Reason. While you're at it, do you think it's possible to make it
>> unlimited size? I couldn't think of a simple way.
>
> Actually, one reason for wanting them to be identifiers is so that
> there's a principled reason for saying what the max length is ;-)

I took a closer look at the JTA spec and saw that the Xid, which is 
translated to a gid in the jdbc driver, consists of a format identifier 
(32-bit int), a branch qualifier (max 64 bytes) and a global transaction 
identifier (max 64 bytes).

That means that gid needs to hold 132 raw bytes minimum.
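
The arithmetic can be checked with a small sketch. The encoding below (a 4-byte format id, two 1-byte length fields, then the raw bytes) is an assumption chosen for illustration, not what the jdbc driver does; it shows that 132 bytes is the raw payload only, and that any framing or text encoding needs more room.

```python
import struct

MAX_GTRID = 64  # max global transaction id length, per the figures above
MAX_BQUAL = 64  # max branch qualifier length, per the figures above


def pack_xid(format_id, gtrid, bqual):
    """Hypothetical encoding: format id, two length bytes, then the raw bytes."""
    if len(gtrid) > MAX_GTRID or len(bqual) > MAX_BQUAL:
        raise ValueError("gtrid and bqual are limited to 64 bytes each")
    header = struct.pack(">iBB", format_id, len(gtrid), len(bqual))
    return header + gtrid + bqual


def unpack_xid(raw):
    """Inverse of pack_xid."""
    format_id, glen, blen = struct.unpack(">iBB", raw[:6])
    gtrid = raw[6:6 + glen]
    bqual = raw[6 + glen:6 + glen + blen]
    return format_id, gtrid, bqual


# Worst case: both variable-length parts at their maximum size.
worst = pack_xid(0x7FFFFFFF, b"g" * MAX_GTRID, b"b" * MAX_BQUAL)

payload_bytes = 4 + MAX_GTRID + MAX_BQUAL  # raw minimum: 132
encoded_bytes = len(worst)                 # with two length bytes: 134
hex_chars = len(worst.hex())               # sent as a hex string: 268
```

Because the two parts are variable-length, some form of length information has to travel with them, and a hex or similar text encoding roughly doubles the size. That is part of why sending the gid as raw bytes rather than a string looks attractive.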

Also, it would be nice if the driver could send the gid as a bytea, 
without converting it to a string. Similar to using parameter markers 
and parse / bind messages with regular queries. That would require a 
change in the FE/BE protocol, right?

The branch qualifier and global transaction id structure comes from 
the OSI CCR specification. Anyone here that knows more about OSI CCR?

- Heikki


Re: Two-phase commit issues

From
José Orlando Pereira
Date:
On Saturday 21 May 2005 03:37, Josh Berkus wrote:
> 2PC is a key to supporting 3rd-party replication tools, like C-JDBC.

I don't think C-JDBC requires 2PC for replication. Mixed up acronyms maybe? :)

-- 
Jose Orlando Pereira


Re: Two-phase commit issues

From
Heikki Linnakangas
Date:
On Tue, 7 Jun 2005, Alvaro Herrera wrote:

> On Sat, May 21, 2005 at 06:57:24PM +0300, Heikki Linnakangas wrote:
>
> Heikki,
>
>> I took a closer look at the JTA spec and saw that the Xid, which is
>> translated to a gid in the jdbc driver, consists of a format identifier
>> (32-bit int), a branch qualifier (max 64 bytes) and a global transaction
>> identifier (max 64 bytes).
>>
>> That means that gid needs to hold 132 raw bytes minimum.
>>
>> Also, it would be nice if the driver could send the gid as a bytea,
>> without converting it to a string. Similar to using parameter markers
>> and parse / bind messages with regular queries. That would require a
>> change in the FE/BE protocol, right?
>>
>> The branch qualifier and global transaction id structure comes from
>> the OSI CCR specification. Anyone here that knows more about OSI CCR?
>
> I think I'm going to try to do this by hacking the lexer some (this has
> the added benefit of me learning a little about lexers).  Do you have a
> URL to those specs you mention?  How authoritative are they?  I mean,
> they are not the SQL spec, right?

The JTA spec
http://java.sun.com/products/jta/

Relevant X/Open XA documents:
http://www.opengroup.org/bookstore/catalog/tp.htm

See especially page 19 of "Distributed Transaction Processing: The XA
Specification", which contains the xa.h header file that specifies the
format of the transaction identifier.

It matches with the format in the JTA spec, but the JTA spec also mentions 
the OCI CCR format which I haven't been able to find:
http://java.sun.com/products/jta/jta-1_0_1B-doc/javax/transaction/xa/Xid.html

In addition to those two, I bumped into RFC2371. It basically allows 
any format.

I don't have access to the SQL spec, so I can't comment on that. I'd 
regard the XA spec as the most authoritative standard in the field.

- Heikki


Re: Two-phase commit issues

From
Jochem van Dieten
Date:
On 6/11/05, Heikki Linnakangas wrote:
>
> It matches with the format in the JTA spec, but the JTA spec also mentions
> the OCI CCR format

The "OSI" CCR format, which appears to refer to ISO/IEC 9805-1.

ISO/IEC 9805-1:1998
15-12-1998
Information technology - Open Systems Interconnection - Protocol for
the Commitment, Concurrency and Recovery service element: Protocol
specification

This standard is to be applied by reference from other specifications.
Specifies a use of the ACSE, Presentation and Session services to
carry the CCR semantics. Specifies the static and dynamic conformance
requirements for systems implementing these procedures. Specifies the
protocol elements that support the following functional units: -
static commitment; - dynamic commitment; - read only; - one-phase
commitment; - cancel; and overlapped recovery.


Unfortunately that standard is not included in my university's
subscription to ISO standards so I can't tell you what it says about
the format.

Jochem


Re: Two-phase commit issues

From
Heikki Linnakangas
Date:
On Sat, 11 Jun 2005, Jochem van Dieten wrote:

> The "OSI" CCR format, which appears to refer to ISO/IEC 9805-1.
>
> ISO/IEC 9805-1:1998
> 15-12-1998
> Information technology - Open Systems Interconnection - Protocol for
> the Commitment, Concurrency and Recovery service element: Protocol
> specification
>
> This standard is to be applied by reference from other specifications.
> Specifies a use of the ACSE, Presentation and Session services to
> carry the CCR semantics. Specifies the static and dynamic conformance
> requirements for systems implementing these procedures. Specifies the
> protocol elements that support the following functional units: -
> static commitment; - dynamic commitment; - read only; - one-phase
> commitment; - cancel; and overlapped recovery.
>
>
> Unfortunately that standard is not included in my university's
> subscription to ISO standards so I can't tell you what it says about
> the format.

Great, thanks anyway! Anyone here with access to the content?

- Heikki