Thread: Question concerning XTM (eXtensible Transaction Manager API)

Question concerning XTM (eXtensible Transaction Manager API)

From
Konstantin Knizhnik
Date:
Hello,

Some time ago at PgConn.Vienna we proposed the eXtensible Transaction Manager API (XTM).
The idea is to make it possible to provide custom implementations of transaction managers as standard Postgres extensions; the primary goal is implementation of a distributed transaction manager.
It should not only support 2PC, but also provide consistent snapshots for global transactions executed at different nodes.

Actually, the current version of the XTM API does not propose any particular 2PC model. It can be implemented either at the coordinator side (as it is done in our pg_tsdtm implementation, https://github.com/postgrespro/pg_tsdtm, which is based on timestamps and does not require a centralized arbiter), or by an arbiter (pg_dtm, https://github.com/postgrespro/pg_dtm). In the latter case the 2PC logic is hidden under the XTM SetTransactionStatus method:

     bool (*SetTransactionStatus)(TransactionId xid, int nsubxids, TransactionId *subxids, XidStatus status, XLogRecPtr lsn);

which encapsulates TransactionIdSetTreeStatus in clog.c.
But you may notice that the original TransactionIdSetTreeStatus function is void - it is not intended to return anything.
It is called in RecordTransactionCommit inside a critical section, where it is not expected that commit may fail.
But in the case of DTM, a transaction may be rejected by the arbiter. The XTM API allows control over access to CLOG, so everybody will see that the transaction is aborted. But we in any case have to somehow notify the client about the abort of the transaction.

We can not just call elog(ERROR,...) in the SetTransactionStatus implementation, because inside a critical section it causes a Postgres crash with a panic message. So we have to remember that the transaction was rejected and report the error later, after exit from the critical section:

        /*
         * Now we may update the CLOG, if we wrote a COMMIT record above
         */
        if (markXidCommitted) {
            committed = TransactionIdCommitTree(xid, nchildren, children);
        }
    ...
        /*
         * If we entered a commit critical section, leave it now, and let
         * checkpoints proceed.
         */
        if (markXidCommitted)
        {
            MyPgXact->delayChkpt = false;
            END_CRIT_SECTION();
            if (!committed) {
                CurrentTransactionState->state = TRANS_ABORT;
                CurrentTransactionState->blockState = TBLOCK_ABORT_PENDING;
                elog(ERROR, "Transaction commit rejected by XTM");
            }
        }

There is one more problem - at this moment the state of the transaction is TRANS_COMMIT.
If the ERROR handler tries to abort it, then we get yet another fatal error: an attempt to rollback a committed transaction.
So we need to hide the fact that the transaction is actually committed in the local XLOG.

This approach works, but looks a little bit like a hack. It requires not only replacing the direct call of TransactionIdSetTreeStatus with an indirect one (through the XTM API), but also making some non-obvious changes in RecordTransactionCommit.
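
For illustration, the DTM-side implementation of this hook might look roughly like the sketch below. ArbiterApproveCommit() is a hypothetical arbiter call and the installation mechanism is omitted; this shows only the shape of the override, not the actual pg_dtm code:

    static bool
    DtmSetTransactionStatus(TransactionId xid, int nsubxids,
                            TransactionId *subxids,
                            XidStatus status, XLogRecPtr lsn)
    {
        if (status == TRANSACTION_STATUS_COMMITTED &&
            !ArbiterApproveCommit(xid))          /* hypothetical arbiter RPC */
            status = TRANSACTION_STATUS_ABORTED; /* hide the commit in CLOG */

        /* Standard CLOG update, with the possibly overridden status. */
        TransactionIdSetTreeStatus(xid, nsubxids, subxids, status, lsn);

        /* The return value is what lets RecordTransactionCommit report the
         * rejection after it has left the critical section. */
        return status == TRANSACTION_STATUS_COMMITTED;
    }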
So what are the alternatives?

1. Move RecordTransactionCommit to XTM. In this case we have to copy the original RecordTransactionCommit to the DTM implementation and patch it there. It is also not nice, because it will complicate maintenance of the DTM implementation.
The primary idea of XTM is to allow development of a DTM as a standard PostgreSQL extension, without creating specific clones of the main PostgreSQL source tree. But this idea will be compromised if we have to copy&paste some pieces of PostgreSQL code.
In some sense it is even worse than maintaining a separate branch - in the latter case at least we have some way to perform an automatic merge.

2. Propose some alternative two-phase commit implementation in PostgreSQL core. The main motivation for such a "lightweight" implementation of 2PC in pg_dtm is that the original mechanism of prepared transactions in PostgreSQL adds too much overhead.
In our benchmarks we have found that a simple credit-debit banking test (without any DTM) works almost 10 times slower with PostgreSQL 2PC than without it. This is why we try to propose an alternative solution (right now pg_dtm is 2 times slower than vanilla PostgreSQL, but it not only performs 2PC but also provides consistent snapshots).

Maybe somebody can suggest some other solution?
Or give some comments concerning the current approach?

Thanks in advance,
Konstantin,
Postgres Professional

Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Alvaro Herrera
Date:
Konstantin Knizhnik wrote:

> But you may notice that original TransactionIdSetTreeStatus function is void
> - it is not intended to return anything.
> It is called in RecordTransactionCommit in critical section where it is not
> expected that commit may fail.
> But in case of DTM transaction may be rejected by arbiter. XTM API allows to
> control access to CLOG, so everybody will see that transaction is aborted.
> But we in any case have to somehow notify client about abort of transaction.

I think you'll need to rethink how a transaction commits rather
completely, rather than consider localized tweaks to specific functions.
For one thing, the WAL record about transaction commit has already been
written by XactLogCommitRecord much earlier than calling
TransactionIdCommitTree.  So if you were to crash at that point, it
doesn't matter how much the arbiter has rejected the transaction, WAL
replay would mark it as committed.  Also, what about the replication
origin stuff and the TransactionTreeSetCommitTsData() call?

I think you need to involve the arbiter earlier, so that the commit
process can be aborted earlier than those things.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Konstantin Knizhnik
Date:
<div class="moz-cite-prefix">On 11/16/2015 10:54 PM, Alvaro Herrera wrote:<br /></div><blockquote
cite="mid:20151116195453.GC614468@alvherre.pgsql"type="cite"><pre wrap="">Konstantin Knizhnik wrote:
 

</pre><blockquote type="cite"><pre wrap="">But you may notice that original TransactionIdSetTreeStatus function is
void
- it is not intended to return anything.
It is called in RecordTransactionCommit in critical section where it is not
expected that commit may fail.
But in case of DTM transaction may be rejected by arbiter. XTM API allows to
control access to CLOG, so everybody will see that transaction is aborted.
But we in any case have to somehow notify client about abort of transaction.
</pre></blockquote><pre wrap="">
I think you'll need to rethink how a transaction commits rather
completely, rather than consider localized tweaks to specific functions.
For one thing, the WAL record about transaction commit has already been
written by XactLogCommitRecord much earlier than calling
TransactionIdCommitTree.  So if you were to crash at that point, it
doesn't matter how much the arbiter has rejected the transaction, WAL
replay would mark it as committed. </pre></blockquote><br /> Yes, WAL replay will recover this transaction and try to
markit in CLOG as completed, but ... we have <font face="Arial">caught </font>control over CLOG using XTM.<br /> And
insteadof direct writing to CLOG, DTM will contact arbiter and ask his opinion concerning this transaction. <br /> If
arbiterdoesn't think that it was committed, then it will not be marked as committed in local CLOG.<br /><br /><br
/><blockquotecite="mid:20151116195453.GC614468@alvherre.pgsql" type="cite"><pre wrap=""> Also, what about the
replication
origin stuff and the TransactionTreeSetCommitTsData() call?

I think you need to involve the arbiter earlier, so that the commit
process can be aborted earlier than those things.

</pre></blockquote><br />

Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Kevin Grittner
Date:
On Monday, November 16, 2015 2:47 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

> Some time ago at PgConn.Vienna we have proposed eXtensible
>    Transaction Manager API (XTM).
> The idea is to be able to provide custom implementation of
>    transaction managers as standard Postgres extensions,
> primary goal is implementation of distributed transaction manager.
> It should not only support 2PC, but also provide consistent
>    snapshots for global transaction executed at different nodes.
>
> Actually, current version of XTM API does not propose any particular 2PC
>    model. It can be implemented either at coordinator side
> (as it is done in our pg_tsdtm implementation based on timestamps
> and not requiring centralized arbiter), either by arbiter
> (pg_dtm).

I'm not entirely clear on what you're saying here.  I admit I've
not kept in close touch with the distributed processing discussions
lately -- is there a write-up and/or diagram to give an overview of
where we're at with this effort?

> In the last case 2PC logic is hidden under XTM
> SetTransactionStatus method:
>
>      bool (*SetTransactionStatus)(TransactionId xid, int nsubxids,
>    TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
>
> which encapsulates TransactionIdSetTreeStatus in clog.c.
> But you may notice that original TransactionIdSetTreeStatus function
>    is void - it is not intended to return anything.
> It is called in RecordTransactionCommit in critical section where it
>    is not expected that commit may fail.

This issue, though, seems clear enough.  At some point a
transaction must cross a hard line between when it is not committed
and when it is, since after commit subsequent transactions can then
see the data and modify it.  There has to be some "point of no
return" in order to have any sane semantics.  Entering that critical
section is it.

> But in case of DTM transaction may be rejected by arbiter. XTM API
>    allows to control access to CLOG, so everybody will see that
>    transaction is aborted. But we in any case have to somehow notify
>    client about abort of transaction.

If you are saying that DTM tries to roll back a transaction after
any participating server has entered the RecordTransactionCommit()
critical section, then IMO it is broken.  Full stop.  That can't
work with any reasonable semantics as far as I can see.

> We can not just call elog(ERROR,...) in SetTransactionStatus
>    implementation because inside critical section it causes Postgres
>    crash with panic message. So we have to remember that transaction is
>    rejected and report error later after exit from critical section:

I don't believe that is a good plan.  You should not enter the
critical section for recording that a commit is complete until all
the work for the commit is done except for telling all the
servers that all servers are ready.

> There is one more problem - at this moment the state of transaction
>    is TRANS_COMMIT.
> If ERROR handler will try to abort it, then we get yet another fatal
>    error: attempt to rollback committed transaction.
> So we need to hide the fact that transaction is actually committed
>    in local XLOG.

That is one of pretty much an infinite variety of problems you have
if you don't have a "hard line" for when the transaction is finally
committed.

> This approach works, but looks a little bit like a hack. It
>    requires not only to replace direct call of
>    TransactionIdSetTreeStatus with indirect (through XTM API), but also
>    requires to make some non obvious changes in
>    RecordTransactionCommit.
>
> So what are the alternatives?
>
> 1. Move RecordTransactionCommit to XTM. In this case we have to copy
>    original RecordTransactionCommit to DTM implementation and patch it
>    here. It is also not nice, because it will complicate maintenance of
>    DTM implementation.
> The primary idea of XTM is to allow development of DTM as standard
>    PostgreSQL extension without creating specific clones of main
>    PostgreSQL source tree. But this idea will be compromised if we have
>    to copy&paste some pieces of PostgreSQL code.
> In some sense it is even worse than maintaining a separate branch -
>    in the latter case at least we have some way to perform an automatic merge.

You can have a call in XTM that says you want to record the
commit on all participating servers, but I don't see where that
would involve moving anything we have now out of each participating
server -- it would just need to function like a real,
professional-quality distributed transaction manager doing the
second phase of a two-phase commit.  If any participating server
goes through the first phase and reports that all the heavy lifting
is done, and then is swallowed up in a pyroclastic flow of an
erupting volcano before phase 2 comes around, the DTM must
periodically retry until the administrator cancels the attempt.
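
To make that concrete, a phase-2 driver along the lines described would have to keep retrying. This is only a sketch; NodeFinishPrepared() and AdminCancelRequested() are hypothetical helpers, not existing APIs:

    static void
    DtmFinishPrepared(const char *gid, char **nodes, int nnodes, bool commit)
    {
        for (int i = 0; i < nnodes; i++)
        {
            /* Keep retrying an unreachable participant until an
             * administrator explicitly cancels the attempt. */
            while (!NodeFinishPrepared(nodes[i], gid, commit))
            {
                if (AdminCancelRequested(gid))
                    return;
                pg_usleep(1000000L);    /* wait a second, then retry */
            }
        }
    }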

> 2. Propose some alternative two-phase commit implementation in
>    PostgreSQL core. The main motivation for such "lightweight"
>    implementation of 2PC in pg_dtm is that original mechanism of
>    prepared transactions in PostgreSQL adds too much overhead.
> In our benchmarks we have found that simple credit-debit banking
>    test (without any DTM) works almost 10 times slower with PostgreSQL
>    2PC than without it. This is why we try to propose alternative
>    solution (right now pg_dtm is 2 times slower than vanilla
>    PostgreSQL, but it not only performs 2PC but also provides consistent
>    snapshots).

Are you talking about 10x the latency on a commit, or that the
overall throughput under saturation load is one tenth of running
without something to guarantee the transactional integrity of the
whole set of nodes?  The former would not be too surprising, while
the latter would be rather amazing.

One interesting coincidence is that I've been told off-list about a
large company which developed their own internal distributed
version of PostgreSQL in which they implemented Serializable
Snapshot Isolation, and reported a 10x performance impact.  They
were reportedly fine with that, since the integrity of their data
was worth it to them and it performed better than the alternatives;
but the number matching like that does make me wonder how hard it
would be to break that barrier in terms of latency, if you really
preserve ACID properties across a distributed system.

Of course, latency and throughput are two completely different
things.

> Maybe somebody can suggest some other solution?

Well, it seems like it might be possible for some short cuts to be
taken when the part of the distributed transaction run on a
particular server was read-only.  I'm not sure of the details, but
it seems conceptually possible to minimize network latency issues.


> Or give some comments concerning current approach?

I don't see how the DTM approach described could provide anything
resembling ACID guarantees; but if you want to give up all claim to
that, you might be able to provide something useful with a more lax
set of assurances.  (After all, there are some popular products out
there which are pretty "relaxed" about such things.)  In that case
maybe the XTM interface could include some way to query what
guarantees an implementation does provide.  I don't mean this
sarcastically -- there are use cases for different levels of
guarantee here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Question concerning XTM (eXtensible Transaction Manager API)

From
konstantin knizhnik
Date:
Thank you for your response.


On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:

> I'm not entirely clear on what you're saying here.  I admit I've
> not kept in close touch with the distributed processing discussions
> lately -- is there a write-up and/or diagram to give an overview of
> where we're at with this effort?
>
> If you are saying that DTM tries to roll back a transaction after
> any participating server has entered the RecordTransactionCommit()
> critical section, then IMO it is broken.  Full stop.  That can't
> work with any reasonable semantics as far as I can see.

DTM is not trying to rollback a committed transaction.
What it tries to do is to hide this commit.
As I already wrote, the idea was to implement "lightweight" 2PC, because the prepared transactions mechanism in PostgreSQL adds too much overhead and causes some problems with recovery.

The transaction is normally committed in xlog, so that it can always be recovered in case of node fault.
But before setting the corresponding bit(s) in CLOG and releasing locks, we first contact the arbiter to get the global status of the transaction.
If it is successfully locally committed by all nodes, then the arbiter approves the commit and the commit of the transaction completes normally.
Otherwise the arbiter rejects the commit. In this case DTM marks the transaction as aborted in CLOG and returns an error to the client.
XLOG is not changed, and in case of failure PostgreSQL will try to replay this transaction.
But during recovery it also tries to restore the transaction status in CLOG.
And at this place DTM contacts the arbiter to learn the status of the transaction.
If it is marked as aborted in the arbiter's CLOG, then it will also be marked as aborted in the local CLOG.
And according to PostgreSQL visibility rules, no other transaction will see changes made by this transaction.
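
Expressed as a sketch, the recovery-time path repeats the same arbiter check, so a transaction rejected before a crash stays hidden after it. ArbiterGetGlobalStatus() is a hypothetical call standing in for the arbiter lookup; the rest are existing PostgreSQL internals:

    /* Invoked while restoring transaction status in CLOG during replay. */
    static void
    DtmRecoverTransactionStatus(TransactionId xid, int nsubxids,
                                TransactionId *subxids, XLogRecPtr lsn)
    {
        /* Ask the arbiter instead of trusting the local commit record. */
        XidStatus verdict = ArbiterGetGlobalStatus(xid);

        /* If the arbiter says aborted, mark the transaction aborted in the
         * local CLOG too; visibility rules then hide its changes. */
        TransactionIdSetTreeStatus(xid, nsubxids, subxids, verdict, lsn);
    }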




>> We can not just call elog(ERROR,...) in SetTransactionStatus
>> implementation because inside critical section it causes Postgres
>> crash with panic message. So we have to remember that transaction is
>> rejected and report error later after exit from critical section:
>
> I don't believe that is a good plan.  You should not enter the
> critical section for recording that a commit is complete until all
> the work for the commit is done except for telling all the
> servers that all servers are ready.

It is a good point.
Maybe it is the reason for the performance scalability problems we have noticed with DTM.

>> In our benchmarks we have found that simple credit-debit banking
>> test (without any DTM) works almost 10 times slower with PostgreSQL
>> 2PC than without it. This is why we try to propose alternative
>> solution (right now pg_dtm is 2 times slower than vanilla
>> PostgreSQL, but it not only performs 2PC but also provides consistent
>> snapshots).
>
> Are you talking about 10x the latency on a commit, or that the
> overall throughput under saturation load is one tenth of running
> without something to guarantee the transactional integrity of the
> whole set of nodes?  The former would not be too surprising, while
> the latter would be rather amazing.

Sorry, some clarification.
We get a 10x slowdown of performance caused by 2PC under very heavy load on an IBM system with 256 cores.
On "normal" servers the slowdown caused by 2PC is smaller - about 2x.



Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Amit Kapila
Date:
On Tue, Nov 17, 2015 at 12:12 PM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Thank you for your response.
>
> On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:
>> I'm not entirely clear on what you're saying here.  I admit I've
>> not kept in close touch with the distributed processing discussions
>> lately -- is there a write-up and/or diagram to give an overview of
>> where we're at with this effort?
>>
>> If you are saying that DTM tries to roll back a transaction after
>> any participating server has entered the RecordTransactionCommit()
>> critical section, then IMO it is broken.  Full stop.  That can't
>> work with any reasonable semantics as far as I can see.
>
> DTM is not trying to rollback a committed transaction.
> What it tries to do is to hide this commit.
> As I already wrote, the idea was to implement "lightweight" 2PC, because the prepared transactions mechanism in PostgreSQL adds too much overhead and causes some problems with recovery.
>
> The transaction is normally committed in xlog, so that it can always be recovered in case of node fault.
> But before setting the corresponding bit(s) in CLOG and releasing locks, we first contact the arbiter to get the global status of the transaction.
> If it is successfully locally committed by all nodes, then the arbiter approves the commit and the commit of the transaction completes normally.
> Otherwise the arbiter rejects the commit. In this case DTM marks the transaction as aborted in CLOG and returns an error to the client.
> XLOG is not changed, and in case of failure PostgreSQL will try to replay this transaction.
> But during recovery it also tries to restore the transaction status in CLOG.
> And at this place DTM contacts the arbiter to learn the status of the transaction.

I think the general idea is that if the commit is WAL-logged, then the
operation is considered to be committed on the local node, and commit should
happen on any node only once prepare from all nodes is successful.
And after that the transaction is not supposed to abort.  But I think you are
trying to optimize the DTM in some way to not follow that kind of protocol.
By the way, how will the arbiter do the recovery in a scenario where it
crashes; won't it need to contact all nodes for the status of in-progress or
prepared transactions?
I think it would be better if a more detailed design of DTM with respect to
transaction management and recovery could be put on the wiki for having a
discussion on this topic.  I have seen that you have already updated many
details of the system, but still the complete picture of DTM is not clear.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Atri Sharma
Date:
<p dir="ltr"><br /> > I think the general idea is that if Commit is WAL logged, then the<br /> > operation is
consideredto committed on local node and commit should<br /> > happen on any node, only once prepare from all nodes
issuccessful.<br /> > And after that transaction is not supposed to abort.  But I think you are<br /> > trying to
optimizethe DTM in some way to not follow that kind of protocol.<br /> > By the way, how will arbiter does the
recoveryin a scenario where it<br /> > crashes, won't it need to contact all nodes for the status of in-progress
or<br/> > prepared transactions? <br /> > I think it would be better if more detailed design of DTM with respect
to<br/> > transaction management and recovery could be updated on wiki for having<br /> > discussion on this
topic. I have seen that you have already updated many<br /> > details of the system, but still the complete picture
ofDTM is not clear.<br /><p dir="ltr">I agree.<p dir="ltr">I have not been following this discussion but from what I
haveread above I think the recovery model in this design is broken. You have to follow some protocol, whichever you
choose.<pdir="ltr">I think you can try using something like Paxos,  if you are looking at a higher reliable model but
don'twant the overhead of 3PC. 

Re: Question concerning XTM (eXtensible Transaction Manager API)

From
konstantin knizhnik
Date:

On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:


> I think the general idea is that if the commit is WAL-logged, then the
> operation is considered to be committed on the local node, and commit should
> happen on any node only once prepare from all nodes is successful.
> And after that the transaction is not supposed to abort.  But I think you are
> trying to optimize the DTM in some way to not follow that kind of protocol.

DTM is still following the 2PC protocol:
first the transaction is saved in WAL at all nodes, and only after that is the commit completed at all nodes.
We try to avoid maintaining separate log files for 2PC (as is done now for prepared transactions)
and do not want to change the logic of work with WAL.

The DTM approach is based on the assumption that PostgreSQL CLOG and visibility rules allow us to "hide" a transaction even if it is committed in WAL.


> By the way, how will the arbiter do the recovery in a scenario where it
> crashes; won't it need to contact all nodes for the status of in-progress or
> prepared transactions?

The current answer is that the arbiter can not crash. To provide fault tolerance we spawn replicas of the arbiter, which are managed using the Raft protocol.
If the master crashes or the network is partitioned, then a new master is chosen.
PostgreSQL backends have a list of possible arbiter addresses. Once the connection with the arbiter is broken, a backend tries to reestablish the connection using the alternative addresses.
But only the master accepts incoming connections.
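
As a sketch, the backend-side failover could be as simple as cycling through the configured addresses until one accepts. ArbiterTryConnect() is a hypothetical helper; only the current Raft master will accept the connection:

    static int
    ArbiterReconnect(const char **addrs, int naddrs)
    {
        for (;;)
        {
            for (int i = 0; i < naddrs; i++)
            {
                int fd = ArbiterTryConnect(addrs[i]);
                if (fd >= 0)
                    return fd;  /* only the master accepts connections */
            }
            pg_usleep(100000L); /* back off briefly before the next pass */
        }
    }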


> I think it would be better if a more detailed design of DTM with respect to
> transaction management and recovery could be put on the wiki for having a
> discussion on this topic.  I have seen that you have already updated many
> details of the system, but still the complete picture of DTM is not clear.

I agree.
But please notice that pg_dtm is just one of the possible implementations of distributed transaction management.
We are also experimenting with other implementations, for example pg_tsdtm, which is based on timestamps. It doesn't require a central arbiter and so shows much better (almost linear) scalability.
But recovery in the case of pg_tsdtm is even more obscure.
Also, the performance of pg_tsdtm greatly depends on system clock synchronization and network delays. We get about 70k TPS on a cluster with 12 nodes connected by a 10Gbit network,
but when we run the same test on hosts located in different geographic regions (several thousand km apart), performance falls to 15 TPS.
 






Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Kevin Grittner
Date:
<div style="color:#000; background-color:#fff; font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida
Grande,sans-serif;font-size:16px">On Tuesday, November 17, 2015 12:43 AM, konstantin knizhnik
<k.knizhnik@postgrespro.ru>wrote:<br class="" id="yui_3_16_0_1_1447334196169_995702" />> On Nov 16, 2015, at
11:21PM, Kevin Grittner wrote:<br class="" id="yui_3_16_0_1_1447334196169_995704" /><br class=""
id="yui_3_16_0_1_1447334196169_995706"/>>> If you are saying that DTM tries to roll back a transaction after<br
class=""id="yui_3_16_0_1_1447334196169_995708" />>> any participating server has entered the
RecordTransactionCommit()<brclass="" id="yui_3_16_0_1_1447334196169_995710" />>> critical section, then IMO it is
broken. Full stop.  That can't<br class="" id="yui_3_16_0_1_1447334196169_995712" />>> work with any reasonable
semanticsas far as I can see.<br class="" id="yui_3_16_0_1_1447334196169_995714" />><br class=""
id="yui_3_16_0_1_1447334196169_995716"/>> DTM is not trying to rollback committed transaction.<br class=""
id="yui_3_16_0_1_1447334196169_995718"/>> What he tries to do is to hide this commit.<br class=""
id="yui_3_16_0_1_1447334196169_995720"/>> As I already wrote, the idea was to implement "lightweight" 2PC<br
class=""id="yui_3_16_0_1_1447334196169_995722" />> because prepared transactions mechanism in PostgreSQL adds too
much<brclass="" id="yui_3_16_0_1_1447334196169_995724" />> overhead and cause soe problems with recovery.<br
class=""id="yui_3_16_0_1_1447334196169_995726" /><br class="" id="yui_3_16_0_1_1447334196169_995728" />The point
remainsthat there must be *some* "point of no return"<br class="" id="yui_3_16_0_1_1447334196169_995730" />beyond which
rollback(or "hiding" is not possible).  Until this<br class="" id="yui_3_16_0_1_1447334196169_995732" />point, all
heavyweightlocks held by the transaction must be<br class="" id="yui_3_16_0_1_1447334196169_995734" />maintained
withoutinterruption, data modification of the<br class="" id="yui_3_16_0_1_1447334196169_995736" />transaction must not
bevisible, and any attempt to update or<br class="" id="yui_3_16_0_1_1447334196169_995738" />delete data updated or
deletedby the transaction must block or<br class="" id="yui_3_16_0_1_1447334196169_995740" />throw an error.  It sounds
likeyou are attempting to move the<br class="" id="yui_3_16_0_1_1447334196169_995742" />point at which this "point of
noreturn" is, but it isn't as clear<br class="" id="yui_3_16_0_1_1447334196169_995744" />as I would like.  It seems
likeall participating nodes are<br class="" id="yui_3_16_0_1_1447334196169_995746" />responsible for notifying the
arbiterthat they have completed, and<br class="" id="yui_3_16_0_1_1447334196169_995748" />until then the arbiter gets
involvedin every visibility check,<br class="" id="yui_3_16_0_1_1447334196169_995750" />overriding the "normal"
value?<brclass="" id="yui_3_16_0_1_1447334196169_995752" /><br class="" id="yui_3_16_0_1_1447334196169_995754" />>
Thetransaction is normally committed in xlog, so that it can<br class="" id="yui_3_16_0_1_1447334196169_995756" />>
alwaysbe recovered in case of node fault.<br class="" id="yui_3_16_0_1_1447334196169_995758" />> But before setting
correspondentbit(s) in CLOG and releasing<br class="" id="yui_3_16_0_1_1447334196169_995760" />> locks we first
contactarbiter to get global status of transaction.<br class="" id="yui_3_16_0_1_1447334196169_995762" />> If it is
successfullylocally committed by all nodes, then<br class="" id="yui_3_16_0_1_1447334196169_995764" />> arbiter
approvescommit and commit of transaction normally<br class="" id="yui_3_16_0_1_1447334196169_995766" />>
completed.<brclass="" id="yui_3_16_0_1_1447334196169_995768" />> Otherwise arbiter rejects commit. In this case DTM
marks<brclass="" id="yui_3_16_0_1_1447334196169_995770" />> transaction as aborted in CLOG and returns error to the
client.<brclass="" id="yui_3_16_0_1_1447334196169_995772" />> XLOG is not changed and in case of failure PostgreSQL
willtry to<br class="" id="yui_3_16_0_1_1447334196169_995774" />> replay this transaction.<br class=""
id="yui_3_16_0_1_1447334196169_995776"/>> But during recovery it also tries to restore transaction status<br
class=""id="yui_3_16_0_1_1447334196169_995778" />> in CLOG.<br class="" id="yui_3_16_0_1_1447334196169_995780"
/>>And at this placeDTM contacts arbiter to know status of<br class="" id="yui_3_16_0_1_1447334196169_995782" />>
transaction.<brclass="" id="yui_3_16_0_1_1447334196169_995784" />> If it is marked as aborted in arbiter's CLOG,
thenit wiull be<br class="" id="yui_3_16_0_1_1447334196169_995786" />> also marked as aborted in local CLOG.<br
class=""id="yui_3_16_0_1_1447334196169_995788" />> And according to PostgreSQL visibility rules no other
transaction<brclass="" id="yui_3_16_0_1_1447334196169_995790" />> will see changes made by this transaction.<br
class=""id="yui_3_16_0_1_1447334196169_995792" /><br class="" id="yui_3_16_0_1_1447334196169_995794" />If a node goes
throughcrash and recovery after it has written its<br class="" id="yui_3_16_0_1_1447334196169_995796" />commit
informationto xlog, how are its heavyweight locks, etc.,<br class="" id="yui_3_16_0_1_1447334196169_995798"
/>maintainedthroughout?  For example, does each arbiter node have<br class="" id="yui_3_16_0_1_1447334196169_995800"
/>thecomplete set of heavyweight locks?  (Basically, all the<br class="" id="yui_3_16_0_1_1447334196169_995802"
/>informationwhich can be written to files in pg_twophase must be<br class="" id="yui_3_16_0_1_1447334196169_995804"
/>heldsomewhere by all arbiter nodes, and used where appropriate.)<br class="" id="yui_3_16_0_1_1447334196169_995806"
/><brclass="" id="yui_3_16_0_1_1447334196169_995808" />If a participating node is lost after some other nodes have
told<brclass="" id="yui_3_16_0_1_1447334196169_995810" />the arbiter that they have committed, and the lost node will
never<brclass="" id="yui_3_16_0_1_1447334196169_995812" />be able to indicate that it is committed or rolled back, what
is<brclass="" id="yui_3_16_0_1_1447334196169_995814" />the mechanism for resolving that?<br class=""
id="yui_3_16_0_1_1447334196169_995816"/><br class="" id="yui_3_16_0_1_1447334196169_995818" />>>> We can not
justcall elog(ERROR,...) in SetTransactionStatus<br class="" id="yui_3_16_0_1_1447334196169_995820" />>>>
implementationbecause inside critical section it cause Postgres<br class="" id="yui_3_16_0_1_1447334196169_995822"
/>>>>crash with panic message. So we have to remember that transaction is<br class=""
id="yui_3_16_0_1_1447334196169_995824"/>>>> rejected and report error later after exit from critical
section:<brclass="" id="yui_3_16_0_1_1447334196169_995826" />>><br class=""
id="yui_3_16_0_1_1447334196169_995828"/>>> I don't believe that is a good plan.  You should not enter the<br
class=""id="yui_3_16_0_1_1447334196169_995830" />>> critical section for recording that a commit is complete
untilall<br class="" id="yui_3_16_0_1_1447334196169_995832" />>> the work for the commit is done except for
tellingthe all the<br class="" id="yui_3_16_0_1_1447334196169_995834" />>> servers that all servers are ready.<br
class=""id="yui_3_16_0_1_1447334196169_995836" />><br class="" id="yui_3_16_0_1_1447334196169_995838" />> It is
goodpoint.<br class="" id="yui_3_16_0_1_1447334196169_995840" />> May be it is the reason of performance scalability
problemswe<br class="" id="yui_3_16_0_1_1447334196169_995842" />> have noticed with DTM.<br class=""
id="yui_3_16_0_1_1447334196169_995844"/><br class="" id="yui_3_16_0_1_1447334196169_995846" />Well, certainly the first
phaseof two-phase commit can take place<br class="" id="yui_3_16_0_1_1447334196169_995848" />in parallel, and once that
iscomplete then the second phase<br class="" id="yui_3_16_0_1_1447334196169_995850" />(commit or rollback of all the
participatingprepared transactions)<br class="" id="yui_3_16_0_1_1447334196169_995852" />can take place in parallel. 
Thereis no need to serialize that.<br class="" id="yui_3_16_0_1_1447334196169_995854" /><br class=""
id="yui_3_16_0_1_1447334196169_995856"/>> Sorry, some clarification.<br class=""
id="yui_3_16_0_1_1447334196169_995858"/>> We get 10x slowdown of performance caused by 2pc on very heavy<br class=""
id="yui_3_16_0_1_1447334196169_995860"/>> load on the IBM system with 256 cores.<br class=""
id="yui_3_16_0_1_1447334196169_995862"/>> At "normal" servers slowdown of 2pc is smaller - about 2x.<br class=""
id="yui_3_16_0_1_1447334196169_995864"/><br class="" id="yui_3_16_0_1_1447334196169_995866" />That suggests some
contentionpoint, probably on spinlocks.  Were<br class="" id="yui_3_16_0_1_1447334196169_995868" />you able to identify
theparticular hot spot(s)?<br class="" id="yui_3_16_0_1_1447334196169_995870" /><br class=""
id="yui_3_16_0_1_1447334196169_995872"/><br class="" id="yui_3_16_0_1_1447334196169_995874" />On Tuesday, November 17,
20153:09 AM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:<br class=""
id="yui_3_16_0_1_1447334196169_995876"/>> On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:<br class=""
id="yui_3_16_0_1_1447334196169_995878"/><br class="" id="yui_3_16_0_1_1447334196169_995880" />>> I think the
generalidea is that if Commit is WAL logged, then the<br class="" id="yui_3_16_0_1_1447334196169_995882" />>>
operationis considered to committed on local node and commit should<br class="" id="yui_3_16_0_1_1447334196169_995884"
/>>>happen on any node, only once prepare from all nodes is successful.<br class=""
id="yui_3_16_0_1_1447334196169_995886"/>>> And after that transaction is not supposed to abort.  But I think you
are<brclass="" id="yui_3_16_0_1_1447334196169_995888" />>> trying to optimize the DTM in some way to not follow
thatkind of protocol.<br class="" id="yui_3_16_0_1_1447334196169_995890" />><br class=""
id="yui_3_16_0_1_1447334196169_995892"/>> DTM is still following 2PC protocol:<br class=""
id="yui_3_16_0_1_1447334196169_995894"/>> First transaction is saved in WAL at all nodes and only after it<br
class=""id="yui_3_16_0_1_1447334196169_995896" />> commit is completed at all nodes.<br class=""
id="yui_3_16_0_1_1447334196169_995898"/><br class="" id="yui_3_16_0_1_1447334196169_995900" />So, essentially you are
treatingthe traditional commit point as<br class="" id="yui_3_16_0_1_1447334196169_995902" />phase 1 in a new approach
totwo-phase commit, and adding another<br class="" id="yui_3_16_0_1_1447334196169_995904" />layer to override normal
visibilitychecking and record locks<br class="" id="yui_3_16_0_1_1447334196169_995906" />(etc.) past that point?<br
class=""id="yui_3_16_0_1_1447334196169_995908" /><br class="" id="yui_3_16_0_1_1447334196169_995910" />> We try to
avoidmaintaining of separate log files for 2PC (as now<br class="" id="yui_3_16_0_1_1447334196169_995912" />> for
preparedtransactions) and do not want to change logic of<br class="" id="yui_3_16_0_1_1447334196169_995914" />> work
withWAL.<br class="" id="yui_3_16_0_1_1447334196169_995916" />><br class="" id="yui_3_16_0_1_1447334196169_995918"
/>>DTM approach is based on the assumption that PostgreSQL CLOG and<br class=""
id="yui_3_16_0_1_1447334196169_995920"/>> visibility rules allows to "hide" transaction even if it is<br class=""
id="yui_3_16_0_1_1447334196169_995922"/>> committed in WAL.<br class="" id="yui_3_16_0_1_1447334196169_995924" /><br
class=""id="yui_3_16_0_1_1447334196169_995926" />I see where you could get a performance benefit from not recording<br
class=""id="yui_3_16_0_1_1447334196169_995928" />(and cleaning up) persistent state for a transaction in the<br
class=""id="yui_3_16_0_1_1447334196169_995930" />pg_twophase directory between the time the transaction is prepared<br
class=""id="yui_3_16_0_1_1447334196169_995932" />and when it is committed (which should normally be a very short<br
class=""id="yui_3_16_0_1_1447334196169_995934" />period of time, but must survive crashes, and communication<br
class=""id="yui_3_16_0_1_1447334196169_995936" />failures).  Essentially you are trying to keep that in RAM instead,<br
class=""id="yui_3_16_0_1_1447334196169_995938" />and counting on multiple processes at different locations<br class=""
id="yui_3_16_0_1_1447334196169_995940"/>redundantly (and synchronously) storing this data to ensure<br class=""
id="yui_3_16_0_1_1447334196169_995942"/>persistence, rather than writing the data to disk files when are<br class=""
id="yui_3_16_0_1_1447334196169_995944"/>deleted as soon as the prepared transaction is committed or rolled<br class=""
id="yui_3_16_0_1_1447334196169_995946"/>back.<br class="" id="yui_3_16_0_1_1447334196169_995948" /><br class=""
id="yui_3_16_0_1_1447334196169_995950"/>I wonder whether it might not be safer to just do that -- rather<br class=""
id="yui_3_16_0_1_1447334196169_995952"/>than trying to develop a whole new way of implementing two-phase<br class=""
id="yui_3_16_0_1_1447334196169_995954"/>commit, just come up with a new way to persist the information<br class=""
id="yui_3_16_0_1_1447334196169_995956"/>which must survive between the prepared and the later commit or<br class=""
id="yui_3_16_0_1_1447334196169_995958"/>rollback of the prepared transaction.  Essentially, provide hooks<br class=""
id="yui_3_16_0_1_1447334196169_995960"/>for persisting the data when preparing a transaction, and the<br class=""
id="yui_3_16_0_1_1447334196169_995962"/>arbiter would set the hooks to a function to send the data there.<br class=""
id="yui_3_16_0_1_1447334196169_995964"/>Likewise with the release of the information (normally a very small<br class=""
id="yui_3_16_0_1_1447334196169_995966"/>fraction of a second later).  The rest of the arbiter code becomes<br class=""
id="yui_3_16_0_1_1447334196169_995968"/>a distributed transaction manager.  It's not a trivial job to get<br class=""
id="yui_3_16_0_1_1447334196169_995970"/>that right, but at least it is a very well-understood problem, and<br class=""
id="yui_3_16_0_1_1447334196169_995972"/>is not likely to take as long to develop and shake out tricky<br class=""
id="yui_3_16_0_1_1447334196169_995974"/>data-eating bugs.<br class="" id="yui_3_16_0_1_1447334196169_995976" /><br
class=""id="yui_3_16_0_1_1447334196169_995978" />--<br class="" id="yui_3_16_0_1_1447334196169_995980" />Kevin
Grittner<brclass="" id="yui_3_16_0_1_1447334196169_995982" />EDB: http://www.enterprisedb.com<br class=""
id="yui_3_16_0_1_1447334196169_995984"/>The Enterprise PostgreSQL Company<br class=""
id="yui_3_16_0_1_1447334196169_995986"/><div dir="ltr"><br /></div></div> 

Re: Question concerning XTM (eXtensible Transaction Manager API)

From
Alvaro Herrera
Date:
konstantin knizhnik wrote:

> The transaction is normally committed in xlog, so that it can always be recovered in case of node fault.
> But before setting the corresponding bit(s) in CLOG and releasing locks, we first contact the arbiter to get the global status of the transaction.
> If it is successfully locally committed by all nodes, then the arbiter approves the commit and the commit of the transaction completes normally.
> Otherwise the arbiter rejects the commit. In this case DTM marks the transaction as aborted in CLOG and returns an error to the client.
> XLOG is not changed, and in case of failure PostgreSQL will try to replay this transaction.
> But during recovery it also tries to restore the transaction status in CLOG.
> And at this place DTM contacts the arbiter to learn the status of the transaction.
> If it is marked as aborted in the arbiter's CLOG, then it will also be marked as aborted in the local CLOG.
> And according to PostgreSQL visibility rules, no other transaction will see changes made by this transaction.

One problem I see with this approach is that the WAL replay can happen
long after it was written; for instance you might have saved a
basebackup and WAL stream and replay it all several days or weeks later,
when the arbiter no longer has information about the XID.  Later
transactions might (will) depend on the aborted state of the transaction
in question, so this effectively corrupts the database.

In other words, while it's reasonable to require that the arbiter can
always be contacted for transaction commit/abort at run time, it's
not reasonable to contact the arbiter during WAL replay.

I think this merits more explanation:

> The transaction is normally committed in xlog, so that it can always be recovered in case of node fault.

Why would anyone want to "recover" a transaction that was aborted?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services