Thread: Question concerning XTM (eXtensible Transaction Manager API)
Hello,

Some time ago at PGConf in Vienna we proposed the eXtensible Transaction Manager API (XTM).
The idea is to make it possible to provide custom implementations of transaction managers as standard Postgres extensions;
the primary goal is an implementation of a distributed transaction manager.
It should not only support 2PC, but also provide consistent snapshots for global transactions executed at different nodes.

Actually, the current version of the XTM API does not propose any particular 2PC model. It can be implemented either at the coordinator side
(as is done in our pg_tsdtm (https://github.com/postgrespro/pg_tsdtm) implementation, which is based on timestamps and does not require a centralized arbiter), or by an arbiter
(pg_dtm, https://github.com/postgrespro/pg_dtm). In the latter case the 2PC logic is hidden under the XTM SetTransactionStatus method:

    bool (*SetTransactionStatus)(TransactionId xid, int nsubxids, TransactionId *subxids, XidStatus status, XLogRecPtr lsn);

which encapsulates TransactionIdSetTreeStatus in clog.c.
But you may notice that the original TransactionIdSetTreeStatus function is void - it is not intended to return anything.
It is called in RecordTransactionCommit in a critical section where it is not expected that the commit may fail.
But in the case of DTM, the transaction may be rejected by the arbiter. The XTM API allows us to control access to the CLOG, so everybody will see that the transaction is aborted. But we in any case have to somehow notify the client about the abort of the transaction.

We can not just call elog(ERROR,...) in the SetTransactionStatus implementation, because inside a critical section it causes a Postgres crash with a panic message. So we have to remember that the transaction is rejected and report the error later, after exit from the critical section:

    /*
     * Now we may update the CLOG, if we wrote a COMMIT record above
     */
    if (markXidCommitted)
    {
        committed = TransactionIdCommitTree(xid, nchildren, children);
    }
    ...
    /*
     * If we entered a commit critical section, leave it now, and let
     * checkpoints proceed.
     */
    if (markXidCommitted)
    {
        MyPgXact->delayChkpt = false;
        END_CRIT_SECTION();
        if (!committed)
        {
            CurrentTransactionState->state = TRANS_ABORT;
            CurrentTransactionState->blockState = TBLOCK_ABORT_PENDING;
            elog(ERROR, "Transaction commit rejected by XTM");
        }
    }

There is one more problem - at this moment the state of the transaction is TRANS_COMMIT.
If the ERROR handler tries to abort it, then we get yet another fatal error: attempt to rollback a committed transaction.
So we need to hide the fact that the transaction is actually committed in the local XLOG.

This approach works but looks a little bit like a hack. It requires not only replacing the direct call of TransactionIdSetTreeStatus with an indirect one (through the XTM API), but also making some non-obvious changes in RecordTransactionCommit.

So what are the alternatives?

1. Move RecordTransactionCommit to XTM. In this case we have to copy the original RecordTransactionCommit to the DTM implementation and patch it there. This is also not nice, because it will complicate maintenance of the DTM implementation.
The primary idea of XTM is to allow development of a DTM as a standard PostgreSQL extension without creating specific clones of the main PostgreSQL source tree. But this idea will be compromised if we have to copy&paste some pieces of PostgreSQL code.
In some sense it is even worse than maintaining a separate branch - in the latter case at least we have some way to perform an automatic merge.

2. Propose some alternative two-phase commit implementation in the PostgreSQL core. The main motivation for such a "lightweight" implementation of 2PC in pg_dtm is that the original mechanism of prepared transactions in PostgreSQL adds too much overhead.
In our benchmarks we have found that a simple credit-debit banking test (without any DTM) works almost 10 times slower with PostgreSQL 2PC than without it. This is why we are trying to propose an alternative solution (right now pg_dtm is 2 times slower than vanilla PostgreSQL, but it not only performs 2PC but also provides consistent snapshots).

Maybe somebody can suggest some other solution?
Or give some comments concerning the current approach?

Thanks in advance,
Konstantin,
Postgres Professional
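For concreteness, here is a minimal sketch of how a DTM extension might implement the SetTransactionStatus callback described above. DtmAskArbiter() is an invented helper (this is not actual pg_dtm code) that reports the local vote and blocks until the arbiter's global verdict arrives:

    #include "postgres.h"
    #include "access/clog.h"

    /* Invented helper: reports the local commit vote to the arbiter and
     * blocks until the global verdict arrives (true = commit). */
    extern bool DtmAskArbiter(TransactionId xid);

    static bool
    DtmSetTransactionStatus(TransactionId xid, int nsubxids,
                            TransactionId *subxids, XidStatus status,
                            XLogRecPtr lsn)
    {
        if (status == TRANSACTION_STATUS_COMMITTED && !DtmAskArbiter(xid))
            status = TRANSACTION_STATUS_ABORTED;    /* globally rejected */

        /* Update the local CLOG with the final verdict */
        TransactionIdSetTreeStatus(xid, nsubxids, subxids, status, lsn);

        /* Tell RecordTransactionCommit whether the commit survived */
        return status == TRANSACTION_STATUS_COMMITTED;
    }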
Konstantin Knizhnik wrote:

> But you may notice that the original TransactionIdSetTreeStatus
> function is void - it is not intended to return anything.
> It is called in RecordTransactionCommit in a critical section where it
> is not expected that the commit may fail.
> But in the case of DTM, the transaction may be rejected by the
> arbiter.  The XTM API allows us to control access to the CLOG, so
> everybody will see that the transaction is aborted.  But we in any
> case have to somehow notify the client about the abort of the
> transaction.

I think you'll need to rethink how a transaction commits rather completely, rather than consider localized tweaks to specific functions.  For one thing, the WAL record about transaction commit has already been written by XactLogCommitRecord much earlier than calling TransactionIdCommitTree.  So if you were to crash at that point, it doesn't matter how much the arbiter has rejected the transaction, WAL replay would mark it as committed.

Also, what about the replication origin stuff and the TransactionTreeSetCommitTsData() call?  I think you need to involve the arbiter earlier, so that the commit process can be aborted earlier than those things.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
<div class="moz-cite-prefix">On 11/16/2015 10:54 PM, Alvaro Herrera wrote:<br /></div><blockquote cite="mid:20151116195453.GC614468@alvherre.pgsql"type="cite"><pre wrap="">Konstantin Knizhnik wrote: </pre><blockquote type="cite"><pre wrap="">But you may notice that original TransactionIdSetTreeStatus function is void - it is not intended to return anything. It is called in RecordTransactionCommit in critical section where it is not expected that commit may fail. But in case of DTM transaction may be rejected by arbiter. XTM API allows to control access to CLOG, so everybody will see that transaction is aborted. But we in any case have to somehow notify client about abort of transaction. </pre></blockquote><pre wrap=""> I think you'll need to rethink how a transaction commits rather completely, rather than consider localized tweaks to specific functions. For one thing, the WAL record about transaction commit has already been written by XactLogCommitRecord much earlier than calling TransactionIdCommitTree. So if you were to crash at that point, it doesn't matter how much the arbiter has rejected the transaction, WAL replay would mark it as committed. </pre></blockquote><br /> Yes, WAL replay will recover this transaction and try to markit in CLOG as completed, but ... we have <font face="Arial">caught </font>control over CLOG using XTM.<br /> And insteadof direct writing to CLOG, DTM will contact arbiter and ask his opinion concerning this transaction. <br /> If arbiterdoesn't think that it was committed, then it will not be marked as committed in local CLOG.<br /><br /><br /><blockquotecite="mid:20151116195453.GC614468@alvherre.pgsql" type="cite"><pre wrap=""> Also, what about the replication origin stuff and the TransactionTreeSetCommitTsData() call? I think you need to involve the arbiter earlier, so that the commit process can be aborted earlier than those things. </pre></blockquote><br />
On Monday, November 16, 2015 2:47 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

> Some time ago at PGConf in Vienna we proposed the eXtensible
> Transaction Manager API (XTM).
> The idea is to make it possible to provide custom implementations of
> transaction managers as standard Postgres extensions;
> the primary goal is an implementation of a distributed transaction
> manager.
> It should not only support 2PC, but also provide consistent
> snapshots for global transactions executed at different nodes.
>
> Actually, the current version of the XTM API does not propose any
> particular 2PC model.  It can be implemented either at the
> coordinator side (as is done in our pg_tsdtm implementation, which is
> based on timestamps and does not require a centralized arbiter), or
> by an arbiter (pg_dtm).

I'm not entirely clear on what you're saying here.  I admit I've not kept in close touch with the distributed processing discussions lately -- is there a write-up and/or diagram to give an overview of where we're at with this effort?

> In the latter case the 2PC logic is hidden under the XTM
> SetTransactionStatus method:
>
> bool (*SetTransactionStatus)(TransactionId xid, int nsubxids,
> TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
>
> which encapsulates TransactionIdSetTreeStatus in clog.c.
> But you may notice that the original TransactionIdSetTreeStatus
> function is void - it is not intended to return anything.
> It is called in RecordTransactionCommit in a critical section where
> it is not expected that the commit may fail.

This issue, though, seems clear enough.  At some point a transaction must cross a hard line between when it is not committed and when it is, since after commit subsequent transactions can then see the data and modify it.  There has to be some "point of no return" in order to have any sane semantics.  Entering that critical section is it.

> But in the case of DTM, the transaction may be rejected by the
> arbiter.  The XTM API allows us to control access to the CLOG, so
> everybody will see that the transaction is aborted.  But we in any
> case have to somehow notify the client about the abort of the
> transaction.

If you are saying that DTM tries to roll back a transaction after any participating server has entered the RecordTransactionCommit() critical section, then IMO it is broken.  Full stop.  That can't work with any reasonable semantics as far as I can see.

> We can not just call elog(ERROR,...) in the SetTransactionStatus
> implementation, because inside a critical section it causes a
> Postgres crash with a panic message.  So we have to remember that the
> transaction is rejected and report the error later, after exit from
> the critical section.

I don't believe that is a good plan.  You should not enter the critical section for recording that a commit is complete until all the work for the commit is done except for telling all the servers that all servers are ready.

> There is one more problem - at this moment the state of the
> transaction is TRANS_COMMIT.
> If the ERROR handler tries to abort it, then we get yet another fatal
> error: attempt to rollback a committed transaction.
> So we need to hide the fact that the transaction is actually
> committed in the local XLOG.

That is one of pretty much an infinite variety of problems you have if you don't have a "hard line" for when the transaction is finally committed.

> This approach works but looks a little bit like a hack.  It requires
> not only replacing the direct call of TransactionIdSetTreeStatus with
> an indirect one (through the XTM API), but also making some
> non-obvious changes in RecordTransactionCommit.
> So what are the alternatives?
>
> 1. Move RecordTransactionCommit to XTM.  In this case we have to copy
> the original RecordTransactionCommit to the DTM implementation and
> patch it there.  This is also not nice, because it will complicate
> maintenance of the DTM implementation.
> The primary idea of XTM is to allow development of a DTM as a
> standard PostgreSQL extension without creating specific clones of the
> main PostgreSQL source tree.  But this idea will be compromised if we
> have to copy&paste some pieces of PostgreSQL code.
> In some sense it is even worse than maintaining a separate branch -
> in the latter case at least we have some way to perform an automatic
> merge.

You can have a call in XTM that says you want to record the commit on all participating servers, but I don't see where that would involve moving anything we have now out of each participating server -- it would just need to function like a real, professional-quality distributed transaction manager doing the second phase of a two-phase commit.  If any participating server goes through the first phase and reports that all the heavy lifting is done, and then is swallowed up in a pyroclastic flow of an erupting volcano before phase 2 comes around, the DTM must periodically retry until the administrator cancels the attempt.

> 2. Propose some alternative two-phase commit implementation in the
> PostgreSQL core.  The main motivation for such a "lightweight"
> implementation of 2PC in pg_dtm is that the original mechanism of
> prepared transactions in PostgreSQL adds too much overhead.
> In our benchmarks we have found that a simple credit-debit banking
> test (without any DTM) works almost 10 times slower with PostgreSQL
> 2PC than without it.  This is why we are trying to propose an
> alternative solution (right now pg_dtm is 2 times slower than vanilla
> PostgreSQL, but it not only performs 2PC but also provides consistent
> snapshots).

Are you talking about 10x the latency on a commit, or that the overall throughput under saturation load is one tenth of running without something to guarantee the transactional integrity of the whole set of nodes?  The former would not be too surprising, while the latter would be rather amazing.

One interesting coincidence is that I've been told off-list about a large company which developed their own internal distributed version of PostgreSQL in which they implemented Serializable Snapshot Isolation, and reported a 10x performance impact.  They were reportedly fine with that, since the integrity of their data was worth it to them and it performed better than the alternatives; but the number matching like that does make me wonder how hard it would be to break that barrier in terms of latency, if you really preserve ACID properties across a distributed system.  Of course, latency and throughput are two completely different things.

> Maybe somebody can suggest some other solution?

Well, it seems like it might be possible for some short cuts to be taken when the part of the distributed transaction run on a particular server was read-only.  I'm not sure of the details, but it seems conceptually possible to minimize network latency issues.

> Or give some comments concerning the current approach?

I don't see how the DTM approach described could provide anything resembling ACID guarantees; but if you want to give up all claim to that, you might be able to provide something useful with a more lax set of assurances.  (After all, there are some popular products out there which are pretty "relaxed" about such things.)
In that case maybe the XTM interface could include some way to query what guarantees an implementation does provide.  I don't mean this sarcastically -- there are use cases for different levels of guarantee here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
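One conceivable shape for such a guarantees query, with invented flag and callback names -- purely illustrative, not part of the proposed XTM API:

    /* Purely illustrative, not part of the proposed XTM API: a TM
     * implementation would OR together the guarantees it provides. */
    #define XTM_GUARANTEE_ATOMIC_COMMIT     0x01    /* all-or-nothing across nodes */
    #define XTM_GUARANTEE_GLOBAL_SNAPSHOTS  0x02    /* consistent global snapshots */
    #define XTM_GUARANTEE_DURABLE_2PC       0x04    /* 2PC state survives crashes  */

    typedef struct
    {
        /* ... the existing XTM callbacks, e.g. SetTransactionStatus ... */

        /* returns a bitmask of XTM_GUARANTEE_* flags */
        int     (*GetGuarantees) (void);
    } TransactionManagerWithGuarantees;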
Thank you for your response.
On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:
> I'm not entirely clear on what you're saying here.  I admit I've
> not kept in close touch with the distributed processing discussions
> lately -- is there a write-up and/or diagram to give an overview of
> where we're at with this effort?
> If you are saying that DTM tries to roll back a transaction after
> any participating server has entered the RecordTransactionCommit()
> critical section, then IMO it is broken.  Full stop.  That can't
> work with any reasonable semantics as far as I can see.
DTM is not trying to roll back a committed transaction.
What it tries to do is to hide this commit.
As I already wrote, the idea was to implement "lightweight" 2PC because the prepared transactions mechanism in PostgreSQL adds too much overhead and causes some problems with recovery.
The transaction is normally committed in the xlog, so that it can always be recovered in case of a node fault.
But before setting the corresponding bit(s) in the CLOG and releasing locks, we first contact the arbiter to get the global status of the transaction.
If it was successfully locally committed by all nodes, then the arbiter approves the commit and the commit completes normally.
Otherwise the arbiter rejects the commit. In this case DTM marks the transaction as aborted in the CLOG and returns an error to the client.
The XLOG is not changed, and in case of failure PostgreSQL will try to replay this transaction.
But during recovery it also tries to restore the transaction status in the CLOG.
And at this place DTM contacts the arbiter to learn the status of the transaction.
If it is marked as aborted in the arbiter's CLOG, then it will also be marked as aborted in the local CLOG.
And according to PostgreSQL visibility rules, no other transaction will see changes made by this transaction.
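Roughly, the recovery-time interception looks like the simplified sketch below (not the actual pg_dtm code; the shape mirrors TransactionIdGetStatus() in clog.c, and DtmGetGlobalStatus() is an invented stand-in for the arbiter round trip):

    #include "postgres.h"
    #include "access/clog.h"

    /* Invented helper: asks the arbiter for its verdict on xid; returns
     * -1 if the arbiter has no record of this transaction. */
    extern int DtmGetGlobalStatus(TransactionId xid);

    /* Consulted instead of a direct CLOG lookup, including during WAL
     * replay, so the arbiter's verdict overrides the local XLOG. */
    static XidStatus
    DtmGetTransactionStatus(TransactionId xid, XLogRecPtr *lsn)
    {
        int     verdict = DtmGetGlobalStatus(xid);

        if (verdict >= 0)
            return (XidStatus) verdict;     /* arbiter knows this xid */

        return TransactionIdGetStatus(xid, lsn);    /* local CLOG */
    }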
>> We can not just call elog(ERROR,...) in the SetTransactionStatus
>> implementation, because inside a critical section it causes a
>> Postgres crash with a panic message.  So we have to remember that
>> the transaction is rejected and report the error later, after exit
>> from the critical section.
> I don't believe that is a good plan.  You should not enter the
> critical section for recording that a commit is complete until all
> the work for the commit is done except for telling all the servers
> that all servers are ready.
It is a good point.
Maybe it is the reason for the performance scalability problems we have noticed with DTM.
>> In our benchmarks we have found that a simple credit-debit banking
>> test (without any DTM) works almost 10 times slower with PostgreSQL
>> 2PC than without it.  This is why we are trying to propose an
>> alternative solution (right now pg_dtm is 2 times slower than
>> vanilla PostgreSQL, but it not only performs 2PC but also provides
>> consistent snapshots).
> Are you talking about 10x the latency on a commit, or that the
> overall throughput under saturation load is one tenth of running
> without something to guarantee the transactional integrity of the
> whole set of nodes?  The former would not be too surprising, while
> the latter would be rather amazing.
Sorry, some clarification.
We got a 10x slowdown caused by 2PC under very heavy load on an IBM system with 256 cores.
On "normal" servers the slowdown from 2PC is smaller - about 2x.
On Tue, Nov 17, 2015 at 12:12 PM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Thank you for your response.
>
> On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:
>
>> I'm not entirely clear on what you're saying here.  I admit I've
>> not kept in close touch with the distributed processing discussions
>> lately -- is there a write-up and/or diagram to give an overview of
>> where we're at with this effort?
>>
>> If you are saying that DTM tries to roll back a transaction after
>> any participating server has entered the RecordTransactionCommit()
>> critical section, then IMO it is broken.  Full stop.  That can't
>> work with any reasonable semantics as far as I can see.
>
> DTM is not trying to roll back a committed transaction.
> What it tries to do is to hide this commit.
> As I already wrote, the idea was to implement "lightweight" 2PC
> because the prepared transactions mechanism in PostgreSQL adds too
> much overhead and causes some problems with recovery.
> The transaction is normally committed in the xlog, so that it can
> always be recovered in case of a node fault.
> But before setting the corresponding bit(s) in the CLOG and releasing
> locks, we first contact the arbiter to get the global status of the
> transaction.
> If it was successfully locally committed by all nodes, then the
> arbiter approves the commit and the commit completes normally.
> Otherwise the arbiter rejects the commit.  In this case DTM marks the
> transaction as aborted in the CLOG and returns an error to the
> client.
> The XLOG is not changed, and in case of failure PostgreSQL will try
> to replay this transaction.
> But during recovery it also tries to restore the transaction status
> in the CLOG.
> And at this place DTM contacts the arbiter to learn the status of the
> transaction.
I think the general idea is that if the commit is WAL-logged, then the
operation is considered to be committed on the local node, and the
commit should happen on any node only once the prepare from all nodes
is successful. And after that the transaction is not supposed to
abort. But I think you are trying to optimize the DTM in some way to
not follow that kind of protocol.

By the way, how will the arbiter do recovery in a scenario where it
crashes? Won't it need to contact all nodes for the status of
in-progress or prepared transactions?

I think it would be better if a more detailed design of DTM with
respect to transaction management and recovery could be put up on the
wiki for discussion on this topic. I have seen that you have already
documented many details of the system, but the complete picture of DTM
is still not clear.
<p dir="ltr"><br /> > I think the general idea is that if Commit is WAL logged, then the<br /> > operation is consideredto committed on local node and commit should<br /> > happen on any node, only once prepare from all nodes issuccessful.<br /> > And after that transaction is not supposed to abort. But I think you are<br /> > trying to optimizethe DTM in some way to not follow that kind of protocol.<br /> > By the way, how will arbiter does the recoveryin a scenario where it<br /> > crashes, won't it need to contact all nodes for the status of in-progress or<br/> > prepared transactions? <br /> > I think it would be better if more detailed design of DTM with respect to<br/> > transaction management and recovery could be updated on wiki for having<br /> > discussion on this topic. I have seen that you have already updated many<br /> > details of the system, but still the complete picture ofDTM is not clear.<br /><p dir="ltr">I agree.<p dir="ltr">I have not been following this discussion but from what I haveread above I think the recovery model in this design is broken. You have to follow some protocol, whichever you choose.<pdir="ltr">I think you can try using something like Paxos, if you are looking at a higher reliable model but don'twant the overhead of 3PC.
On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:
> I think the general idea is that if the commit is WAL-logged, then
> the operation is considered to be committed on the local node, and
> the commit should happen on any node only once the prepare from all
> nodes is successful.  And after that the transaction is not supposed
> to abort.  But I think you are trying to optimize the DTM in some way
> to not follow that kind of protocol.
DTM is still following the 2PC protocol:
First the transaction is saved in the WAL at all nodes, and only after that is the commit completed at all nodes.
We try to avoid maintaining separate log files for 2PC (as is done now for prepared transactions)
and do not want to change the logic of working with the WAL.
The DTM approach is based on the assumption that the PostgreSQL CLOG and visibility rules allow us to "hide" a transaction even if it is committed in the WAL.
> By the way, how will the arbiter do recovery in a scenario where it
> crashes?  Won't it need to contact all nodes for the status of
> in-progress or prepared transactions?
The current answer is that the arbiter cannot crash. To provide fault tolerance we spawn replicas of the arbiter, which are managed using the Raft protocol.
If the master crashes or the network is partitioned, then a new master is chosen.
PostgreSQL backends have a list of possible arbiter addresses. Once the connection with the arbiter is broken, a backend tries to reestablish the connection using the alternative addresses.
But only the master accepts incoming connections.
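Schematically, the backend-side failover looks something like the sketch below (simplified and illustrative only; ArbiterConnect() and the retry policy are invented for the example):

    #include "postgres.h"
    #include "miscadmin.h"      /* for pg_usleep() */

    /* Invented helper: tries one arbiter address, returning a socket
     * descriptor on success or -1 (Raft followers refuse connections). */
    extern int ArbiterConnect(const char *addr);

    /* Cycle through the configured addresses until the current Raft
     * master accepts the connection.  A real implementation would bound
     * the retries and honor interrupts. */
    static int
    ArbiterReconnect(const char **addrs, int naddrs)
    {
        for (;;)
        {
            int     i;

            for (i = 0; i < naddrs; i++)
            {
                int     fd = ArbiterConnect(addrs[i]);

                if (fd >= 0)
                    return fd;      /* found the master */
            }
            pg_usleep(100000);      /* 100 ms pause before the next sweep */
        }
    }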
> I think it would be better if a more detailed design of DTM with
> respect to transaction management and recovery could be put up on the
> wiki for discussion on this topic.  I have seen that you have already
> documented many details of the system, but the complete picture of
> DTM is still not clear.
I agree.
But please notice that pg_dtm is just one of the possible implementations of distributed transaction management.
We are also experimenting with other implementations, for example pg_tsdtm, which is based on timestamps. It doesn't require a central arbiter and so shows much better (almost linear) scalability.
But recovery in the case of pg_tsdtm is even more obscure.
Also, the performance of pg_tsdtm greatly depends on system clock synchronization and network delays. We get about 70k TPS on a cluster with 12 nodes connected by a 10Gbit network.
But when we run the same test on hosts located in different geographic regions (several thousand km apart), performance falls to 15 TPS.
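For readers unfamiliar with the timestamp-based approach: very roughly, the visibility idea can be caricatured as comparing commit timestamps (used as commit sequence numbers) against a snapshot timestamp. The sketch below shows only this general flavor under that assumption, not the actual pg_tsdtm rules:

    #include "postgres.h"

    typedef uint64 csn_t;   /* commit timestamp acting as a CSN */

    /* A committed transaction is visible to a snapshot iff it committed
     * before the snapshot was taken.  With per-node clocks this is only
     * correct if clock skew is bounded, which is why clock
     * synchronization and network delays matter so much here. */
    static bool
    CommitVisibleInSnapshot(csn_t commit_csn, csn_t snapshot_csn)
    {
        return commit_csn <= snapshot_csn;
    }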
<div style="color:#000; background-color:#fff; font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande,sans-serif;font-size:16px">On Tuesday, November 17, 2015 12:43 AM, konstantin knizhnik <k.knizhnik@postgrespro.ru>wrote:<br class="" id="yui_3_16_0_1_1447334196169_995702" />> On Nov 16, 2015, at 11:21PM, Kevin Grittner wrote:<br class="" id="yui_3_16_0_1_1447334196169_995704" /><br class="" id="yui_3_16_0_1_1447334196169_995706"/>>> If you are saying that DTM tries to roll back a transaction after<br class=""id="yui_3_16_0_1_1447334196169_995708" />>> any participating server has entered the RecordTransactionCommit()<brclass="" id="yui_3_16_0_1_1447334196169_995710" />>> critical section, then IMO it is broken. Full stop. That can't<br class="" id="yui_3_16_0_1_1447334196169_995712" />>> work with any reasonable semanticsas far as I can see.<br class="" id="yui_3_16_0_1_1447334196169_995714" />><br class="" id="yui_3_16_0_1_1447334196169_995716"/>> DTM is not trying to rollback committed transaction.<br class="" id="yui_3_16_0_1_1447334196169_995718"/>> What he tries to do is to hide this commit.<br class="" id="yui_3_16_0_1_1447334196169_995720"/>> As I already wrote, the idea was to implement "lightweight" 2PC<br class=""id="yui_3_16_0_1_1447334196169_995722" />> because prepared transactions mechanism in PostgreSQL adds too much<brclass="" id="yui_3_16_0_1_1447334196169_995724" />> overhead and cause soe problems with recovery.<br class=""id="yui_3_16_0_1_1447334196169_995726" /><br class="" id="yui_3_16_0_1_1447334196169_995728" />The point remainsthat there must be *some* "point of no return"<br class="" id="yui_3_16_0_1_1447334196169_995730" />beyond which rollback(or "hiding" is not possible). Until this<br class="" id="yui_3_16_0_1_1447334196169_995732" />point, all heavyweightlocks held by the transaction must be<br class="" id="yui_3_16_0_1_1447334196169_995734" />maintained withoutinterruption, data modification of the<br class="" id="yui_3_16_0_1_1447334196169_995736" />transaction must not bevisible, and any attempt to update or<br class="" id="yui_3_16_0_1_1447334196169_995738" />delete data updated or deletedby the transaction must block or<br class="" id="yui_3_16_0_1_1447334196169_995740" />throw an error. It sounds likeyou are attempting to move the<br class="" id="yui_3_16_0_1_1447334196169_995742" />point at which this "point of noreturn" is, but it isn't as clear<br class="" id="yui_3_16_0_1_1447334196169_995744" />as I would like. 
It seems likeall participating nodes are<br class="" id="yui_3_16_0_1_1447334196169_995746" />responsible for notifying the arbiterthat they have completed, and<br class="" id="yui_3_16_0_1_1447334196169_995748" />until then the arbiter gets involvedin every visibility check,<br class="" id="yui_3_16_0_1_1447334196169_995750" />overriding the "normal" value?<brclass="" id="yui_3_16_0_1_1447334196169_995752" /><br class="" id="yui_3_16_0_1_1447334196169_995754" />> Thetransaction is normally committed in xlog, so that it can<br class="" id="yui_3_16_0_1_1447334196169_995756" />> alwaysbe recovered in case of node fault.<br class="" id="yui_3_16_0_1_1447334196169_995758" />> But before setting correspondentbit(s) in CLOG and releasing<br class="" id="yui_3_16_0_1_1447334196169_995760" />> locks we first contactarbiter to get global status of transaction.<br class="" id="yui_3_16_0_1_1447334196169_995762" />> If it is successfullylocally committed by all nodes, then<br class="" id="yui_3_16_0_1_1447334196169_995764" />> arbiter approvescommit and commit of transaction normally<br class="" id="yui_3_16_0_1_1447334196169_995766" />> completed.<brclass="" id="yui_3_16_0_1_1447334196169_995768" />> Otherwise arbiter rejects commit. In this case DTM marks<brclass="" id="yui_3_16_0_1_1447334196169_995770" />> transaction as aborted in CLOG and returns error to the client.<brclass="" id="yui_3_16_0_1_1447334196169_995772" />> XLOG is not changed and in case of failure PostgreSQL willtry to<br class="" id="yui_3_16_0_1_1447334196169_995774" />> replay this transaction.<br class="" id="yui_3_16_0_1_1447334196169_995776"/>> But during recovery it also tries to restore transaction status<br class=""id="yui_3_16_0_1_1447334196169_995778" />> in CLOG.<br class="" id="yui_3_16_0_1_1447334196169_995780" />>And at this placeDTM contacts arbiter to know status of<br class="" id="yui_3_16_0_1_1447334196169_995782" />> transaction.<brclass="" id="yui_3_16_0_1_1447334196169_995784" />> If it is marked as aborted in arbiter's CLOG, thenit wiull be<br class="" id="yui_3_16_0_1_1447334196169_995786" />> also marked as aborted in local CLOG.<br class=""id="yui_3_16_0_1_1447334196169_995788" />> And according to PostgreSQL visibility rules no other transaction<brclass="" id="yui_3_16_0_1_1447334196169_995790" />> will see changes made by this transaction.<br class=""id="yui_3_16_0_1_1447334196169_995792" /><br class="" id="yui_3_16_0_1_1447334196169_995794" />If a node goes throughcrash and recovery after it has written its<br class="" id="yui_3_16_0_1_1447334196169_995796" />commit informationto xlog, how are its heavyweight locks, etc.,<br class="" id="yui_3_16_0_1_1447334196169_995798" />maintainedthroughout? For example, does each arbiter node have<br class="" id="yui_3_16_0_1_1447334196169_995800" />thecomplete set of heavyweight locks? 
(Basically, all the<br class="" id="yui_3_16_0_1_1447334196169_995802" />informationwhich can be written to files in pg_twophase must be<br class="" id="yui_3_16_0_1_1447334196169_995804" />heldsomewhere by all arbiter nodes, and used where appropriate.)<br class="" id="yui_3_16_0_1_1447334196169_995806" /><brclass="" id="yui_3_16_0_1_1447334196169_995808" />If a participating node is lost after some other nodes have told<brclass="" id="yui_3_16_0_1_1447334196169_995810" />the arbiter that they have committed, and the lost node will never<brclass="" id="yui_3_16_0_1_1447334196169_995812" />be able to indicate that it is committed or rolled back, what is<brclass="" id="yui_3_16_0_1_1447334196169_995814" />the mechanism for resolving that?<br class="" id="yui_3_16_0_1_1447334196169_995816"/><br class="" id="yui_3_16_0_1_1447334196169_995818" />>>> We can not justcall elog(ERROR,...) in SetTransactionStatus<br class="" id="yui_3_16_0_1_1447334196169_995820" />>>> implementationbecause inside critical section it cause Postgres<br class="" id="yui_3_16_0_1_1447334196169_995822" />>>>crash with panic message. So we have to remember that transaction is<br class="" id="yui_3_16_0_1_1447334196169_995824"/>>>> rejected and report error later after exit from critical section:<brclass="" id="yui_3_16_0_1_1447334196169_995826" />>><br class="" id="yui_3_16_0_1_1447334196169_995828"/>>> I don't believe that is a good plan. You should not enter the<br class=""id="yui_3_16_0_1_1447334196169_995830" />>> critical section for recording that a commit is complete untilall<br class="" id="yui_3_16_0_1_1447334196169_995832" />>> the work for the commit is done except for tellingthe all the<br class="" id="yui_3_16_0_1_1447334196169_995834" />>> servers that all servers are ready.<br class=""id="yui_3_16_0_1_1447334196169_995836" />><br class="" id="yui_3_16_0_1_1447334196169_995838" />> It is goodpoint.<br class="" id="yui_3_16_0_1_1447334196169_995840" />> May be it is the reason of performance scalability problemswe<br class="" id="yui_3_16_0_1_1447334196169_995842" />> have noticed with DTM.<br class="" id="yui_3_16_0_1_1447334196169_995844"/><br class="" id="yui_3_16_0_1_1447334196169_995846" />Well, certainly the first phaseof two-phase commit can take place<br class="" id="yui_3_16_0_1_1447334196169_995848" />in parallel, and once that iscomplete then the second phase<br class="" id="yui_3_16_0_1_1447334196169_995850" />(commit or rollback of all the participatingprepared transactions)<br class="" id="yui_3_16_0_1_1447334196169_995852" />can take place in parallel. Thereis no need to serialize that.<br class="" id="yui_3_16_0_1_1447334196169_995854" /><br class="" id="yui_3_16_0_1_1447334196169_995856"/>> Sorry, some clarification.<br class="" id="yui_3_16_0_1_1447334196169_995858"/>> We get 10x slowdown of performance caused by 2pc on very heavy<br class="" id="yui_3_16_0_1_1447334196169_995860"/>> load on the IBM system with 256 cores.<br class="" id="yui_3_16_0_1_1447334196169_995862"/>> At "normal" servers slowdown of 2pc is smaller - about 2x.<br class="" id="yui_3_16_0_1_1447334196169_995864"/><br class="" id="yui_3_16_0_1_1447334196169_995866" />That suggests some contentionpoint, probably on spinlocks. 
Were<br class="" id="yui_3_16_0_1_1447334196169_995868" />you able to identify theparticular hot spot(s)?<br class="" id="yui_3_16_0_1_1447334196169_995870" /><br class="" id="yui_3_16_0_1_1447334196169_995872"/><br class="" id="yui_3_16_0_1_1447334196169_995874" />On Tuesday, November 17, 20153:09 AM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:<br class="" id="yui_3_16_0_1_1447334196169_995876"/>> On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:<br class="" id="yui_3_16_0_1_1447334196169_995878"/><br class="" id="yui_3_16_0_1_1447334196169_995880" />>> I think the generalidea is that if Commit is WAL logged, then the<br class="" id="yui_3_16_0_1_1447334196169_995882" />>> operationis considered to committed on local node and commit should<br class="" id="yui_3_16_0_1_1447334196169_995884" />>>happen on any node, only once prepare from all nodes is successful.<br class="" id="yui_3_16_0_1_1447334196169_995886"/>>> And after that transaction is not supposed to abort. But I think you are<brclass="" id="yui_3_16_0_1_1447334196169_995888" />>> trying to optimize the DTM in some way to not follow thatkind of protocol.<br class="" id="yui_3_16_0_1_1447334196169_995890" />><br class="" id="yui_3_16_0_1_1447334196169_995892"/>> DTM is still following 2PC protocol:<br class="" id="yui_3_16_0_1_1447334196169_995894"/>> First transaction is saved in WAL at all nodes and only after it<br class=""id="yui_3_16_0_1_1447334196169_995896" />> commit is completed at all nodes.<br class="" id="yui_3_16_0_1_1447334196169_995898"/><br class="" id="yui_3_16_0_1_1447334196169_995900" />So, essentially you are treatingthe traditional commit point as<br class="" id="yui_3_16_0_1_1447334196169_995902" />phase 1 in a new approach totwo-phase commit, and adding another<br class="" id="yui_3_16_0_1_1447334196169_995904" />layer to override normal visibilitychecking and record locks<br class="" id="yui_3_16_0_1_1447334196169_995906" />(etc.) past that point?<br class=""id="yui_3_16_0_1_1447334196169_995908" /><br class="" id="yui_3_16_0_1_1447334196169_995910" />> We try to avoidmaintaining of separate log files for 2PC (as now<br class="" id="yui_3_16_0_1_1447334196169_995912" />> for preparedtransactions) and do not want to change logic of<br class="" id="yui_3_16_0_1_1447334196169_995914" />> work withWAL.<br class="" id="yui_3_16_0_1_1447334196169_995916" />><br class="" id="yui_3_16_0_1_1447334196169_995918" />>DTM approach is based on the assumption that PostgreSQL CLOG and<br class="" id="yui_3_16_0_1_1447334196169_995920"/>> visibility rules allows to "hide" transaction even if it is<br class="" id="yui_3_16_0_1_1447334196169_995922"/>> committed in WAL.<br class="" id="yui_3_16_0_1_1447334196169_995924" /><br class=""id="yui_3_16_0_1_1447334196169_995926" />I see where you could get a performance benefit from not recording<br class=""id="yui_3_16_0_1_1447334196169_995928" />(and cleaning up) persistent state for a transaction in the<br class=""id="yui_3_16_0_1_1447334196169_995930" />pg_twophase directory between the time the transaction is prepared<br class=""id="yui_3_16_0_1_1447334196169_995932" />and when it is committed (which should normally be a very short<br class=""id="yui_3_16_0_1_1447334196169_995934" />period of time, but must survive crashes, and communication<br class=""id="yui_3_16_0_1_1447334196169_995936" />failures). 
Essentially you are trying to keep that in RAM instead,<br class=""id="yui_3_16_0_1_1447334196169_995938" />and counting on multiple processes at different locations<br class="" id="yui_3_16_0_1_1447334196169_995940"/>redundantly (and synchronously) storing this data to ensure<br class="" id="yui_3_16_0_1_1447334196169_995942"/>persistence, rather than writing the data to disk files when are<br class="" id="yui_3_16_0_1_1447334196169_995944"/>deleted as soon as the prepared transaction is committed or rolled<br class="" id="yui_3_16_0_1_1447334196169_995946"/>back.<br class="" id="yui_3_16_0_1_1447334196169_995948" /><br class="" id="yui_3_16_0_1_1447334196169_995950"/>I wonder whether it might not be safer to just do that -- rather<br class="" id="yui_3_16_0_1_1447334196169_995952"/>than trying to develop a whole new way of implementing two-phase<br class="" id="yui_3_16_0_1_1447334196169_995954"/>commit, just come up with a new way to persist the information<br class="" id="yui_3_16_0_1_1447334196169_995956"/>which must survive between the prepared and the later commit or<br class="" id="yui_3_16_0_1_1447334196169_995958"/>rollback of the prepared transaction. Essentially, provide hooks<br class="" id="yui_3_16_0_1_1447334196169_995960"/>for persisting the data when preparing a transaction, and the<br class="" id="yui_3_16_0_1_1447334196169_995962"/>arbiter would set the hooks to a function to send the data there.<br class="" id="yui_3_16_0_1_1447334196169_995964"/>Likewise with the release of the information (normally a very small<br class="" id="yui_3_16_0_1_1447334196169_995966"/>fraction of a second later). The rest of the arbiter code becomes<br class="" id="yui_3_16_0_1_1447334196169_995968"/>a distributed transaction manager. It's not a trivial job to get<br class="" id="yui_3_16_0_1_1447334196169_995970"/>that right, but at least it is a very well-understood problem, and<br class="" id="yui_3_16_0_1_1447334196169_995972"/>is not likely to take as long to develop and shake out tricky<br class="" id="yui_3_16_0_1_1447334196169_995974"/>data-eating bugs.<br class="" id="yui_3_16_0_1_1447334196169_995976" /><br class=""id="yui_3_16_0_1_1447334196169_995978" />--<br class="" id="yui_3_16_0_1_1447334196169_995980" />Kevin Grittner<brclass="" id="yui_3_16_0_1_1447334196169_995982" />EDB: http://www.enterprisedb.com<br class="" id="yui_3_16_0_1_1447334196169_995984"/>The Enterprise PostgreSQL Company<br class="" id="yui_3_16_0_1_1447334196169_995986"/><div dir="ltr"><br /></div></div>
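One conceivable shape for the hooks sketched above -- invented names, purely illustrative: an arbiter plugin would install functions that ship the pg_twophase payload to the arbiter cluster instead of writing local state files:

    #include "postgres.h"

    /* Invented, illustrative hook set -- not an existing PostgreSQL API. */
    typedef struct TwoPhaseStateHooks
    {
        /* called at PREPARE instead of writing the pg_twophase file */
        void    (*persist) (TransactionId xid, const char *data, Size len);

        /* called at COMMIT/ROLLBACK PREPARED to drop the saved state */
        void    (*release) (TransactionId xid);

        /* called during recovery to fetch state still pending a verdict;
         * returns the number of bytes copied into buf */
        Size    (*restore) (TransactionId xid, char *buf, Size bufsize);
    } TwoPhaseStateHooks;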
konstantin knizhnik wrote:

> The transaction is normally committed in the xlog, so that it can
> always be recovered in case of a node fault.
> But before setting the corresponding bit(s) in the CLOG and releasing
> locks, we first contact the arbiter to get the global status of the
> transaction.
> If it was successfully locally committed by all nodes, then the
> arbiter approves the commit and the commit completes normally.
> Otherwise the arbiter rejects the commit.  In this case DTM marks the
> transaction as aborted in the CLOG and returns an error to the
> client.
> The XLOG is not changed, and in case of failure PostgreSQL will try
> to replay this transaction.
> But during recovery it also tries to restore the transaction status
> in the CLOG.
> And at this place DTM contacts the arbiter to learn the status of the
> transaction.
> If it is marked as aborted in the arbiter's CLOG, then it will also
> be marked as aborted in the local CLOG.
> And according to PostgreSQL visibility rules, no other transaction
> will see changes made by this transaction.

One problem I see with this approach is that WAL replay can happen long after the WAL was written; for instance you might have saved a base backup and WAL stream and replay it all several days or weeks later, when the arbiter no longer has information about the XID.  Later transactions might (will) depend on the aborted state of the transaction in question, so this effectively corrupts the database.

In other words, while it's reasonable to require that the arbiter can always be contacted for transaction commit/abort at run time, it's not reasonable to contact the arbiter during WAL replay.

I think this merits more explanation:

> The transaction is normally committed in the xlog, so that it can
> always be recovered in case of a node fault.

Why would anyone want to "recover" a transaction that was aborted?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services