Re: Question concerning XTM (eXtensible Transaction Manager API) - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: Question concerning XTM (eXtensible Transaction Manager API)
Date
Msg-id 1945168568.5065550.1447774054158.JavaMail.yahoo@mail.yahoo.com
In response to Re: Question concerning XTM (eXtensible Transaction Manager API)  (konstantin knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On Tuesday, November 17, 2015 12:43 AM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

> On Nov 16, 2015, at 11:21 PM, Kevin Grittner wrote:
>
>> If you are saying that DTM tries to roll back a transaction after
>> any participating server has entered the RecordTransactionCommit()
>> critical section, then IMO it is broken.  Full stop.  That can't
>> work with any reasonable semantics as far as I can see.
>
> DTM is not trying to roll back a committed transaction.
> What it tries to do is to hide this commit.
> As I already wrote, the idea was to implement "lightweight" 2PC
> because the prepared transactions mechanism in PostgreSQL adds too
> much overhead and causes some problems with recovery.

The point remains that there must be *some* "point of no return"
beyond which rollback (or "hiding") is not possible.  Until this
point, all heavyweight locks held by the transaction must be
maintained without interruption, data modifications of the
transaction must not be visible, and any attempt to update or
delete data updated or deleted by the transaction must block or
throw an error.  It sounds like you are attempting to move this
"point of no return", but where it lands isn't as clear as I
would like.  It seems like all participating nodes are
responsible for notifying the arbiter that they have completed,
and until then the arbiter gets involved in every visibility
check, overriding the "normal" value?
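To make the question concrete, here is a toy sketch (hypothetical names, not DTM's actual code) of what "the arbiter gets involved in every visibility check" seems to imply: a locally-committed transaction stays invisible until the arbiter records a global COMMITTED status.

```python
# Hypothetical sketch: until the arbiter rules, the arbiter's global
# status overrides the local CLOG hint in every visibility check.

LOCAL_CLOG = {}   # xid -> "committed" | "aborted" | "in_progress"
ARBITER = {}      # xid -> global status, consulted while in doubt

def is_visible(xid):
    """Visibility check with the arbiter overriding the local CLOG."""
    global_status = ARBITER.get(xid)
    if global_status is not None:
        # Arbiter involvement: its verdict wins over the local hint.
        return global_status == "committed"
    return LOCAL_CLOG.get(xid) == "committed"

# Node committed locally, but the arbiter still holds the global fate:
LOCAL_CLOG[100] = "committed"
ARBITER[100] = "in_progress"
assert not is_visible(100)      # still hidden from other snapshots

ARBITER[100] = "committed"      # "point of no return" reached
assert is_visible(100)
```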
> The transaction is normally committed in xlog, so that it can
> always be recovered in case of node fault.
> But before setting the correspondent bit(s) in CLOG and releasing
> locks we first contact the arbiter to get the global status of
> the transaction.
> If it is successfully locally committed by all nodes, then the
> arbiter approves the commit and the commit of the transaction
> completes normally.
> Otherwise the arbiter rejects the commit.  In this case DTM marks
> the transaction as aborted in CLOG and returns an error to the
> client.
> XLOG is not changed and in case of failure PostgreSQL will try to
> replay this transaction.
> But during recovery it also tries to restore the transaction
> status in CLOG.
> And at this place DTM contacts the arbiter to learn the status of
> the transaction.
> If it is marked as aborted in the arbiter's CLOG, then it will
> also be marked as aborted in the local CLOG.
> And according to PostgreSQL visibility rules no other transaction
> will see changes made by this transaction.
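The sequence described above can be walked through in a toy form (all names hypothetical): commit goes to xlog first, CLOG is only updated after the arbiter rules, and recovery re-asks the arbiter before restoring the CLOG status.

```python
# Toy simulation of the described commit and recovery paths.

class Arbiter:
    def __init__(self): self.status = {}
    def global_status(self, xid): return self.status.get(xid, "aborted")

class Node:
    def __init__(self, arbiter):
        self.xlog = set()       # durably "committed in xlog"
        self.clog = {}          # xid -> "committed" | "aborted"
        self.arbiter = arbiter

    def commit(self, xid):
        self.xlog.add(xid)                      # 1. commit record in xlog
        if self.arbiter.global_status(xid) == "committed":
            self.clog[xid] = "committed"        # 2a. arbiter approves
            return "ok"
        self.clog[xid] = "aborted"              # 2b. arbiter rejects;
        return "error"                          #     xlog is left unchanged

    def recover(self, xid):
        if xid in self.xlog:                    # replayed from xlog, but the
            # CLOG status is restored from the arbiter's verdict
            self.clog[xid] = self.arbiter.global_status(xid)

arb = Arbiter()
arb.status[7] = "aborted"       # some participant failed to commit
n = Node(arb)
assert n.commit(7) == "error" and n.clog[7] == "aborted"

n2 = Node(arb)                  # crash/recovery path: xlog holds the
n2.xlog.add(7)                  # commit, but the arbiter says no
n2.recover(7)
assert n2.clog[7] == "aborted"
```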
If a node goes through crash and recovery after it has written its
commit information to xlog, how are its heavyweight locks, etc.,
maintained throughout?  For example, does each arbiter node have
the complete set of heavyweight locks?  (Basically, all the
information which can be written to files in pg_twophase must be
held somewhere by all arbiter nodes, and used where appropriate.)

If a participating node is lost after some other nodes have told
the arbiter that they have committed, and the lost node will never
be able to indicate that it is committed or rolled back, what is
the mechanism for resolving that?
>>> We cannot just call elog(ERROR, ...) in the SetTransactionStatus
>>> implementation because inside a critical section it causes a
>>> Postgres crash with a panic message.  So we have to remember
>>> that the transaction is rejected and report the error later,
>>> after exit from the critical section:
>>
>> I don't believe that is a good plan.  You should not enter the
>> critical section for recording that a commit is complete until
>> all the work for the commit is done except for telling all the
>> servers that all servers are ready.
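The ordering being recommended can be sketched as follows (all names are hypothetical, not PostgreSQL's actual API): every step that can still fail, such as preparing on the peers, happens before the critical section, where a plain error and abort is still safe; only bookkeeping that must not fail goes inside it.

```python
# Sketch: fallible coordination first, critical section last.

class Peer:
    def __init__(self, ok=True): self.ok = ok
    def prepare(self, xid): return self.ok

def commit_transaction(xid, peers, committed):
    for p in peers:
        if not p.prepare(xid):   # can fail: plain ERROR/abort is fine here
            return "abort"
    # --- critical section starts: an elog(ERROR) here would PANIC ---
    committed.add(xid)           # record the commit; must not fail
    # --- critical section ends ---
    return "commit"

committed = set()
assert commit_transaction(1, [Peer(), Peer()], committed) == "commit"
assert 1 in committed
assert commit_transaction(2, [Peer(), Peer(ok=False)], committed) == "abort"
assert 2 not in committed
```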
> It is a good point.
> Maybe that is the reason for the performance scalability problems
> we have noticed with DTM.

Well, certainly the first phase of two-phase commit can take place
in parallel, and once that is complete then the second phase
(commit or rollback of all the participating prepared transactions)
can take place in parallel.  There is no need to serialize that.
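That parallelism can be sketched with a thread pool (node behavior is simulated): phase 1 (prepare) is issued to all participants concurrently, and only after every prepare succeeds is phase 2 (commit) issued, again concurrently.

```python
# Toy 2PC driver with both phases fanned out in parallel.
from concurrent.futures import ThreadPoolExecutor

def two_phase_commit(nodes):
    with ThreadPoolExecutor() as pool:
        prepared = list(pool.map(lambda n: n.prepare(), nodes))  # phase 1
        if all(prepared):
            list(pool.map(lambda n: n.commit(), nodes))          # phase 2
            return "committed"
        list(pool.map(lambda n: n.rollback(), nodes))
        return "rolled back"

class FakeNode:
    def __init__(self, ok=True): self.ok, self.state = ok, "idle"
    def prepare(self): return self.ok
    def commit(self): self.state = "committed"
    def rollback(self): self.state = "rolled back"

nodes = [FakeNode(), FakeNode(), FakeNode()]
assert two_phase_commit(nodes) == "committed"
assert all(n.state == "committed" for n in nodes)

nodes = [FakeNode(), FakeNode(ok=False)]
assert two_phase_commit(nodes) == "rolled back"
```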
> Sorry, some clarification.
> We get a 10x slowdown caused by 2PC under very heavy load on the
> IBM system with 256 cores.
> On "normal" servers the slowdown from 2PC is smaller - about 2x.

That suggests some contention point, probably on spinlocks.  Were
you able to identify the particular hot spot(s)?
On Tuesday, November 17, 2015 3:09 AM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

> On Nov 17, 2015, at 10:44 AM, Amit Kapila wrote:
>
>> I think the general idea is that if Commit is WAL logged, then
>> the operation is considered committed on the local node, and
>> commit should happen on any node only once prepare from all nodes
>> is successful.  And after that the transaction is not supposed to
>> abort.  But I think you are trying to optimize the DTM in some
>> way to not follow that kind of protocol.
>
> DTM is still following the 2PC protocol:
> first the transaction is saved in WAL at all nodes, and only after
> that is the commit completed at all nodes.
So, essentially you are treating the traditional commit point as
phase 1 in a new approach to two-phase commit, and adding another
layer to override normal visibility checking and record locks
(etc.) past that point?

> We try to avoid maintaining separate log files for 2PC (as is
> done now for prepared transactions) and do not want to change the
> logic of working with WAL.
>
> The DTM approach is based on the assumption that PostgreSQL CLOG
> and visibility rules allow us to "hide" a transaction even if it
> is committed in WAL.
I see where you could get a performance benefit from not recording
(and cleaning up) persistent state for a transaction in the
pg_twophase directory between the time the transaction is prepared
and when it is committed (which should normally be a very short
period of time, but must survive crashes and communication
failures).  Essentially you are trying to keep that in RAM instead,
and counting on multiple processes at different locations
redundantly (and synchronously) storing this data to ensure
persistence, rather than writing the data to disk files which are
deleted as soon as the prepared transaction is committed or rolled
back.
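In a toy form (all names hypothetical), the trade-off amounts to this: instead of one durable file per prepared transaction, the state is written synchronously to several arbiter replicas and survives as long as at least one replica does.

```python
# Toy replicated in-RAM stand-in for a pg_twophase state file.

class ArbiterReplica:
    def __init__(self): self.twophase_state = {}

def store_prepared(replicas, xid, state):
    """Synchronous, redundant in-RAM store replacing a pg_twophase file."""
    for r in replicas:                   # must reach *all* replicas before
        r.twophase_state[xid] = state    # the prepare is acknowledged

def release(replicas, xid):
    for r in replicas:                   # analog of deleting the file at
        r.twophase_state.pop(xid, None)  # commit/rollback of the prepare

replicas = [ArbiterReplica() for _ in range(3)]
store_prepared(replicas, 42, {"locks": ["rel 16384"], "status": "prepared"})
survivor = replicas[2]                   # even if two replicas are lost...
assert survivor.twophase_state[42]["status"] == "prepared"
release(replicas, 42)
assert 42 not in survivor.twophase_state
```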
I wonder whether it might not be safer to just do that -- rather
than trying to develop a whole new way of implementing two-phase
commit, just come up with a new way to persist the information
which must survive between the prepare and the later commit or
rollback of the prepared transaction.  Essentially, provide hooks
for persisting the data when preparing a transaction, and the
arbiter would set the hooks to a function to send the data there.
Likewise with the release of the information (normally a very small
fraction of a second later).  The rest of the arbiter code becomes
a distributed transaction manager.  It's not a trivial job to get
that right, but at least it is a very well-understood problem, and
is not likely to take as long to develop and shake out tricky
data-eating bugs.
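One shape such hooks could take (entirely hypothetical; the message only proposes the idea): the core 2PC code calls whatever persist/release callbacks are installed, and an arbiter plugin replaces the default file-based ones with its own transport.

```python
# Toy pluggable persistence hooks for prepared-transaction state.
import json, os, tempfile

# Default hooks: the current pg_twophase-style file per transaction.
def file_persist(xid, state):
    with open(os.path.join(tempfile.gettempdir(), f"2pc_{xid}"), "w") as f:
        json.dump(state, f)

def file_release(xid):
    os.remove(os.path.join(tempfile.gettempdir(), f"2pc_{xid}"))

persist_hook, release_hook = file_persist, file_release

def prepare_transaction(xid, state):
    persist_hook(xid, state)     # pluggable persistence on PREPARE

def finish_prepared(xid):
    release_hook(xid)            # pluggable cleanup on COMMIT/ROLLBACK

# An arbiter extension swaps in hooks that ship the state elsewhere:
sent = {}
def arbiter_persist(xid, state): sent[xid] = state
def arbiter_release(xid): sent.pop(xid)
persist_hook, release_hook = arbiter_persist, arbiter_release

prepare_transaction(9, {"locks": [], "owner": "node1"})
assert 9 in sent
finish_prepared(9)
assert 9 not in sent
```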
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-hackers by date:

Previous
From: Jim Nasby
Date:
Subject: Re: Extracting fields from 'infinity'::TIMESTAMP[TZ]
Next
From: Jim Nasby
Date:
Subject: Re: Freeze avoidance of very large table.