Thread: eXtensible Transaction Manager API
Hi,

The PostgresPro cluster team wants to announce a proposal for an eXtensible Transaction Manager API and a reference implementation of a distributed transaction manager (pg_dtm). pg_dtm is just a standard PostgreSQL extension which should be installed in the normal way.

Source of pg_dtm and PostgreSQL patches are available here: https://github.com/postgrespro/pg_dtm
WiKi page: https://wiki.postgresql.org/wiki/DTM

xtm.patch patches the PostgreSQL core by replacing direct calls of 7 TM functions with indirect calls and adding 3 additional events to transaction commit callbacks:
postgres_fdw.patch patches the postgres_fdw extension to work with DTM.

We also have a pgDTM implementation which uses timestamps (system time) as CSNs. It is based on the same XTM transaction API. We will publish it later once we clarify problems with recovery and performance with this approach. We also have a patch for pg_shard to work with DTM, which will be published soon.

Waiting for your feedback,
Konstantin Knizhnik, Postgres Professional
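To make the idea of "replacing direct calls with indirect calls" concrete, here is a minimal conceptual sketch in Go; the actual patch installs a C struct of function pointers inside the PostgreSQL core, and every identifier below is illustrative rather than a real XTM symbol:

package main

import "fmt"

// TransactionManager stands in for the table of transaction-manager
// entry points that the patch routes through indirect calls.
type TransactionManager interface {
	GetTransactionStatus(xid uint32) string
	SetTransactionStatus(xid uint32, status string)
	GetSnapshot() []uint32
}

// localTM mimics the built-in behaviour: everything is resolved locally.
type localTM struct{}

func (localTM) GetTransactionStatus(xid uint32) string { return "in progress" }
func (localTM) SetTransactionStatus(xid uint32, status string) {
	fmt.Printf("xid %d -> %s (written to local clog)\n", xid, status)
}
func (localTM) GetSnapshot() []uint32 { return []uint32{100, 101} }

// CurrentTM plays the role of the swappable global pointer the core patch
// introduces; loading a DTM extension would overwrite it with a distributed
// implementation.
var CurrentTM TransactionManager = localTM{}

func main() {
	// Core code always calls through the indirection and does not care
	// whether a local or a distributed manager is installed.
	CurrentTM.SetTransactionStatus(100, "committed")
	fmt.Println(CurrentTM.GetTransactionStatus(101))
	fmt.Println(CurrentTM.GetSnapshot())
}

The point is only that core code keeps calling through one swappable table, so an extension such as pg_dtm can substitute a distributed implementation without further core changes.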
On 31 October 2015 at 17:22, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:

> Waiting for your feedback

For anyone wondering about performance impact, there are some graphs on page 23 of the PDF presentation. I didn't see anything else, and the graphs don't seem to cover a comparison of Pg with the XTM transaction manager hooks and no DTM enabled vs Pg without the hooks, i.e. the hook overhead itself.

Have you done much work on that? Personally I wouldn't expect to see any meaningful overhead, but I'd really like to have numbers behind that.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 31 October 2015 at 10:22, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
--
Hi,
PostgresPro cluster team wants to announce proposal for eXtensible Transaction Manager API and reference implementation of distributed transaction manager (pg_dtm).
pg_dtm is just a standard PostgreSQL extension which should be installed in normal way.
Source of pg_dtm and PostgreSQL patches are available here: https://github.com/postgrespro/pg_dtm
WiKi page: https://wiki.postgresql.org/wiki/DTM
Very interesting work.
xtm.patch patches PostgreSQL core by replacing direct calls of 7 TM functions with indirect calls and
At first I was concerned about recovery, but that looks to be covered.
PostgreSQL assumes that top-level xid commit is atomic, along with all of its subtransactions. So the API having access to only Get/Set at the xid level would not work. We would need TransactionIdSetTreeStatus() rather than just SetStatus. GetStatus is not atomic.
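A toy restatement of that requirement (illustrative Go only; the real interface is TransactionIdSetTreeStatus() in C, and nothing below is the actual API): the top-level xid and all of its subtransaction xids have to change state in one atomic step, so no reader can see the parent committed while a child still looks in progress.

package main

import (
	"fmt"
	"sync"
)

// statusStore is a toy stand-in for clog: it records a status per xid.
type statusStore struct {
	mu     sync.Mutex
	status map[uint32]string
}

// SetTreeStatus changes the top-level xid and all of its subtransaction
// xids in one atomic step, under a single lock.
func (s *statusStore) SetTreeStatus(parent uint32, subxids []uint32, st string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[parent] = st
	for _, x := range subxids {
		s.status[x] = st
	}
}

func (s *statusStore) Get(xid uint32) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if st, ok := s.status[xid]; ok {
		return st
	}
	return "in progress"
}

func main() {
	s := &statusStore{status: map[uint32]string{}}
	// Commit xid 100 together with its subtransactions 101 and 102.
	s.SetTreeStatus(100, []uint32{101, 102}, "committed")
	fmt.Println(s.Get(100), s.Get(101), s.Get(102))
}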
adding 3 additional events to transaction commit callbacks:
Those look uncontentious, so we should add those anyway in a sub-patch.
We have also pgDTM implementation which is using timestamps (system time) as CSNs.
It is also based on the same XTM transaction API.
We will publish it later once we clarify problems with recovery and performance with this approach.
That is the most interesting part, so it needs to be published.
At present, I can't tell whether you need subtrans and multixact calls in the API as well. I would imagine we probably do, though we might imagine an implementation that didn't support those concepts.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02.11.2015 06:17, Craig Ringer wrote:
> On 31 October 2015 at 17:22, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> Waiting for your feedback
>
> For anyone wondering about performance impact, there are some graphs on page 23 of the PDF presentation. I didn't see anything else, and the graphs don't seem to cover comparison of Pg with the XTM transaction manager hooks and no DTM enabled vs Pg without the hooks, i.e. the hook overhead itself.
>
> Have you done much work on that? Personally I wouldn't expect to see any meaningful overhead, but I'd really like to have numbers behind that.

Overhead of an indirect call is negligible - see for example https://gist.github.com/rianhunter/0be8dc116b120ad5fdd4
But we have certainly performed a comparison of PostgreSQL with/without the XTM patch. Pgbench results are almost the same - within the measurement error:

With XTM:

transaction type: TPC-B (sort of)
scaling factor: 70
query mode: simple
number of clients: 144
number of threads: 24
duration: 600 s
number of transactions actually processed: 12275179
latency average: 7.037 ms
latency stddev: 46.787 ms
tps = 20456.945469 (including connections establishing)
tps = 20457.164023 (excluding connections establishing)

Without XTM:

transaction type: TPC-B (sort of)
scaling factor: 70
query mode: simple
number of clients: 144
number of threads: 24
duration: 600 s
number of transactions actually processed: 12086367
latency average: 7.156 ms
latency stddev: 48.431 ms
tps = 20114.491785 (including connections establishing)
tps = 20116.074391 (excluding connections establishing)
On 02.11.2015 12:01, Simon Riggs wrote:
> At first I was concerned about recovery, but that looks to be covered.

Yes, we have not considered all possible scenarios of working with PostgreSQL.
We have tested work with different isolation levels: repeatable read and read committed,
but we have not tested "select for update" and explicit locking.

> PostgreSQL assumes that top-level xid commit is atomic, along with all of its subtransactions. So the API having access to only Get/Set at the xid level would not work. We would need TransactionIdSetTreeStatus() rather than just SetStatus. GetStatus is not atomic.

The XTM API has a function SetTransactionStatus, but it actually encapsulates TransactionIdSetTreeStatus.

>> adding 3 additional events to transaction commit callbacks:
> Those look uncontentious, so we should add those anyway in a sub-patch.

>> We have also pgDTM implementation which is using timestamps (system time) as CSNs.
>> It is also based on the same XTM transaction API.
>> We will publish it later once we clarify problems with recovery and performance with this approach.
> That is the most interesting part, so it needs to be published.

It will be available during this week.

> At present, I can't tell whether you need subtrans and multixact calls in the API as well. I would imagine we probably do, though we might imagine an implementation that didn't support those concepts.

Postgres-XL doesn't support subtransactions. Neither do we at the moment.
Support of subtransactions will be our next step.
On 2 November 2015 at 13:24, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
--
PostgreSQL assumes that top-level xid commit is atomic, along with all of its subtransactions. So the API having access to only Get/Set at the xid level would not work. We would need TransactionIdSetTreeStatus() rather than just SetStatus. GetStatus is not atomic.
The XTM API has a function SetTransactionStatus, but it actually encapsulates TransactionIdSetTreeStatus.
OK, thanks.
At present, I can't tell whether you need subtrans and multixact calls in the API as well. I would imagine we probably do, though we might imagine an implementation that didn't support those concepts.
Postgres-XL doesn't support subtransactions. Neither do we at the moment.
Support of subtransactions will be our next step.
OK. I guess it depends where we want to put the patch; discussion here means in-core, so for it to go in here, we do need subxacts etc.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 31, 2015 at 2:52 PM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
Hi,
PostgresPro cluster team wants to announce proposal for eXtensible Transaction Manager API and reference implementation of distributed transaction manager (pg_dtm).
pg_dtm is just a standard PostgreSQL extension which should be installed in normal way.
Source of pg_dtm and PostgreSQL patches are available here: https://github.com/postgrespro/pg_dtm
WiKi page: https://wiki.postgresql.org/wiki/DTM
xtm.patch patches PostgreSQL core by replacing direct calls of 7 TM functions with indirect calls and adding 3 additional events to transaction commit callbacks:
postgres_fdw.patch patches postgres_fdw extension to work with DTM
We have also pgDTM implementation which is using timestamps (system time) as CSNs.
It is also based on the same XTM transaction API.
We will publish it later once we clarify problems with recovery and performance with this approach.
Also we have patch for pg_shard to work with DTM, which also will be published soon.
Today, while studying your proposal and related material, I noticed
that in both the approaches DTM and tsDTM, you are talking about
committing a transaction and acquiring the snapshot consistently, but
not touched upon how the locks will be managed across nodes and
how deadlock detection across nodes will work. This will also be one
of the crucial points in selecting one of the approaches. Also I have
noticed that discussion about Rollback is not there; for example, how will
Rollback happen with the APIs provided in your second approach (tsDTM)?
Similarly, having some discussion on parts of recovery that could be affected
would be great.
I think in this patch, it is important to see the completeness of all the
APIs that need to be exposed for the implementation of distributed
transactions, and the same is difficult to visualize without having a complete
picture of all the components that have some interaction with the distributed
transaction system. On the other hand we can do it in incremental fashion
as and when more parts of the design are clear.
Hi,
Thank you for your feedback.
My comments are inside.
On 11/07/2015 05:11 PM, Amit Kapila wrote:
> Today, while studying your proposal and related material, I noticed
> that in both the approaches DTM and tsDTM, you are talking about
> committing a transaction and acquiring the snapshot consistently, but
> not touched upon how the locks will be managed across nodes and
> how deadlock detection across nodes will work. This will also be one
> of the crucial points in selecting one of the approaches.
Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should the distributed lock manager (DLM) do something else except detection of distributed deadlocks?
2. Should the DLM be part of the XTM API or should it be a separate API?
3. Should the DLM be implemented by a separate process or should it be part of the arbiter (dtmd)?
4. How to globally identify resource owners (transactions) in the global lock graph? In case of DTM we have global (shared) XIDs,
and in tsDTM - global transaction IDs, assigned by the application (where it is not so clear how to retrieve them).
In other cases we may need to have a local->global transaction id mapping, so it looks like the DLM should be part of the DTM...
> Also I have noticed that discussion about Rollback is not there, example how will
> Rollback happen with API's provided in your second approach (tsDTM)?
In the tsDTM approach two-phase commit is performed by the coordinator and currently uses the standard PostgreSQL two-phase commit:
Code in GO performing two phase commit:
exec(conn1, "prepare transaction '" + gtid + "'")
exec(conn2, "prepare transaction '" + gtid + "'")
exec(conn1, "select dtm_begin_prepare($1)", gtid)
exec(conn2, "select dtm_begin_prepare($1)", gtid)
csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
exec(conn1, "commit prepared '" + gtid + "'")
exec(conn2, "commit prepared '" + gtid + "'")
If commit at some of the nodes fails, the coordinator should roll back the prepared transaction at all nodes.
> Similarly, having some discussion on parts of recovery that could be affected
> would be great.
We are currently implementing fault tolerance and recovery for DTM approach (with centralized arbiter).
There are several replicas of arbiter, synchronized using RAFT protocol.
But with tsDTM approach recovery model is still obscure...
We are thinking about it.
> I think in this patch, it is important to see the completeness of all the
> API's that needs to be exposed for the implementation of distributed
> transactions and the same is difficult to visualize without having complete
> picture of all the components that has some interaction with the distributed
> transaction system. On the other hand we can do it in incremental fashion
> as and when more parts of the design are clear.
That is exactly what we are going to do - we are trying to integrate DTM with existing systems (pg_shard, postgres_fdw, BDR) and find out what is missing and should be added. In parallel we are trying to compare the efficiency and scalability of different solutions.
For example, we are still considering scalability problems with the tsDTM approach: to provide acceptable performance, it requires very precise clock synchronization (we have to use PTP instead of NTP). So it may be a waste of time trying to provide fault tolerance for tsDTM if we finally find out that this approach cannot provide better scalability than the simpler DTM approach.
On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
> In tsDTM approach two phase commit is performed by coordinator and currently
> is using standard PostgreSQL two phase commit:
>
> Code in GO performing two phase commit:
>
> exec(conn1, "prepare transaction '" + gtid + "'")
> exec(conn2, "prepare transaction '" + gtid + "'")
> exec(conn1, "select dtm_begin_prepare($1)", gtid)
> exec(conn2, "select dtm_begin_prepare($1)", gtid)
> csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
> csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
> exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
> exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
> exec(conn1, "commit prepared '" + gtid + "'")
> exec(conn2, "commit prepared '" + gtid + "'")
>
> If commit at some of the nodes failed, coordinator should rollback prepared
> transaction at all nodes.

Not always. If COMMIT PREPARED fails at some of the nodes but succeeds on others, the transaction is already partially acknowledged as committed in the cluster. Hence it makes more sense for the coordinator to commit transactions on the remaining nodes. Before issuing any COMMIT PREPARED queries, I guess that's fine to rollback the transactions on all nodes though.
--
Michael
On 7 November 2015 at 16:53, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
--
On 11/07/2015 05:11 PM, Amit Kapila wrote:
Today, while studying your proposal and related material, I noticed that in both the approaches DTM and tsDTM, you are talking about committing a transaction and acquiring the snapshot consistently, but not touched upon how the locks will be managed across nodes and how deadlock detection across nodes will work. This will also be one of the crucial points in selecting one of the approaches.
Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should distributed lock manager (DLM) do something else except detection of distributed deadlock?
2. Should DLM be part of XTM API or it should be separate API?
3. Should DLM be implemented by separate process or should it be part of arbiter (dtmd).
4. How to globally identify resource owners (transactions) in global lock graph. In case of DTM we have global (shared) XIDs,
and in tsDTM - global transactions IDs, assigned by application (which is not so clear how to retrieve).
In other cases we may need to have local->global transaction id mapping, so looks like DLM should be part of DTM...
Yes, we need a Distributed Lock Manager, but I think it's a separate thing from the DTM.
(I'm loath to use the phrase DLM, which was used within Oracle Parallel Server for a buffer lock manager, which is not what is being discussed.)
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/08/2015 02:46 PM, Michael Paquier wrote:
> On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
>> In tsDTM approach two phase commit is performed by coordinator and currently is using standard PostgreSQL two phase commit:
>> Code in GO performing two phase commit:
>> exec(conn1, "prepare transaction '" + gtid + "'")
>> exec(conn2, "prepare transaction '" + gtid + "'")
>> exec(conn1, "select dtm_begin_prepare($1)", gtid)
>> exec(conn2, "select dtm_begin_prepare($1)", gtid)
>> csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
>> csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
>> exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
>> exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
>> exec(conn1, "commit prepared '" + gtid + "'")
>> exec(conn2, "commit prepared '" + gtid + "'")
>> If commit at some of the nodes failed, coordinator should rollback prepared transaction at all nodes.
> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds on others, the transaction is already partially acknowledged as committed in the cluster. Hence it makes more sense for the coordinator to commit transactions on the remaining nodes. Before issuing any COMMIT PREPARED queries, I guess that's fine to rollback the transactions on all nodes though.

We will get inconsistency if the transaction is committed on some subset of the nodes involved in the transaction.
Assume a bank debit-credit example. If some distributed transaction transfers money from the account at one node to the account at another node, then committing the transaction just at one node causes an incorrect total balance.
The main goal of DTM is to preserve ACID semantics for distributed transactions, so either the transaction is committed at all nodes, or it is not committed at all.
On 8 November 2015 at 16:59, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
--
> On 11/08/2015 02:46 PM, Michael Paquier wrote:
>> On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
>>> In tsDTM approach two phase commit is performed by coordinator and currently
>>> is using standard PostgreSQL two phase commit:
>>> Code in GO performing two phase commit:
>>> exec(conn1, "prepare transaction '" + gtid + "'")
>>> exec(conn2, "prepare transaction '" + gtid + "'")
>>> exec(conn1, "select dtm_begin_prepare($1)", gtid)
>>> exec(conn2, "select dtm_begin_prepare($1)", gtid)
>>> csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
>>> csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
>>> exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
>>> exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
>>> exec(conn1, "commit prepared '" + gtid + "'")
>>> exec(conn2, "commit prepared '" + gtid + "'")
>>> If commit at some of the nodes failed, coordinator should rollback prepared
>>> transaction at all nodes.
>> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
>> on others, the transaction is already partially acknowledged as
>> committed in the cluster. Hence it makes more sense for the
>> coordinator to commit transactions on the remaining nodes. Before
>> issuing any COMMIT PREPARED queries, I guess that's fine to rollback
>> the transactions on all nodes though.
> We will get inconsistency if transaction is committed on some subset of nodes involved in transaction.
> Assume bank debit-credit example. If some distributed transaction transfers money from the account at one node to the account and another node,
> then committing transaction just at one node cause incorrect total balance.
> The main goal of DTM is to preserve ACID semantic for distributed transaction, so either transaction is committed at all nodes, either it is not committed at all.
Agreed.
COMMIT PREPARED is a pretty thin layer; the work is all done in the PREPARE. With a DTM, the main commit itself is done once only in the DTM, so all the COMMIT PREPARED would do is release local node resources.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 9, 2015 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 8 November 2015 at 16:59, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> On 11/08/2015 02:46 PM, Michael Paquier wrote:
>>> On Sun, Nov 8, 2015 at 1:53 AM, Konstantin Knizhnik wrote:
>>>> In tsDTM approach two phase commit is performed by coordinator and currently is using standard PostgreSQL two phase commit:
>>>> Code in GO performing two phase commit:
>>>> exec(conn1, "prepare transaction '" + gtid + "'")
>>>> exec(conn2, "prepare transaction '" + gtid + "'")
>>>> exec(conn1, "select dtm_begin_prepare($1)", gtid)
>>>> exec(conn2, "select dtm_begin_prepare($1)", gtid)
>>>> csn = _execQuery(conn1, "select dtm_prepare($1, 0)", gtid)
>>>> csn = _execQuery(conn2, "select dtm_prepare($1, $2)", gtid, csn)
>>>> exec(conn1, "select dtm_end_prepare($1, $2)", gtid, csn)
>>>> exec(conn2, "select dtm_end_prepare($1, $2)", gtid, csn)
>>>> exec(conn1, "commit prepared '" + gtid + "'")
>>>> exec(conn2, "commit prepared '" + gtid + "'")
>>>> If commit at some of the nodes failed, coordinator should rollback prepared transaction at all nodes.
>>> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds on others, the transaction is already partially acknowledged as committed in the cluster. Hence it makes more sense for the coordinator to commit transactions on the remaining nodes. Before issuing any COMMIT PREPARED queries, I guess that's fine to rollback the transactions on all nodes though.
>> We will get inconsistency if transaction is committed on some subset of nodes involved in transaction. Assume bank debit-credit example. If some distributed transaction transfers money from the account at one node to the account and another node, then committing transaction just at one node cause incorrect total balance. The main goal of DTM is to preserve ACID semantic for distributed transaction, so either transaction is committed at all nodes, either it is not committed at all.
>
> Agreed.
>
> COMMIT PREPARED is a pretty thin layer; the work is all done in the PREPARE. With a DTM, the main commit itself is done once only in the DTM, so all the COMMIT PREPARED would do is release local node resources.

Sure. Now imagine that the pg_twophase entry is corrupted for this transaction on one node. This would trigger a PANIC on it, and the transaction would not be committed everywhere. I am aware of the fact that by definition PREPARE TRANSACTION ensures that a transaction will be committed with COMMIT PREPARED, just trying to see any corner cases with the approach proposed. The DTM approach is actually rather close to what a GTM in Postgres-XC does :)
--
Michael
On Sat, Nov 7, 2015 at 10:23 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should distributed lock manager (DLM) do something else except detection of distributed deadlock?
I think so. Basically DLM should be responsible for maintaining
all the lock information which in turn means that any backend process
that needs to acquire/release lock needs to interact with DLM, without that
I don't think even global deadlock detection can work (to detect deadlocks
among all the nodes, it needs to know the lock info of all nodes).
This is somewhat in line with what we currently do during lock conflicts,
i.e. if the backend incurs a lock conflict and decides to sleep, before
sleeping it tries to detect an early deadlock, so if we make DLM responsible
for managing locks the same could be even achieved in distributed system.
2. Should DLM be part of XTM API or it should be separate API?
We might be able to do it either ways (make it part of XTM API or devise a
separate API). I think here more important point is to first get the high level
design for Distributed Transactions (which primarily includes consistent
Commit/Rollback, snapshots, distributed lock manager (DLM) and recovery).
3. Should DLM be implemented by separate process or should it be part of arbiter (dtmd).
That's important decision. I think it will depend on which kind of design
we choose for distributed transaction manager (arbiter based solution or
non-arbiter based solution, something like tsDTM). I think DLM should be
separate, else arbiter will become hot-spot with respect to contention.
4. How to globally identify resource owners (transactions) in global lock graph. In case of DTM we have global (shared) XIDs,
and in tsDTM - global transactions IDs, assigned by application (which is not so clear how to retrieve).
In other cases we may need to have local->global transaction id mapping, so looks like DLM should be part of DTM...
I think the DLM should in itself have all the necessary information to find
deadlocks or anything else required by locking system. For what kind of
cases, do you envision to identify global resource owners?
And I think if we require some interaction between DLM and DTM, then we
can implement the same, rather than making DLM part of DTM.
Also I have noticed that discussion about Rollback is not there, example how will Rollback happen with API's provided in your second approach (tsDTM)?
In tsDTM approach two phase commit is performed by coordinator and currently is using standard PostgreSQL two phase commit:
If commit at some of the nodes failed, coordinator should rollback prepared transaction at all nodes.
Can you please explain more about tsDTM approach, how timestamps
are used, what is exactly CSN (is it Commit Sequence Number) and
how it is used in the prepare phase? Is CSN used as a timestamp?
Is the coordinator also one of the PostgreSQL instance?
I think in this patch, it is important to see the completeness of all the API's that needs to be exposed for the implementation of distributed transactions and the same is difficult to visualize without having complete picture of all the components that has some interaction with the distributed transaction system. On the other hand we can do it in incremental fashion as and when more parts of the design are clear.
That is exactly what we are going to do - we are trying to integrate DTM with existed systems (pg_shard, postgres_fdw, BDR) and find out what is missed and should be added. In parallel we are trying to compare efficiency and scalability of different solutions.
One thing I have noticed is that in the DTM approach, you seem to
be considering a centralized (Arbiter) transaction manager and
centralized Lock manager, which seems to be a workable approach,
but I think such an approach won't scale in write-heavy or mixed
read-write workload. Have you thought about distributing responsibility
of global transaction and lock management? I think such a system
might be somewhat difficult to design, but the scalability will be better.
On 09.11.2015 07:46, Amit Kapila wrote:
On Sat, Nov 7, 2015 at 10:23 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should distributed lock manager (DLM) do something else except detection of distributed deadlock?

I think so. Basically DLM should be responsible for maintaining all the lock information which in turn means that any backend process that needs to acquire/release lock needs to interact with DLM, without that I don't think even global deadlock detection can work (to detect deadlocks among all the nodes, it needs to know the lock info of all nodes).
I hope that it will not be needed, otherwise it will add a significant performance penalty.
Unless I missed something, locks can still be managed locally, but we need the DLM to detect global deadlocks.
Deadlock detection in PostgreSQL is performed only after a timeout expires for lock waiting. Only then is the lock graph analyzed for the presence of loops. At that moment we can send our local lock graph to the DLM. If it is really a distributed deadlock, then sooner or later
all nodes involved in this deadlock will send their local graphs to the DLM, and the DLM will be able to build the global graph and detect the distributed deadlock. In this scenario the DLM will be accessed very rarely - only when a lock cannot be granted within deadlock_timeout.
One of the possible problems is false global deadlock detection, because the local lock graphs correspond to different moments of time, so we can find a "false" loop. I am still thinking about the best solution to this problem.
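As a rough illustration of the kind of check such a DLM could run once it has merged the local wait-for graphs sent to it (all identifiers below are illustrative, and real code would also have to deal with the staleness problem just mentioned):

package main

import "fmt"

// waitFor maps a waiting transaction to the transactions it waits for;
// transaction ids are assumed to be already globally unique.
type waitFor map[string][]string

// hasCycle reports whether the merged graph contains a cycle, i.e. a
// candidate distributed deadlock, using a simple DFS with colouring.
func hasCycle(g waitFor) bool {
	const (
		white = 0 // unvisited
		grey  = 1 // on the current DFS path
		black = 2 // fully explored
	)
	color := map[string]int{}
	var visit func(string) bool
	visit = func(v string) bool {
		color[v] = grey
		for _, w := range g[v] {
			if color[w] == grey {
				return true // back edge: cycle found
			}
			if color[w] == white && visit(w) {
				return true
			}
		}
		color[v] = black
		return false
	}
	for v := range g {
		if color[v] == white && visit(v) {
			return true
		}
	}
	return false
}

func main() {
	// Local graphs sent by two nodes after deadlock_timeout expired there.
	node1 := waitFor{"t1": {"t2"}}
	node2 := waitFor{"t2": {"t1"}}

	// The DLM merges them and checks for a loop.
	global := waitFor{}
	for _, g := range []waitFor{node1, node2} {
		for k, v := range g {
			global[k] = append(global[k], v...)
		}
	}
	fmt.Println("distributed deadlock suspected:", hasCycle(global))
}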
This is somewhat inline with what currently we do during lock conflicts, i.e if the backend incur a lock conflict and decides to sleep, before sleeping it tries to detect an early deadlock, so if we make DLM responsible for managing locks the same could be even achieved in distributed system.

2. Should DLM be part of XTM API or it should be separate API?

We might be able to do it either ways (make it part of XTM API or devise a separate API). I think here more important point is to first get the high level design for Distributed Transactions (which primarily includes consistent Commit/Rollback, snapshots, distributed lock manager (DLM) and recovery).

3. Should DLM be implemented by separate process or should it be part of arbiter (dtmd).

That's important decision. I think it will depend on which kind of design we choose for distributed transaction manager (arbiter based solution or non-arbiter based solution, something like tsDTM). I think DLM should be separate, else arbiter will become hot-spot with respect to contention.
There are pros and cons.
Pros for integrating the DLM in the DTM:
1. The DTM arbiter has information about the local to global transaction ID mapping which may be needed by the DLM
2. If my assumptions about the DLM are correct, then it will be accessed relatively rarely and should not cause a significant
impact on performance.
Cons:
1. tsDTM doesn't need centralized arbiter but still needs DLM
2. Logically DLM and DTM are independent components
Can you please explain more about tsDTM approach, how timestamps are used, what is exactly CSN (is it Commit Sequence Number) and how it is used in prepare phase? Is CSN used as timestamp? Is the coordinator also one of the PostgreSQL instance?
In tsDTM approach system time (in microseconds) is used as CSN (commit sequence number).
We also enforce that assigned CSN is unique and monotonic within PostgreSQL instance.
CSN are assigned locally and do not require interaction with some other cluster nodes.
This is why in theory tsDTM approach should provide good scalability.
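A tiny sketch of how such a CSN source could behave (illustrative Go only; the actual implementation lives in the C extension and its details are not spelled out in this thread):

package main

import (
	"fmt"
	"sync"
	"time"
)

// csnSource hands out commit sequence numbers derived from the system
// clock in microseconds, forced to be unique and monotonic within one
// instance: if the clock has not advanced (or went backwards), the
// previous value is simply bumped by one.
type csnSource struct {
	mu   sync.Mutex
	last int64
}

func (s *csnSource) next() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now().UnixMicro()
	if now <= s.last {
		now = s.last + 1
	}
	s.last = now
	return now
}

func main() {
	var s csnSource
	fmt.Println(s.next(), s.next(), s.next()) // strictly increasing
}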
I think in this patch, it is important to see the completeness of all the API's that needs to be exposed for the implementation of distributed transactions and the same is difficult to visualize without having complete picture of all the components that has some interaction with the distributed transaction system. On the other hand we can do it in incremental fashion as and when more parts of the design are clear.
That is exactly what we are going to do - we are trying to integrate DTM with existed systems (pg_shard, postgres_fdw, BDR) and find out what is missed and should be added. In parallel we are trying to compare efficiency and scalability of different solutions.

One thing, I have noticed that in DTM approach, you seems to be considering a centralized (Arbiter) transaction manager and centralized Lock manager which seems to be workable approach, but I think such an approach won't scale in write-heavy or mixed read-write workload. Have you thought about distributing responsibility of global transaction and lock management?
From my point of view there are two different scenarios:
1. When most transactions are local and only a few of them are global (for example most operations in a branch of the bank are
performed on accounts of clients of this branch, but there are a few transactions involving accounts from different branches).
2. When most or all transactions are global.
It seems to me that the first scenario is more common in real life, and actually good performance of a distributed system can be achieved only in the case when most transactions are local (involve only one node). There are several approaches allowing to optimize local transactions, for example the one used in SAP HANA (http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf).
We have also DTM implementation based on this approach but it is not yet working.
If most of the transactions are global, they affect random subsets of cluster nodes (so it is not possible to logically split the cluster into groups of tightly coupled nodes), and if the number of nodes is not very large (<10), then I do not think that there can be a better alternative (from the performance point of view) than a centralized arbiter.
But these are only my speculations, and it would be really interesting for me to learn the access patterns of real customers using distributed systems.
I think such a system might be somewhat difficult to design, but the scalability will be better.
On 8 November 2015 at 23:35, Michael Paquier <michael.paquier@gmail.com> wrote:
--
>> COMMIT PREPARED is a pretty thin layer; the work is all done in the PREPARE.
>> With a DTM, the main commit itself is done once only in the DTM, so all the
>> COMMIT PREPARED would do is release local node resources.
> Sure. Now imagine that the pg_twophase entry is corrupted for this
> transaction on one node. This would trigger a PANIC on it, and
> transaction would not be committed everywhere. I am aware of the fact
> that by definition PREPARE TRANSACTION ensures that a transaction will
> be committed with COMMIT PREPARED, just trying to see any corner cases
> with the approach proposed. The DTM approach is actually rather close
> to what a GTM in Postgres-XC does :)
This is wrong.
There is no "approach proposed", this is just normal 2PC feature that PostgreSQL has supported since 8.1. All that is proposed here is that we have an API that allows this to be exposed.
The whole point of PREPARE is that it allows a database to crash after that step and it can still be recovered. That has nothing to do with X-anything.
In this case, the role of the GTM is to record the commit. Other nodes consult the GTM, not local state to see whether the transaction has committed, acting as the transaction manager in an XA sense.
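In toy form, the division of labour being described is roughly this (purely illustrative; in reality the lookup would be a network call to the arbiter/GTM rather than a local map):

package main

import (
	"fmt"
	"sync"
)

// arbiter records the single, authoritative commit decision per global
// transaction id.
type arbiter struct {
	mu        sync.RWMutex
	committed map[string]bool
}

// commit records the one and only commit decision for a global transaction.
func (a *arbiter) commit(gtid string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.committed[gtid] = true
}

// isCommitted is what a node would ask (over the network in reality)
// when it needs to know whether a distributed transaction committed,
// instead of trusting its local state alone.
func (a *arbiter) isCommitted(gtid string) bool {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.committed[gtid]
}

func main() {
	a := &arbiter{committed: map[string]bool{}}
	a.commit("gtid-42")
	fmt.Println(a.isCommitted("gtid-42"), a.isCommitted("gtid-43")) // true false
}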
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Sure. Now imagine that the pg_twophase entry is corrupted for this transaction on one node. This would trigger a PANIC on it, and transaction would not be committed everywhere.

If the database is corrupted, there's no way to guarantee that anything works as planned. This is like criticizing somebody's disaster recovery plan on the basis that it will be inadequate if the entire planet earth is destroyed.

> I am aware of the fact that by definition PREPARE TRANSACTION ensures that a transaction will be committed with COMMIT PREPARED, just trying to see any corner cases with the approach proposed. The DTM approach is actually rather close to what a GTM in Postgres-XC does :)

Yes. I think that we should try to learn as much as possible from the XC experience, but that doesn't mean we should incorporate XC's fuzzy thinking about 2PC into PG. We should not.

One point I'd like to mention is that it's absolutely critical to design this in a way that minimizes network roundtrips without compromising correctness. XC's GTM proxy suggests that they failed to do that. I think we really need to look at what's going to be on the other sides of the proposed APIs and think about whether it's going to be possible to have a strong local caching layer that keeps network roundtrips to a minimum. We should consider whether the need for such a caching layer has any impact on what the APIs should look like.

For example, consider a 10-node cluster where each node has 32 cores and 32 clients, and each client is running lots of short-running SQL statements. The demand for snapshots will be intense. If every backend separately requests a snapshot for every SQL statement from the coordinator, that's probably going to be terrible. We can make it the problem of the stuff behind the DTM API to figure out a way to avoid that, but maybe that's going to result in every DTM needing to solve the same problems.

Another thing that I think we need to consider is fault-tolerance. For example, suppose that transaction commit and snapshot coordination services are being provided by a central server which keeps track of the global commit ordering. When that server gets hit by a freak bolt of lightning and melted into a heap of slag, somebody else needs to take over. Again, this would live below the API proposed here, but I think it really deserves some thought before we get too far down the path. XC didn't start thinking about how to add fault-tolerance until quite late in the project, I think, and the same could be said of PostgreSQL itself: some newer systems have easier-to-use fault tolerance mechanisms because it was designed in from the beginning. Distributed systems by nature need high availability to a far greater degree than single systems, because when there are more nodes, node failures are more frequent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
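One of many shapes such a caching layer could take - none of this is part of the proposal, and all names are illustrative - is to coalesce concurrent snapshot requests from backends on the same node into a single round trip to the coordinator; a hedged sketch:

package main

import (
	"fmt"
	"sync"
)

type snapshot struct{ xmin, xmax uint64 }

// snapshotCache lets at most one fetch be in flight at a time; callers
// that arrive while a fetch is running wait for its result and reuse it.
type snapshotCache struct {
	mu      sync.Mutex
	pending *sync.WaitGroup // in-flight request, if any
	current snapshot
	fetch   func() snapshot // stands in for one network round trip
}

func (c *snapshotCache) get() snapshot {
	c.mu.Lock()
	if c.pending != nil {
		// Someone else is already asking the coordinator: wait and reuse.
		wg := c.pending
		c.mu.Unlock()
		wg.Wait()
		c.mu.Lock()
		defer c.mu.Unlock()
		return c.current
	}
	wg := &sync.WaitGroup{}
	wg.Add(1)
	c.pending = wg
	c.mu.Unlock()

	s := c.fetch() // the single round trip
	c.mu.Lock()
	c.current = s
	c.pending = nil
	c.mu.Unlock()
	wg.Done()
	return s
}

func main() {
	calls := 0
	c := &snapshotCache{fetch: func() snapshot { calls++; return snapshot{100, 200} }}
	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.get() }()
	}
	wg.Wait()
	fmt.Println("round trips:", calls, "for 32 concurrent requests")
}

Whether something like this belongs behind the XTM API or inside each DTM implementation is exactly the open question raised above.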
On 9 November 2015 at 18:46, Robert Haas <robertmhaas@gmail.com> wrote:
--
> I am aware of the fact
> that by definition PREPARE TRANSACTION ensures that a transaction will
> be committed with COMMIT PREPARED, just trying to see any corner cases
> with the approach proposed. The DTM approach is actually rather close
> to what a GTM in Postgres-XC does :)
Yes. I think that we should try to learn as much as possible from the
XC experience, but that doesn't mean we should incorporate XC's fuzzy
thinking about 2PC into PG. We should not.
Fuzzy thinking. Please explain.
One point I'd like to mention is that it's absolutely critical to
design this in a way that minimizes network roundtrips without
compromising correctness. XC's GTM proxy suggests that they failed to
do that. I think we really need to look at what's going to be on the
other sides of the proposed APIs and think about whether it's going to
be possible to have a strong local caching layer that keeps network
roundtrips to a minimum. We should consider whether the need for such
a caching layer has any impact on what the APIs should look like.
You mean the caching layer that already exists in XL/XC?
For example, consider a 10-node cluster where each node has 32 cores
and 32 clients, and each client is running lots of short-running SQL
statements. The demand for snapshots will be intense. If every
backend separately requests a snapshot for every SQL statement from
the coordinator, that's probably going to be terrible. We can make it
the problem of the stuff behind the DTM API to figure out a way to
avoid that, but maybe that's going to result in every DTM needing to
solve the same problems.
The whole purpose of that XTM API is to allow different solutions for that to be created. Konstantin has made a very good case for such an API to exist, based around 3 markedly different approaches.
Whether we allow the API into core to be accessible via extensions is a different issue, but it looks fine for its purpose.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 9, 2015 at 2:08 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
On 09.11.2015 07:46, Amit Kapila wrote:
I think so. Basically DLM should be responsible for maintaining all the lock information which in turn means that any backend process that needs to acquire/release lock needs to interact with DLM, without that I don't think even global deadlock detection can work (to detect deadlocks among all the nodes, it needs to know the lock info of all nodes).
I hope that it will not needed, otherwise it will add significant performance penalty.
Unless I missed something, locks can still managed locally, but we need DLM to detect global deadlocks.
How will you check lock conflicts? For example, Process A on node-1
tries to acquire a lock on object-1, but Process B from node-2 already
holds a conflicting lock on object-1. Now if we try to think in a way
that we will have an entry of the lock request from node-2 on node-1, then
I think it will be difficult to manage and release locks.
3. Should DLM be implemented by separate process or should it be part of arbiter (dtmd).

That's important decision. I think it will depend on which kind of design we choose for distributed transaction manager (arbiter based solution or non-arbiter based solution, something like tsDTM). I think DLM should be separate, else arbiter will become hot-spot with respect to contention.
There are pros and contras.
Pros for integrating DLM in DTM:
1. DTM arbiter has information about local to global transaction ID mapping which may be needed to DLM
2. If my assumptions about DLM are correct, then it will be accessed relatively rarely and should not cause significant
impact on performance.
Yeah, if the usage of DLM is relatively low, then it can make sense
to club them, but otherwise it doesn't make much sense.
Contras:
1. tsDTM doesn't need centralized arbiter but still needs DLM
2. Logically DLM and DTM are independent components

Can you please explain more about tsDTM approach, how timestamps are used, what is exactly CSN (is it Commit Sequence Number) and how it is used in prepare phase? Is CSN used as timestamp? Is the coordinator also one of the PostgreSQL instance?
In tsDTM approach system time (in microseconds) is used as CSN (commit sequence number).
We also enforce that assigned CSN is unique and monotonic within PostgreSQL instance.
CSN are assigned locally and do not require interaction with some other cluster nodes.
This is why in theory tsDTM approach should provide good scalability.
Okay, but won't checking visibility of tuples need transaction to CSN
mapping?
From my point of view there are two different scenarios:
1. When most transactions are local and only few of them are global (for example most operation in branch of the back are
performed with accounts of clients of this branch, but there are few transactions involved accounts from different branches).
2. When most or all transactions are global.
It seems to me that first approach is more popular in real life and actually good performance of distributed system can be achieved only in case when most transaction are local (involves only one node). There are several approaches allowing to optimize local transactions. For example once used in SAP HANA (http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf)
We have also DTM implementation based on this approach but it is not yet working.
If most of transaction are global, them affect random subsets of cluster nodes (so it is not possible to logically split cluster into groups of tightly coupled nodes) and number of nodes is not very large (<10) then I do not think that there can be better alternative (from performance point of view) than centralized arbiter.
I am slightly confused by the above statement; it seems for both
the scenarios you seem to be suggesting to have a centralized
arbiter. I think when most transactions are global, having a centralized
arbiter might not be a good solution, especially if the number of nodes in the cluster
is large.
But it is only my speculations and it will be really very interesting for me to know access patterns of real customers, using distributed systems.
I think it is better if the solution is optimized for all kinds of scenarios,
because once the solution is adopted by PostgreSQL, it will be very
difficult to change it.
On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 November 2015 at 18:46, Robert Haas <robertmhaas@gmail.com> wrote:
>> > I am aware of the fact that by definition PREPARE TRANSACTION ensures that a transaction will be committed with COMMIT PREPARED, just trying to see any corner cases with the approach proposed. The DTM approach is actually rather close to what a GTM in Postgres-XC does :)
>>
>> Yes. I think that we should try to learn as much as possible from the XC experience, but that doesn't mean we should incorporate XC's fuzzy thinking about 2PC into PG. We should not.
>
> Fuzzy thinking. Please explain.

Multiple people who have worked on XC have argued to me that we need to defend against the case where PREPARE TRANSACTION succeeds and COMMIT PREPARED fails, say because somebody manually mucked with the files in the data directory. I think that if people are manually mucking with files in the data directory, we have no hope of achieving sane behavior, and there's not much point in expending code on defending against any individual way that could happen.

>> One point I'd like to mention is that it's absolutely critical to design this in a way that minimizes network roundtrips without compromising correctness. XC's GTM proxy suggests that they failed to do that. I think we really need to look at what's going to be on the other sides of the proposed APIs and think about whether it's going to be possible to have a strong local caching layer that keeps network roundtrips to a minimum. We should consider whether the need for such a caching layer has any impact on what the APIs should look like.
>
> You mean the caching layer that already exists in XL/XC?

I don't think that's what I mean, no. If you have an external GTM Proxy, then you have missed a trick, because whatever caching it does could be done better inside the process that sent the request to the proxy.

>> For example, consider a 10-node cluster where each node has 32 cores and 32 clients, and each client is running lots of short-running SQL statements. The demand for snapshots will be intense. If every backend separately requests a snapshot for every SQL statement from the coordinator, that's probably going to be terrible. We can make it the problem of the stuff behind the DTM API to figure out a way to avoid that, but maybe that's going to result in every DTM needing to solve the same problems.
>
> The whole purpose of that XTM API is to allow different solutions for that to be created. Konstantin has made a very good case for such an API to exist, based around 3 markedly different approaches.

I'm not arguing with that.

> Whether we allow the API into core to be accessible via extensions is a different issue, but it looks fine for its purpose.

I'm not attacking the API. I'm trying to have a discussion about the important design issues in this area.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this transaction on one node. This would trigger a PANIC on it, and transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that anything works as planned. This is like criticizing somebody's disaster recovery plan on the basis that it will be inadequate if the entire planet earth is destroyed.

As well as there could be FS, OS, network problems... To come back to the point, my point is simply that I found surprising the sentence of Konstantin upthread saying that if commit fails on some of the nodes we should rollback the prepared transaction on all nodes. In the example given, in the phase after calling dtm_end_prepare, say we perform COMMIT PREPARED correctly on node 1, but then failed it on node 2 because a meteor has hit a server, it seems that we cannot rollback, instead we had better rolling in a backup and be sure that the transaction gets committed. How would you rollback the transaction already committed on node 1? But perhaps I missed something...

>> I am aware of the fact that by definition PREPARE TRANSACTION ensures that a transaction will be committed with COMMIT PREPARED, just trying to see any corner cases with the approach proposed. The DTM approach is actually rather close to what a GTM in Postgres-XC does :)
>
> Yes. I think that we should try to learn as much as possible from the XC experience, but that doesn't mean we should incorporate XC's fuzzy thinking about 2PC into PG. We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to design this in a way that minimizes network roundtrips without compromising correctness. XC's GTM proxy suggests that they failed to do that. I think we really need to look at what's going to be on the other sides of the proposed APIs and think about whether it's going to be possible to have a strong local caching layer that keeps network roundtrips to a minimum. We should consider whether the need for such a caching layer has any impact on what the APIs should look like.

At this time, the number of round trips needed particularly for READ COMMITTED transactions that need a new snapshot for each query was really a performance killer. We used DBT-1 (TPC-W) which is less OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit was quickly reached with 10-20 nodes.

> For example, consider a 10-node cluster where each node has 32 cores and 32 clients, and each client is running lots of short-running SQL statements. The demand for snapshots will be intense. If every backend separately requests a snapshot for every SQL statement from the coordinator, that's probably going to be terrible. We can make it the problem of the stuff behind the DTM API to figure out a way to avoid that, but maybe that's going to result in every DTM needing to solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with servers of this scale.

> Another thing that I think we need to consider is fault-tolerance. For example, suppose that transaction commit and snapshot coordination services are being provided by a central server which keeps track of the global commit ordering. When that server gets hit by a freak bolt of lightning and melted into a heap of slag, somebody else needs to take over. Again, this would live below the API proposed here, but I think it really deserves some thought before we get too far down the path. XC didn't start thinking about how to add fault-tolerance until quite late in the project, I think, and the same could be said of PostgreSQL itself: some newer systems have easier-to-use fault tolerance mechanisms because it was designed in from the beginning. Distributed systems by nature need high availability to a far greater degree than single systems, because when there are more nodes, node failures are more frequent.

Your memories on the matter are right. In the case of XC, the SPOF that is GTM has been somewhat made more stable with the creation of a GTM standby, though it did not solve the scalability limit of having a centralized snapshot facility. It actually increased the load on the whole system because for short transactions. Alea jacta est.
--
Michael
On Thu, Nov 12, 2015 at 11:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 2:47 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 November 2015 at 18:46, Robert Haas <robertmhaas@gmail.com> wrote:
>>> One point I'd like to mention is that it's absolutely critical to design this in a way that minimizes network roundtrips without compromising correctness. XC's GTM proxy suggests that they failed to do that. I think we really need to look at what's going to be on the other sides of the proposed APIs and think about whether it's going to be possible to have a strong local caching layer that keeps network roundtrips to a minimum. We should consider whether the need for such a caching layer has any impact on what the APIs should look like.
>> You mean the caching layer that already exists in XL/XC?
>
> I don't think that's what I mean, no. If you have an external GTM Proxy, then you have missed a trick, because whatever caching it does could be done better inside the process that sent the request to the proxy.

Definitely. Having a single communication channel between the backends and the centralized server that communicates with grouped messages would clearly help enhancing the scalability of the system, though this is not something the DTM should try to solve IMO.

>>> For example, consider a 10-node cluster where each node has 32 cores and 32 clients, and each client is running lots of short-running SQL statements. The demand for snapshots will be intense. If every backend separately requests a snapshot for every SQL statement from the coordinator, that's probably going to be terrible. We can make it the problem of the stuff behind the DTM API to figure out a way to avoid that, but maybe that's going to result in every DTM needing to solve the same problems.
>>
>> The whole purpose of that XTM API is to allow different solutions for that to be created. Konstantin has made a very good case for such an API to exist, based around 3 markedly different approaches.
>
> I'm not arguing with that.
>
>> Whether we allow the API into core to be accessible via extensions is a different issue, but it looks fine for its purpose.
>
> I'm not attacking the API. I'm trying to have a discussion about the important design issues in this area.

FWIW, I just looked at the wiki page regarding the DTM, and the set of routines is actually very close to what the GTM in XC/XL is doing. Where XC/XL directly forked the code of Postgres to redirect to the GTM when fetching a snapshot, TXID, whatever, DTM is using a set of generic methods to achieve this purpose and replace the in-core calls with a set of custom functions to relocate to the external instance in charge of distributing the transaction data. Other differences are that GTM is extended for global timestamps and global sequence values, but that's not really related to ACID. This just reminded me of the following message where I wondered about the possibility to add hooks in those hot code paths: http://www.postgresql.org/message-id/CAB7nPqTDjf-58wuf-xZ01NKJ7WF0E+EUKgGQHd0igVsOD4hCJQ@mail.gmail.com
--
Michael
On 13 November 2015 at 21:35, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned. This is like saying that criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.
As well as there could be FS, OS, network problems... To come back to
the point, my point is simply that I found surprising the sentence of
Konstantin upthread saying that if commit fails on some of the nodes
we should rollback the prepared transaction on all nodes. In the
example given, in the phase after calling dtm_end_prepare, say we
perform COMMIT PREPARED correctly on node 1, but then failed it on
node 2 because a meteor has hit a server, it seems that we cannot
rollback, instead we had better rolling in a backup and be sure that
the transaction gets committed. How would you rollback the transaction
already committed on node 1? But perhaps I missed something...
The usual way this works in an XA-like model is:
In phase 1 (prepare transaction, in Pg's spelling), failure on any node triggers a rollback on all nodes.
In phase 2 (commit prepared), failure on any node causes retries until it succeeds, or until the admin intervenes - say, to remove that node from operation. The global xact as a whole isn't considered successful until it's committed on all nodes.
2PC and distributed commit is well studied, including the problems. We don't have to think this up for ourselves. We don't have to invent anything here. There's a lot of distributed systems theory to work with - especially when dealing with well studied relational DBs trying to maintain ACID semantics.
Not to say that there aren't problems with the established ways. The XA API is horrific. Java's JTA follows it too closely, and whoever thought that HeuristicMixedException was a good idea.... augh.
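To make the XA-like rule above concrete, here is a minimal coordinator sketch in C++. The run_sql() helper, its stub body, and the node list are purely illustrative assumptions, not part of the proposed XTM/DTM API, and "admin intervention" is reduced to an endless retry loop.

// Minimal coordinator sketch of the XA-like rule above. run_sql() and the node
// list are illustrative assumptions, not part of the proposed XTM/DTM API.
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Placeholder: send one SQL statement to one node, return true on success.
bool run_sql(const std::string &node, const std::string &sql)
{
    (void) node; (void) sql;
    return true;   // stubbed for the sketch
}

bool global_commit(const std::vector<std::string> &nodes, const std::string &gtid)
{
    // Phase 1: PREPARE TRANSACTION everywhere; any failure aborts everywhere.
    std::vector<std::string> prepared;
    for (const auto &node : nodes)
    {
        if (!run_sql(node, "PREPARE TRANSACTION '" + gtid + "'"))
        {
            for (const auto &p : prepared)
                run_sql(p, "ROLLBACK PREPARED '" + gtid + "'");   // best effort
            return false;
        }
        prepared.push_back(node);
    }

    // Phase 2: COMMIT PREPARED everywhere; failures are only ever retried.
    // The global transaction is not considered successful until every node
    // has committed; a permanently stuck node needs admin/recovery action.
    for (const auto &node : nodes)
    {
        while (!run_sql(node, "COMMIT PREPARED '" + gtid + "'"))
            std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return true;
}

The asymmetry is the whole point: before phase 2 starts, everything is cheap to undo; after the commit decision, the only way forward is forward.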
On Saturday, November 14, 2015 8:41 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 13 November 2015 at 21:35, Michael Paquier <michael.paquier@gmail.com> wrote:
>> On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> If the database is corrupted, there's no way to guarantee that
>>> anything works as planned. This is like saying that criticizing
>>> somebody's disaster recovery plan on the basis that it will be
>>> inadequate if the entire planet earth is destroyed.

+1

Once all participating servers return "success" from the "prepare" phase, the only sane way to proceed is to commit as much as possible. (I guess you still have some room to argue about what to do after an error on the attempt to commit the first prepared transaction, but even there you need to move forward if it might have been due to a communications failure preventing a success indication for a commit which was actually successful.) The idea of two phase commit is that failure to commit after a successful prepare should be extremely rare.

>> As well as there could be FS, OS, network problems... To come back to
>> the point, my point is simply that I found surprising the sentence of
>> Konstantin upthread saying that if commit fails on some of the nodes
>> we should rollback the prepared transaction on all nodes. In the
>> example given, in the phase after calling dtm_end_prepare, say we
>> perform COMMIT PREPARED correctly on node 1, but then failed it on
>> node 2 because a meteor has hit a server, it seems that we cannot
>> rollback, instead we had better rolling in a backup and be sure that
>> the transaction gets committed. How would you rollback the transaction
>> already committed on node 1? But perhaps I missed something...

Right.

> The usual way this works in an XA-like model is:
>
> In phase 1 (prepare transaction, in Pg's spelling), failure on any node triggers a rollback on all nodes.
>
> In phase 2 (commit prepared), failure on any node causes retries until it succeeds, or until the admin intervenes - say, to remove that node from operation. The global xact as a whole isn't considered successful until it's committed on all nodes.
>
> 2PC and distributed commit is well studied, including the problems. We don't have to think this up for ourselves. We don't have to invent anything here. There's a lot of distributed systems theory to work with - especially when dealing with well studied relational DBs trying to maintain ACID semantics.

Right; it is silly not to build on decades of work on theory and practice in this area.

What is making me nervous as I watch this thread is a bit of loose talk about the I in ACID. There have been some points where it seemed to be implied that we had sufficient transaction isolation if we could get all the participating nodes using the same snapshot. I think I've seen some hint that there is an intention to use distributed strict two-phase locking with a global lock manager to achieve serializable behavior. The former is vulnerable to some well-known serialization anomalies and the latter was dropped from PostgreSQL when MVCC was implemented because of its horrible concurrency and performance characteristics, which is not going to be less of an issue in a distributed system using that technique. It may be possible to implement the Serializable Snapshot Isolation (SSI) technique across multiple cooperating nodes, but it's not obvious exactly how that would be implemented, and I have seen no discussion of that.

If we're going to talk about maintaining ACID semantics in this environment, I think we need to explicitly state what level of isolation we intend to provide, and how we intend to do that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
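As a footnote to the isolation point above: the best-known anomaly that a shared snapshot does not prevent is write skew. The following self-contained C++ toy is not DTM code, just a simulation of two transactions on disjoint rows; each preserves the invariant when run alone, both commit under the same snapshot, and the invariant is still broken afterwards.

// Write skew under snapshot isolation, simulated in one process.
// Invariant we want: at least one of the two "doctors" stays on call.
#include <iostream>

int main()
{
    bool on_call_a = true, on_call_b = true;   // committed database state

    // Both transactions start from the SAME snapshot of the committed state.
    const bool snap_a = on_call_a, snap_b = on_call_b;

    // T1: "the other doctor is on call, so I can go off call" -> updates row A.
    if (snap_b)
        on_call_a = false;                     // T1 commits
    // T2: makes the same check against its identical snapshot -> updates row B.
    if (snap_a)
        on_call_b = false;                     // T2 commits

    // Disjoint write sets, so there is no write-write conflict and both commits
    // succeed under snapshot isolation, yet no serial order of T1 and T2 could
    // have produced this state: the invariant is broken.
    std::cout << "on_call_a=" << on_call_a
              << " on_call_b=" << on_call_b << '\n';   // prints 0 0
    return 0;
}

Preventing this takes something beyond a common snapshot - either blocking-style locking or SSI-style detection of the dangerous dependency structure, which is exactly the choice under discussion.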
I am sorry, it looks like I incorrectly interpreted Michael's comment:

> Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
> on others, the transaction is already partially acknowledged as
> committed in the cluster. Hence it makes more sense for the
> coordinator to commit transactions on the remaining nodes. Before
> issuing any COMMIT PREPARED queries, I guess that's fine to rollback
> the transactions on all nodes though.

I completely agree that at the second stage of 2PC, once we make a decision to commit the transaction, there is no way back: we have to complete the commit at all nodes. If a node reports that it has successfully prepared the transaction and is ready to commit, it means that the node should be able to recover it in case of failure.

In my example of code in Go illustrating work with 2PC, I completely excluded the logic of handling errors for simplicity. Actually it should be something like this (now in C++ using pqxx):

nontransaction srcTxn(srcCon);
nontransaction dstTxn(dstCon);
string transComplete;
try {
    exec(srcTxn, "prepare transaction '%s'", gtid);
    exec(dstTxn, "prepare transaction '%s'", gtid);
    // At this moment all nodes have prepared the transaction, so in principle
    // we can make a decision to commit here...
    // But it is easier to wait until a CSN is assigned to the transaction
    // and delivered to all nodes.
    exec(srcTxn, "select dtm_begin_prepare('%s')", gtid);
    exec(dstTxn, "select dtm_begin_prepare('%s')", gtid);
    csn = execQuery(srcTxn, "select dtm_prepare('%s', 0)", gtid);
    csn = execQuery(dstTxn, "select dtm_prepare('%s', %lu)", gtid, csn);
    exec(srcTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    exec(dstTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    // If we reach this point, we make the decision to commit...
    transComplete = string("commit prepared '") + gtid + "'";
} catch (pqxx_exception const& x) {
    // ... if not, then roll back the prepared transaction
    transComplete = string("rollback prepared '") + gtid + "'";
}
pipeline srcPipe(srcTxn);
pipeline dstPipe(dstTxn);
srcPipe.insert(transComplete);
dstPipe.insert(transComplete);
srcPipe.complete();
dstPipe.complete();
On Fri, Nov 13, 2015 at 8:35 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> As well as there could be FS, OS, network problems... To come back to
> the point, my point is simply that I found surprising the sentence of
> Konstantin upthread saying that if commit fails on some of the nodes
> we should rollback the prepared transaction on all nodes. In the
> example given, in the phase after calling dtm_end_prepare, say we
> perform COMMIT PREPARED correctly on node 1, but then failed it on
> node 2 because a meteor has hit a server, it seems that we cannot
> rollback, instead we had better rolling in a backup and be sure that
> the transaction gets committed. How would you rollback the transaction
> already committed on node 1? But perhaps I missed something...

Right. In that case, we have to try to eventually get it committed everywhere.

One thing that's a bit confusing about this XTM interface is what "COMMIT" actually means. The idea is that on the standby server we will call some DTM-provided function and pass it a token. Then we will begin and commit a transaction. But presumably the commit does not actually commit, because if it's a single transaction on all nodes then the commit can't be completed until all work is done on all nodes. So my guess is that the COMMIT here is intended to behave more like a PREPARE, but this is not made explicit.

>> One point I'd like to mention is that it's absolutely critical to
>> design this in a way that minimizes network roundtrips without
>> compromising correctness. XC's GTM proxy suggests that they failed to
>> do that. I think we really need to look at what's going to be on the
>> other sides of the proposed APIs and think about whether it's going to
>> be possible to have a strong local caching layer that keeps network
>> roundtrips to a minimum. We should consider whether the need for such
>> a caching layer has any impact on what the APIs should look like.
>
> At this time, the number of round trips needed particularly for READ
> COMMITTED transactions that need a new snapshot for each query was
> really a performance killer. We used DBT-1 (TPC-W) which is less
> OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> was quickly reached with 10-20 nodes..

Yeah. I think this merits a good bit of thought. Superficially, at least, it seems that every time you need a snapshot - which in the case of READ COMMITTED is for every SQL statement - you need a network roundtrip to the snapshot server. If multiple backends request a snapshot in very quick succession, you might be able to do a sort of "group commit" thing where you send a single request to the server and they all use the resulting snapshot, but it seems hard to get very far with such optimization. For example, if backend 1 sends a snapshot request and backend 2 then realizes that it also needs a snapshot, it can't just wait for the reply from backend 1 and use that one. The user might have committed a transaction someplace else and then kicked off a transaction on backend 2 afterward, expecting it to see the work committed earlier. But the snapshot returned to backend 1 might have been taken before that. So, all in all, this seems rather crippling.

Things are better if the system has a single coordinator node that is also the arbiter of commits and snapshots. Then, it can always take a snapshot locally with no network roundtrip, and when it reaches out to a shard, it can pass along the snapshot information with the SQL query (or query plan) it has to send anyway. But then the single coordinator seems like it becomes a bottleneck. As soon as you have multiple coordinators, one of them has got to be the arbiter of global ordering, and now all of the other coordinators have to talk to it constantly.

Maybe I'm thinking of this too narrowly by talking about snapshots; perhaps there are other ways of ensuring whatever level of transaction isolation we want to have here. But I'm not sure it matters that much - I don't see any way for the sees-the-effects-of relation on the set of all transactions to be a total ordering without some kind of central arbiter of the commit ordering. Except for perfectly-synchronized timestamps, but I don't think that's really physically possible anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
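For what it's worth, the constraint described above can be captured in a small sketch: backends may share one round trip to the snapshot server only if they asked before that request was sent; anyone who arrives later waits for the next round. Everything here (Snapshot, fetch_global_snapshot(), SnapshotBatcher) is an invented name for illustration, and the stub server call stands in for the real network round trip.

// A sketch of coalesced snapshot requests. Backends that ask while a request
// is already on the wire must wait for the NEXT round trip, so nobody is ever
// handed a snapshot taken before they asked.
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct Snapshot { uint64_t csn; };

// Placeholder for the real network round trip to the snapshot/CSN server.
Snapshot fetch_global_snapshot()
{
    static std::atomic<uint64_t> fake_csn{0};
    return Snapshot{++fake_csn};
}

class SnapshotBatcher
{
    std::mutex mtx;
    std::condition_variable cv;
    bool rpc_in_flight = false;
    uint64_t next_round = 1;        // the round the next request will serve
    uint64_t completed_round = 0;   // the latest round whose reply has arrived
    Snapshot last_snapshot{0};

public:
    Snapshot get_snapshot()
    {
        std::unique_lock<std::mutex> lk(mtx);
        const uint64_t my_round = next_round;   // earliest round not yet sent
        for (;;)
        {
            if (completed_round >= my_round)
                return last_snapshot;           // taken after we asked: safe
            if (!rpc_in_flight)
            {
                // Become the sender for the oldest unsent round; latecomers
                // now belong to a later round and cannot reuse this reply.
                const uint64_t r = next_round++;
                rpc_in_flight = true;
                lk.unlock();
                const Snapshot s = fetch_global_snapshot();
                lk.lock();
                last_snapshot = s;
                completed_round = r;
                rpc_in_flight = false;
                cv.notify_all();
                return s;
            }
            cv.wait(lk, [&] { return completed_round >= my_round || !rpc_in_flight; });
        }
    }
};

Even with this batching, READ COMMITTED still pays roughly one round trip per round of statements, which is why the single-coordinator layout above looks attractive despite its bottleneck.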
On Tue, Nov 17, 2015 at 12:48:38PM -0500, Robert Haas wrote:
> > At this time, the number of round trips needed particularly for READ
> > COMMITTED transactions that need a new snapshot for each query was
> > really a performance killer. We used DBT-1 (TPC-W) which is less
> > OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
> > was quickly reached with 10-20 nodes..
>
> Yeah. I think this merits a good bit of thought. Superficially, at
> least, it seems that every time you need a snapshot - which in the
> case of READ COMMITTED is for every SQL statement - you need a network
> roundtrip to the snapshot server. If multiple backends request a
> snapshot in very quick succession, you might be able to do a sort of
> "group commit" thing where you send a single request to the server and
> they all use the resulting snapshot, but it seems hard to get very far
> with such optimization. For example, if backend 1 sends a snapshot
> request and backend 2 then realizes that it also needs a snapshot, it
> can't just wait for the reply from backend 1 and use that one. The
> user might have committed a transaction someplace else and then kicked
> off a transaction on backend 2 afterward, expecting it to see the work
> committed earlier. But the snapshot returned to backend 1 might have
> been taken before that. So, all in all, this seems rather crippling.
>
> Things are better if the system has a single coordinator node that is
> also the arbiter of commits and snapshots. Then, it can always take a
> snapshot locally with no network roundtrip, and when it reaches out to
> a shard, it can pass along the snapshot information with the SQL query
> (or query plan) it has to send anyway. But then the single
> coordinator seems like it becomes a bottleneck. As soon as you have
> multiple coordinators, one of them has got to be the arbiter of global
> ordering, and now all of the other coordinators have to talk to it
> constantly.

I think the performance benefits of having a single coordinator are going to require us to implement different snapshot/transaction code paths for single coordinator and multi-coordinator cases. :-( That is, people who can get by with only a single coordinator are going to want to do that to avoid the overhead of multiple coordinators, while those using multiple coordinators are going to have to live with that overhead.

I think an open question is what workloads can use a single coordinator? Read-only? Long queries? Are those cases also ones where the snapshot/transaction overhead is negligible, meaning we don't need the single coordinator optimizations?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On Tue, Dec 1, 2015 at 9:19 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Tue, Nov 17, 2015 at 12:48:38PM -0500, Robert Haas wrote:
>> > At this time, the number of round trips needed particularly for READ
>> > COMMITTED transactions that need a new snapshot for each query was
>> > really a performance killer. We used DBT-1 (TPC-W) which is less
>> > OLTP-like than DBT-2 (TPC-C), still with DBT-1 the scalability limit
>> > was quickly reached with 10-20 nodes..
>>
>> Yeah. I think this merits a good bit of thought. Superficially, at
>> least, it seems that every time you need a snapshot - which in the
>> case of READ COMMITTED is for every SQL statement - you need a network
>> roundtrip to the snapshot server. If multiple backends request a
>> snapshot in very quick succession, you might be able to do a sort of
>> "group commit" thing where you send a single request to the server and
>> they all use the resulting snapshot, but it seems hard to get very far
>> with such optimization. For example, if backend 1 sends a snapshot
>> request and backend 2 then realizes that it also needs a snapshot, it
>> can't just wait for the reply from backend 1 and use that one. The
>> user might have committed a transaction someplace else and then kicked
>> off a transaction on backend 2 afterward, expecting it to see the work
>> committed earlier. But the snapshot returned to backend 1 might have
>> been taken before that. So, all in all, this seems rather crippling.
>>
>> Things are better if the system has a single coordinator node that is
>> also the arbiter of commits and snapshots. Then, it can always take a
>> snapshot locally with no network roundtrip, and when it reaches out to
>> a shard, it can pass along the snapshot information with the SQL query
>> (or query plan) it has to send anyway. But then the single
>> coordinator seems like it becomes a bottleneck. As soon as you have
>> multiple coordinators, one of them has got to be the arbiter of global
>> ordering, and now all of the other coordinators have to talk to it
>> constantly.
>
> I think the performance benefits of having a single coordinator are
> going to require us to implement different snapshot/transaction code
> paths for single coordinator and multi-coordinator cases. :-( That is,
> people who can get by with only a single coordinator are going to want
> to do that to avoid the overhead of multiple coordinators, while those
> using multiple coordinators are going to have to live with that
> overhead.
>
> I think an open question is what workloads can use a single coordinator?
> Read-only? Long queries? Are those cases also ones where the
> snapshot/transaction overhead is negligible, meaning we don't need the
> single coordinator optimizations?

Well, the cost of the coordinator is basically a per-snapshot cost, which means in effect a per-transaction cost. So if your typical query runtimes are measured in seconds or minutes, it's probably going to be really hard to swamp the coordinator. If they're measured in milliseconds or microseconds, it's likely to be a huge problem. I think this is why a lot of NoSQL systems have abandoned transactions as a way of doing business. The overhead of isolating transactions is significant even on a single-server system when there are many CPUs and lots of short-running queries (cf. 0e141c0fbb211bdd23783afa731e3eef95c9ad7a and previous efforts in the same area), but in a multi-server environment it just gets crushingly expensive.

I think there are ways to reduce the cost of this. Some distributed systems have solved it by retreating from snapshot isolation and going back to using locks. This can improve scalability if you've got lots of transactions but they're very short and rarely touch the same data. Locking individual data elements (or partitions) doesn't require a central system that knows about all commits - each individual node can just release locks for each transaction as it completes. More generally, if you could avoid having to make a decision about whether transaction A precedes or follows transaction B unless transaction A and B actually touch the same data - an idea we already use for SSI - then you could get much better scalability. But I doubt that can be made to work without a deeper rethink of our transaction system.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
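A rough sketch of the node-local locking alternative mentioned above: each node serializes access to its own items and releases its locks when the local branch of the transaction ends, with no central service tracking commit order. All names here (NodeLockManager, TxnId, ItemId) are invented for illustration, and deadlock detection across nodes, which a real system would need, is omitted.

// Node-local locking sketch: the node serializes access to its own items and
// releases everything a transaction holds when its local branch ends.
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <set>

using TxnId = uint64_t;
using ItemId = uint64_t;

class NodeLockManager
{
    std::mutex mtx;
    std::condition_variable cv;
    std::map<ItemId, TxnId> holder;            // item -> transaction holding it
    std::map<TxnId, std::set<ItemId>> owned;   // transaction -> items it holds

public:
    void lock(TxnId txn, ItemId item)
    {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] {
            auto it = holder.find(item);
            return it == holder.end() || it->second == txn;
        });
        holder[item] = txn;
        owned[txn].insert(item);
    }

    // Called when the local branch commits or aborts: purely local work,
    // no knowledge of global commit ordering required.
    void release_all(TxnId txn)
    {
        std::lock_guard<std::mutex> lk(mtx);
        for (ItemId item : owned[txn])
            holder.erase(item);
        owned.erase(txn);
        cv.notify_all();
    }
};

The price is the blocking behavior Kevin warns about upthread; the win is that nothing global has to be consulted per transaction.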
On 4 December 2015 at 06:41, Robert Haas <robertmhaas@gmail.com> wrote:
I think there are ways to reduce the cost of this. Some distributed
systems have solved it by retreating from snapshot isolation and going
back to using locks. This can improve scalability if you've got lots
of transactions but they're very short and rarely touch the same data.
Locking individual data elements (or partitions) doesn't require a
central system that knows about all commits - each individual node can
just release locks for each transaction as it completes. More
generally, if you could avoid having to make a decision about whether
transaction A precedes or follows transaction B unless transaction A
and B actually touch the same data - an idea we already use for SSI -
then you could get much better scalability. But I doubt that can be
made to work without a deeper rethink of our transaction system.
Something I've seen thrown about is the idea of lazy snapshots. Using SSI-like facilities, you keep track of transaction dependencies and what's changed since the last snapshot. A transaction can use an old snapshot if it doesn't touch anything changed since the snapshot was taken. If it does, you acquire a new snapshot (or switch to a newer existing one) and can swap it in safely, since you know they're the same for all data the xact has touched so far.
I suspect that just replaces one expensive problem with one or more different expensive problems and is something that'd help a few workloads but hurt others. It's come up when people have asked why we take a new snapshot for every transaction but I haven't seen it explored much.
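For reference, a bare-bones sketch of the lazy-snapshot check described above, with invented helpers (last_modified_csn(), current_global_snapshot()) standing in for whatever bookkeeping and snapshot acquisition a real implementation would use; the hard part, deciding whether the swap is actually safe for data already touched, is exactly where the extra cost shows up.

// Bare-bones sketch of the lazy-snapshot check. All helpers are placeholders.
#include <cstdint>
#include <set>

struct Snapshot { uint64_t csn; };

uint64_t last_modified_csn(uint64_t /*item*/) { return 0; }    // placeholder bookkeeping
Snapshot current_global_snapshot() { return Snapshot{1}; }     // placeholder round trip

class LazySnapshotTxn
{
    Snapshot snap;                 // possibly stale cached snapshot
    std::set<uint64_t> touched;    // items this transaction has read or written

public:
    explicit LazySnapshotTxn(Snapshot cached) : snap(cached) {}

    // Call before touching an item. Returns false if the old snapshot cannot
    // be upgraded safely, in which case the caller has to fall back (retry,
    // take a fresh snapshot up front, or abort).
    bool touch(uint64_t item)
    {
        if (last_modified_csn(item) > snap.csn)
        {
            const Snapshot newer = current_global_snapshot();
            // Swapping is only safe if nothing we already touched changed
            // after the old snapshot; otherwise the two snapshots are not
            // interchangeable for this transaction.
            for (uint64_t t : touched)
                if (last_modified_csn(t) > snap.csn)
                    return false;
            snap = newer;
            // A real implementation would re-check 'item' against 'newer'
            // and possibly iterate; omitted here.
        }
        touched.insert(item);
        return true;
    }

    uint64_t snapshot_csn() const { return snap.csn; }
};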