Re: eXtensible Transaction Manager API - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: eXtensible Transaction Manager API
Date
Msg-id 56405B9F.6030406@postgrespro.ru
In response to Re: eXtensible Transaction Manager API  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: eXtensible Transaction Manager API  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers


On 09.11.2015 07:46, Amit Kapila wrote:
On Sat, Nov 7, 2015 at 10:23 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Lock manager is one of the tasks we are currently working on.
There are still a lot of open questions:
1. Should the distributed lock manager (DLM) do anything besides detecting distributed deadlocks?

I think so.  Basically the DLM should be responsible for maintaining
all the lock information, which in turn means that any backend process
that needs to acquire/release a lock needs to interact with the DLM; without that
I don't think even global deadlock detection can work (to detect deadlocks
among all the nodes, it needs to know the lock info of all nodes).

I hope that it will not be needed, otherwise it will add a significant performance penalty.
Unless I missed something, locks can still be managed locally, but we need the DLM to detect global deadlocks.
Deadlock detection in PostgreSQL is performed only after the lock-wait timeout expires. Only then is the lock graph analyzed for the presence of loops. At this moment we can send our local lock graph to the DLM. If it is really a distributed deadlock, then sooner or later
all nodes involved in this deadlock will send their local graphs to the DLM, and the DLM will be able to build the global graph and detect the distributed deadlock. In this scenario the DLM will be accessed very rarely - only when a lock can not be granted within deadlock_timeout.

One of the possible problems is false global deadlock detection: because the local lock graphs correspond to different moments in time, we may find a "false" loop. I am still thinking about the best solution to this problem.
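The detection step described above can be sketched as follows. This is a toy model, not the actual dtmd code: the graph representation and the transaction names (T1, T2) are made up for illustration. Each node reports its local wait-for graph only after deadlock_timeout; the DLM merges the reports and looks for a cycle that no single node could see by itself.

```python
# Toy sketch of global deadlock detection at the DLM: each node sends its
# local wait-for graph (edges "transaction A waits for transaction B"),
# the DLM takes the union of the edges and searches it for a cycle.

from collections import defaultdict

def merge_graphs(local_graphs):
    """Union of the wait-for edges reported by all nodes."""
    merged = defaultdict(set)
    for graph in local_graphs:
        for waiter, holders in graph.items():
            merged[waiter] |= set(holders)
    return merged

def find_cycle(graph):
    """Depth-first search; returns one cycle as a list of vertices, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def dfs(v):
        color[v] = GRAY
        stack.append(v)
        for w in graph.get(v, ()):
            if color[w] == GRAY:               # back edge -> cycle found
                return stack[stack.index(w):] + [w]
            if color[w] == WHITE:
                cycle = dfs(w)
                if cycle:
                    return cycle
        stack.pop()
        color[v] = BLACK
        return None

    for v in list(graph):
        if color[v] == WHITE:
            cycle = dfs(v)
            if cycle:
                return cycle
    return None

# Node 1 reports: T1 waits for T2; node 2 reports: T2 waits for T1.
# Neither node sees a loop locally, but the merged graph has one.
local = [{"T1": ["T2"]}, {"T2": ["T1"]}]
print(find_cycle(merge_graphs(local)))   # prints ['T1', 'T2', 'T1']
```

The false-positive problem mentioned above shows up here directly: since the reported graphs are snapshots taken at different moments, an edge may already be gone by the time the DLM merges it, so a cycle in the merged graph needs re-validation before killing a transaction.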


This is somewhat in line with what we currently do during lock conflicts,
i.e. if the backend incurs a lock conflict and decides to sleep, before
sleeping it tries to detect an early deadlock, so if we make the DLM responsible
for managing locks the same could be achieved even in a distributed system.
 
2. Should the DLM be part of the XTM API or a separate API?

We might be able to do it either way (make it part of the XTM API or devise a
separate API). I think the more important point here is to first get the high-level
design for Distributed Transactions (which primarily includes consistent
Commit/Rollback, snapshots, the distributed lock manager (DLM), and recovery).

 
3. Should the DLM be implemented as a separate process, or should it be part of the arbiter (dtmd)?

That's an important decision. I think it will depend on which kind of design
we choose for the distributed transaction manager (an arbiter-based solution or
a non-arbiter-based solution, something like tsDTM).  I think the DLM should be
separate, else the arbiter will become a hot spot with respect to contention.

There are pros and cons.
Pros for integrating the DLM in the DTM:
1. The DTM arbiter has information about the local-to-global transaction ID mapping, which may be needed by the DLM.
2. If my assumptions about the DLM are correct, then it will be accessed relatively rarely and should not have a significant
impact on performance.

Cons:
1. tsDTM doesn't need a centralized arbiter but still needs a DLM.
2. Logically the DLM and DTM are independent components.



Can you please explain more about the tsDTM approach: how timestamps
are used, what exactly CSN is (is it Commit Sequence Number), and
how it is used in the prepare phase?  Is CSN used as a timestamp?
Is the coordinator also one of the PostgreSQL instances?

In the tsDTM approach, system time (in microseconds) is used as the CSN (commit sequence number).
We also enforce that assigned CSNs are unique and monotonic within a PostgreSQL instance.
CSNs are assigned locally and do not require interaction with other cluster nodes.
This is why, in theory, the tsDTM approach should provide good scalability.
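The local CSN assignment described above can be sketched like this. It is a minimal single-threaded model, not the tsDTM implementation: the class name and the injectable clock are made up for illustration, and a real backend would take a lock (and handle clock granularity coarser than a microsecond) around the update.

```python
# Sketch of per-instance CSN assignment in the tsDTM style: the CSN is the
# current system time in microseconds, bumped forward whenever the clock has
# not advanced (or has stepped backwards), so that CSNs stay unique and
# monotonic locally. No communication with other nodes is required.

import time

class CsnGenerator:
    def __init__(self, clock_us=lambda: int(time.time() * 1_000_000)):
        self._clock_us = clock_us   # injectable clock, for testing
        self._last = 0              # last CSN handed out

    def next_csn(self):
        now = self._clock_us()
        # If the clock stalled or went backwards, advance past the last CSN.
        self._last = max(now, self._last + 1)
        return self._last

gen = CsnGenerator()
csns = [gen.next_csn() for _ in range(3)]
assert csns == sorted(set(csns))    # unique and strictly increasing
```

Note that this only guarantees monotonicity within one instance; ordering CSNs across nodes still depends on how well the nodes' clocks are synchronized, which is where the scalability-versus-precision trade-off of a timestamp-based scheme lives.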


I think in this patch it is important to see the completeness of all the
APIs that need to be exposed for the implementation of distributed
transactions, and that is difficult to visualize without having a complete
picture of all the components that have some interaction with the distributed
transaction system.  On the other hand, we can do it in an incremental fashion
as and when more parts of the design become clear.

That is exactly what we are going to do - we are trying to integrate DTM with existing systems (pg_shard, postgres_fdw, BDR) and find out what is missing and should be added. In parallel we are trying to compare the efficiency and scalability of different solutions.

One thing I have noticed is that in the DTM approach you seem to
be considering a centralized (arbiter) transaction manager and a
centralized lock manager, which seems to be a workable approach,
but I think such an approach won't scale in write-heavy or mixed
read-write workloads.  Have you thought about distributing the responsibility
of global transaction and lock management?

From my point of view there are two different scenarios:
1. When most transactions are local and only a few of them are global (for example, most operations in a branch of a bank are
performed with accounts of clients of this branch, but there are a few transactions involving accounts from different branches).
2. When most or all transactions are global.

It seems to me that the first scenario is more common in real life, and actually good performance of a distributed system can be achieved only when most transactions are local (involve only one node). There are several approaches that allow optimizing local transactions, for example the one used in SAP HANA (http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf).
We also have a DTM implementation based on this approach, but it is not yet working.

If most transactions are global, they affect random subsets of cluster nodes (so it is not possible to logically split the cluster into groups of tightly coupled nodes), and if the number of nodes is not very large (<10), then I do not think there can be a better alternative (from a performance point of view) than a centralized arbiter.

But these are only my speculations, and it would be really very interesting for me to learn the access patterns of real customers using distributed systems.

I think such a system
might be somewhat difficult to design, but the scalability will be better.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
