Thread: Transactions involving multiple postgres foreign servers
Hi All,
While looking at the patch for supporting inheritance on foreign tables, I noticed that if a transaction makes changes to more than two foreign servers, the current implementation in postgres_fdw doesn't make sure that either all of them rollback or all of them commit their changes; IOW there is a possibility that some of them commit their changes while others rollback theirs.

PFA patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at the XACT_EVENT_PRE_COMMIT event, it prepares the transaction at all the foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT event it commits or aborts those transactions resp.
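In concrete terms, the per-foreign-server command sequence looks roughly like the following minimal libpq sketch (the connection string, table, and GID format here are illustrative assumptions, not taken from the patch, and the remote server must have max_prepared_transactions > 0):

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        exit(1);
    }
    PQclear(res);
}

int
main(void)
{
    PGconn *conn = PQconnectdb("host=server1 dbname=test"); /* assumed DSN */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    run(conn, "BEGIN");
    run(conn, "UPDATE t SET v = v + 1");    /* assumed remote work */
    /* Phase 1: at XACT_EVENT_PRE_COMMIT, prepare instead of committing. */
    run(conn, "PREPARE TRANSACTION 'pgfdw_12345'");
    /* Phase 2: at XACT_EVENT_COMMIT (after the local commit), resolve it;
     * on abort the command would be ROLLBACK PREPARED instead. */
    run(conn, "COMMIT PREPARED 'pgfdw_12345'");
    PQfinish(conn);
    return 0;
}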
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> While looking at the patch for supporting inheritance on foreign tables, I
> noticed that if a transaction makes changes to more than two foreign
> servers the current implementation in postgres_fdw doesn't make sure that
> either all of them rollback or all of them commit their changes, IOW there
> is a possibility that some of them commit their changes while others
> rollback theirs.
> PFA patch which uses 2PC to solve this problem. In pgfdw_xact_callback() at
> XACT_EVENT_PRE_COMMIT event, it prepares the transaction at all the
> foreign PostgreSQL servers and at XACT_EVENT_COMMIT or XACT_EVENT_ABORT
> event it commits or aborts those transactions resp.

TBH, I think this is a pretty awful idea. In the first place, this does
little to improve the actual reliability of a commit occurring across
multiple foreign servers; and in the second place it creates a bunch of
brand new failure modes, many of which would require manual DBA cleanup.

The core of the problem is that this doesn't have anything to do with 2PC
as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states). This patch
is not that; in particular it treats the local transaction asymmetrically
from the remote ones, which doesn't seem like a great idea --- ie, the
local transaction could still abort after committing all the remote ones,
leaving you no better off in terms of cross-server consistency.

As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0 is the
default in all supported PG versions). So this would absolutely have to
be a not-on-by-default option. But the bigger issue is that leaving it to
the DBA to clean up after failures is not a production grade solution,
*especially* not for prepared transactions, which are performance killers
if not closed out promptly. So I can't imagine anyone wanting to turn
this on without a more robust answer than that.

Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction). I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.

It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

regards, tom lane
On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In short, you can't force 2PC technology on people who aren't using it
> already; while for those who are using it already, this isn't nearly
> good enough as-is.

I was involved in some internal discussions related to this patch, so I
have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just as
atomic, consistent, isolated, and durable as they would be with just one
server. As far as I know, there is no way to achieve this goal through
the use of an external transaction manager, because even if that external
transaction manager guarantees, for every transaction, that the
transaction either commits on all nodes or rolls back on all nodes,
there's no way for it to guarantee that other transactions won't see some
intermediate state where the commit has been completed on some nodes but
not others. To get that, you need some sort of integration that reaches
down to the way snapshots are taken.

I think, though, that it might be worthwhile to first solve the simpler
problem of figuring out how to ensure that a transaction commits
everywhere or rolls back everywhere, even if intermediate states might
still be transiently visible. I don't think this patch, as currently
designed, is equal to that challenge, because XACT_EVENT_PRE_COMMIT fires
before the transaction is certain to commit -
PreCommit_CheckForSerializationFailure or PreCommit_Notify could still
error out. We could have a hook that fires after that, but that doesn't
solve the problem if a user of that hook can itself throw an error. Even
if part of the API contract is that it's not allowed to do so, the actual
attempt to commit the change on the remote side can fail due to - e.g. -
a network interruption, and that's got to be dealt with somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I was involved in some internal discussions related to this patch, so
> I have some opinions on it. The long-term, high-level goal here is to
> facilitate sharding. If we've got a bunch of PostgreSQL servers
> interlinked via postgres_fdw, it should be possible to perform
> transactions on the cluster in such a way that transactions are just
> as atomic, consistent, isolated, and durable as they would be with
> just one server. As far as I know, there is no way to achieve this
> goal through the use of an external transaction manager, because even
> if that external transaction manager guarantees, for every
> transaction, that the transaction either commits on all nodes or rolls
> back on all nodes, there's no way for it to guarantee that other
> transactions won't see some intermediate state where the commit has
> been completed on some nodes but not others. To get that, you need
> some sort of integration that reaches down to the way snapshots are taken.

That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there. Certainly the proposed patch doesn't
look like it moves us very far towards that set of goalposts.

> I think, though, that it might be worthwhile to first solve the
> simpler problem of figuring out how to ensure that a transaction
> commits everywhere or rolls back everywhere, even if intermediate
> states might still be transiently visible.

Perhaps. I suspect that it might still be a dead end if the ultimate
goal is cross-system atomic commit ... but likely it would teach us some
useful things anyway.

regards, tom lane
On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's a laudable goal, but I would bet that nothing built on the FDW
> infrastructure will ever get there.

Why?

It would be surprising to me if, given that we have gone to some pains to
create a system that allows cross-system queries, and hopefully
eventually pushdown of quals, joins, and aggregates, we then made
sharding work in some completely different way that reuses none of that
infrastructure. But maybe I am looking at this the wrong way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That's a laudable goal, but I would bet that nothing built on the FDW
>> infrastructure will ever get there.

> Why?

> It would be surprising to me if, given that we have gone to some pains
> to create a system that allows cross-system queries, and hopefully
> eventually pushdown of quals, joins, and aggregates, we then made
> sharding work in some completely different way that reuses none of
> that infrastructure. But maybe I am looking at this the wrong way.

Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.

I think the idea I sketched upthread of supporting an external
transaction manager might be worth pursuing, in that it would potentially
lead to having at least an approximation of atomic commit across
heterogeneous servers.

Independently of that, I think what you are talking about would be better
addressed outside the constraints of the FDW mechanism. That's not to
say that we couldn't possibly make postgres_fdw use some additional
non-FDW infrastructure to manage commits; just that solving this in terms
of the FDW infrastructure seems wrongheaded to me.

regards, tom lane
On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, we intentionally didn't couple the FDW stuff closely into
> transaction commit, because of the thought that the "far end" would not
> necessarily have Postgres-like transactional behavior, and even if it did
> there would be about zero chance of having atomic commit with a
> non-Postgres remote server. postgres_fdw is a seriously bad starting
> point as far as that goes, because it encourages one to make assumptions
> that can't possibly work for any other wrapper.

Atomic commit is something that can potentially be supported by many
different FDWs, as long as the thing on the other end supports 2PC. If
you're talking to Oracle or DB2 or SQL Server, and it supports 2PC, then
you can PREPARE the transaction and then go back and COMMIT the
transaction once it's committed locally. Getting a cluster-wide
*snapshot* is probably a PostgreSQL-only thing requiring much deeper
integration, but I think it would be sensible to leave that as a future
project and solve the simpler problem first.

> I think the idea I sketched upthread of supporting an external transaction
> manager might be worth pursuing, in that it would potentially lead to
> having at least an approximation of atomic commit across heterogeneous
> servers.

An important threshold question here is whether we want to rely on an
external transaction manager, or build one into PostgreSQL. As far as
this particular project goes, there's nothing that can't be done inside
PostgreSQL. You need a durable registry of which transactions you
prepared on which servers, and which XIDs they correlate to. If you have
that, then you can use background workers or similar to go retry commits
or rollbacks of prepared transactions until it works, even if there's
been a local crash meanwhile.

Alternatively, you could rely on an external transaction manager to do
all that stuff. I don't have a clear sense of what that would entail, or
how it might be better or worse than rolling our own. I suspect, though,
that it might amount to little more than adding a middle man. I mean, a
third-party transaction manager isn't going to automatically know how to
commit a transaction prepared on some foreign server using some foreign
data wrapper. It's going to have to be taught that if postgres_fdw
leaves a transaction in-medias-res on server OID 1234, you've got to
connect to the target machine using that foreign server's connection
parameters, speak libpq, and issue the appropriate COMMIT TRANSACTION
command. And similarly, you're going to need to arrange to notify it
before preparing that transaction so that it knows that it needs to
request the COMMIT or ABORT later on. Once you've got all of that
infrastructure in place, what are you really gaining over just doing it
in PostgreSQL (or, say, a contrib module thereto)?

(I'm also concerned that an external transaction manager might need the
PostgreSQL client to be aware of it, whereas what we'd really like here
is for the client to just speak PostgreSQL and be happy that its commits
no longer end up half-done.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
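To make the "durable registry plus retry" idea above concrete, here is a
minimal libpq sketch of the resolver loop such a background worker might
run. The registry_entry type and its field names are assumptions for
illustration, not anything from the patch; a real resolver would also
have to treat "prepared transaction does not exist" as already resolved.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

typedef struct registry_entry
{
    const char *conninfo;   /* foreign server's connection parameters */
    const char *gid;        /* GID used in PREPARE TRANSACTION */
    bool        commit;     /* true: COMMIT PREPARED; false: ROLLBACK PREPARED */
} registry_entry;

static bool
resolve_once(const registry_entry *e)
{
    char        sql[256];
    PGconn     *conn = PQconnectdb(e->conninfo);
    PGresult   *res;
    bool        ok;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        PQfinish(conn);
        return false;           /* server unreachable; retry later */
    }
    snprintf(sql, sizeof(sql), "%s PREPARED '%s'",
             e->commit ? "COMMIT" : "ROLLBACK", e->gid);
    res = PQexec(conn, sql);
    ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
    PQclear(res);
    PQfinish(conn);
    return ok;
}

static void
resolve_until_done(const registry_entry *e)
{
    while (!resolve_once(e))
        sleep(5);               /* back off and retry indefinitely */
}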
On Mon, Jan 5, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In short, you can't force 2PC technology on people who aren't using it
>> already; while for those who are using it already, this isn't nearly
>> good enough as-is.
> I was involved in some internal discussions related to this patch, so
> I have some opinions on it. The long-term, high-level goal here is to
> facilitate sharding. If we've got a bunch of PostgreSQL servers
> interlinked via postgres_fdw, it should be possible to perform
> transactions on the cluster in such a way that transactions are just
> as atomic, consistent, isolated, and durable as they would be with
> just one server. As far as I know, there is no way to achieve this
> goal through the use of an external transaction manager, because even
> if that external transaction manager guarantees, for every
> transaction, that the transaction either commits on all nodes or rolls
> back on all nodes, there's no way for it to guarantee that other
> transactions won't see some intermediate state where the commit has
> been completed on some nodes but not others. To get that, you need
> some sort of integration that reaches down to the way snapshots are taken.
> I think, though, that it might be worthwhile to first solve the
> simpler problem of figuring out how to ensure that a transaction
> commits everywhere or rolls back everywhere, even if intermediate
> states might still be transiently visible.
Agreed.
> I don't think this patch,
> as currently designed, is equal to that challenge, because
> XACT_EVENT_PRE_COMMIT fires before the transaction is certain to
> commit - PreCommit_CheckForSerializationFailure or PreCommit_Notify
> could still error out. We could have a hook that fires after that,
> but that doesn't solve the problem if a user of that hook can itself
> throw an error. Even if part of the API contract is that it's not
> allowed to do so, the actual attempt to commit the change on the
> remote side can fail due to - e.g. - a network interruption, and
> that's got to be dealt with somehow.
Tom mentioned
--
in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.
--
You have given a specific example of this case, so let me do a dry run through CommitTransaction() after applying my patch.
1899 CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
While processing this event in postgres_fdw's callback pgfdw_xact_callback(), it sends a PREPARE TRANSACTION to all the foreign servers involved. These servers return their success or failure. If even one of them fails, the local transaction is aborted along with all the prepared transactions. Only if all the foreign servers succeed do we proceed further.
1925 PreCommit_CheckForSerializationFailure();
1926
1932 PreCommit_Notify();
1933
If any of these functions (as you mentioned above) throws an error, the local transaction will be aborted, as will all the remote prepared transactions. Note that we haven't yet committed the local transaction (which will be done below), nor the remote transactions, which are still in PREPAREd state there. Since all the transactions, local as well as remote, are aborted in case of error, the data is still consistent. If these steps succeed, we proceed ahead.
1934 /* Prevent cancel/die interrupt while cleaning up */
1935 HOLD_INTERRUPTS();
1936
1937 /* Commit updates to the relation map --- do this as late as possible */
1938 AtEOXact_RelationMap(true);
1939
1940 /*
1941 * set the current transaction state information appropriately during
1942 * commit processing
1943 */
1944 s->state = TRANS_COMMIT;
1945
1946 /*
1947 * Here is where we really truly commit.
1948 */
1949 latestXid = RecordTransactionCommit();
1950
1951 TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);
1952
1953 /*
1954 * Let others know about no transaction in progress by me. Note that this
1955 * must be done _before_ releasing locks we hold and _after_
1956 * RecordTransactionCommit.
1957 */
1958 ProcArrayEndTransaction(MyProc, latestXid);
1959
The local transaction is committed; the remote transactions are still in PREPAREd state. If any server (including the local one) crashes or a link failure happens here, we leave the remote transactions dangling in PREPAREd state, and manual cleanup will be required.
1975
1976 CallXactCallbacks(XACT_EVENT_COMMIT);
The postgres_fdw callback pgfdw_xact_callback() then commits the PREPAREd transactions by sending COMMIT PREPARED to each remote server (my patch). So I don't see why my patch would cause inconsistencies. It can cause dangling PREPAREd transactions, and I have already acknowledged that fact.
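Schematically, the callback flow just described looks like this (a reconstruction for illustration only, not the patch text; the foreach_connection/server_* helpers are invented names, though the callback signature matches postgres_fdw's pgfdw_xact_callback):

static void
pgfdw_xact_callback(XactEvent event, void *arg)
{
    switch (event)
    {
        case XACT_EVENT_PRE_COMMIT:
            /* Phase 1: PREPARE TRANSACTION on every foreign server that
             * was modified.  Any failure raises an error, aborting the
             * local transaction and rolling back all remote prepared
             * transactions. */
            foreach_connection(server_prepare_remote_xact);
            break;
        case XACT_EVENT_COMMIT:
            /* The local commit is durable by now; phase 2: COMMIT
             * PREPARED on each server.  A crash or link failure here
             * leaves dangling prepared transactions needing manual
             * cleanup, as acknowledged above. */
            foreach_connection(server_commit_prepared);
            break;
        case XACT_EVENT_ABORT:
            foreach_connection(server_rollback_prepared);
            break;
        default:
            break;
    }
}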
Am I missing something?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Sat, Jan 3, 2015 at 2:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> While looking at the patch for supporting inheritance on foreign tables, I
>> noticed that if a transaction makes changes to more than two foreign
>> servers the current implementation in postgres_fdw doesn't make sure that
>> either all of them rollback or all of them commit their changes, IOW there
>> is a possibility that some of them commit their changes while others
>> rollback theirs.
>> PFA patch which uses 2PC to solve this problem. In pgfdw_xact_callback() at
>> XACT_EVENT_PRE_COMMIT event, it prepares the transaction at all the
>> foreign PostgreSQL servers and at XACT_EVENT_COMMIT or XACT_EVENT_ABORT
>> event it commits or aborts those transactions resp.
> TBH, I think this is a pretty awful idea.
> In the first place, this does little to improve the actual reliability
> of a commit occurring across multiple foreign servers; and in the second
> place it creates a bunch of brand new failure modes, many of which would
> require manual DBA cleanup.
> The core of the problem is that this doesn't have anything to do with
> 2PC as it's commonly understood: for that, you need a genuine external
> transaction manager that is aware of all the servers involved in a
> transaction, and has its own persistent state (or at least a way to
> reconstruct its own state by examining the per-server states).
> This patch is not that; in particular it treats the local transaction
> asymmetrically from the remote ones, which doesn't seem like a great
> idea --- ie, the local transaction could still abort after committing
> all the remote ones, leaving you no better off in terms of cross-server
> consistency.
> As far as failure modes go, one basic reason why this cannot work as
> presented is that the remote servers may not even have prepared
> transaction support enabled (in fact max_prepared_transactions = 0
> is the default in all supported PG versions). So this would absolutely
> have to be a not-on-by-default option.
Agreed. We can have a per-foreign-server option, which says whether the corresponding server can participate in 2PC. A transaction spanning multiple foreign servers, with at least one of them not capable of participating in 2PC, will need to be aborted.
> But the bigger issue is that
> leaving it to the DBA to clean up after failures is not a production
> grade solution, *especially* not for prepared transactions, which are
> performance killers if not closed out promptly. So I can't imagine
> anyone wanting to turn this on without a more robust answer than that.
I purposefully left that outside this patch, since it involves significant changes in core. If that's necessary for the first cut, I will work on it.
> Basically I think what you'd need for this to be a credible patch would be
> for it to work by changing the behavior only in the PREPARE TRANSACTION
> path: rather than punting as we do now, prepare the remote transactions,
> and report their server identities and gids to an external transaction
> manager, which would then be responsible for issuing the actual commits
> (along with the actual commit of the local transaction). I have no idea
> whether it's feasible to do that without having to assume a particular
> 2PC transaction manager API/implementation.
I doubt a TM would expect a bunch of GIDs in response to a PREPARE TRANSACTION command. Per X/Open, xa_prepare() expects an integer return value specifying whether the PREPARE succeeded or not, plus some piggybacked statuses.
In the context of foreign tables under an inheritance tree, a single DML statement can span multiple foreign servers. All such DMLs would then need to be handled by an external TM. An external TM or application may not have an exact idea of which foreign servers are going to be affected by a DML. Users may not want to set up an external TM in such cases. Instead, they would expect PostgreSQL to manage such DMLs and transactions all by itself.
As Robert has suggested in his responses, it would be better to enable PostgreSQL to manage distributed transactions itself.
> It'd be interesting to hear from people who are using 2PC in production
> to find out if this would solve any real-world problems for them, and
> what the details of the TM interface would need to look like to make it
> work in practice.
> In short, you can't force 2PC technology on people who aren't using it
> already; while for those who are using it already, this isn't nearly
> good enough as-is.
>
> regards, tom lane
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jan 6, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Well, we intentionally didn't couple the FDW stuff closely into
>> transaction commit, because of the thought that the "far end" would not
>> necessarily have Postgres-like transactional behavior, and even if it did
>> there would be about zero chance of having atomic commit with a
>> non-Postgres remote server. postgres_fdw is a seriously bad starting
>> point as far as that goes, because it encourages one to make assumptions
>> that can't possibly work for any other wrapper.
> Atomic commit is something that can potentially be supported by many
> different FDWs, as long as the thing on the other end supports 2PC.
> If you're talking to Oracle or DB2 or SQL Server, and it supports 2PC,
> then you can PREPARE the transaction and then go back and COMMIT the
> transaction once it's committed locally. Getting a cluster-wide
> *snapshot* is probably a PostgreSQL-only thing requiring much deeper
> integration, but I think it would be sensible to leave that as a
> future project and solve the simpler problem first.
>
>> I think the idea I sketched upthread of supporting an external transaction
>> manager might be worth pursuing, in that it would potentially lead to
>> having at least an approximation of atomic commit across heterogeneous
>> servers.
>
> An important threshold question here is whether we want to rely on an
> external transaction manager, or build one into PostgreSQL. As far as
> this particular project goes, there's nothing that can't be done
> inside PostgreSQL. You need a durable registry of which transactions
> you prepared on which servers, and which XIDs they correlate to. If
> you have that, then you can use background workers or similar to go
> retry commits or rollbacks of prepared transactions until it works,
> even if there's been a local crash meanwhile.
> Alternatively, you could rely on an external transaction manager to do
> all that stuff. I don't have a clear sense of what that would entail,
> or how it might be better or worse than rolling our own. I suspect,
> though, that it might amount to little more than adding a middle man.
> I mean, a third-party transaction manager isn't going to automatically
> know how to commit a transaction prepared on some foreign server using
> some foreign data wrapper. It's going to have to be taught that if
> postgres_fdw leaves a transaction in-medias-res on server OID 1234,
> you've got to connect to the target machine using that foreign
> server's connection parameters, speak libpq, and issue the appropriate
> COMMIT TRANSACTION command. And similarly, you're going to need to
> arrange to notify it before preparing that transaction so that it
> knows that it needs to request the COMMIT or ABORT later on. Once
> you've got all of that infrastructure in place, what are you
> really gaining over just doing it in PostgreSQL (or, say, a contrib
> module thereto)?
Thanks, Robert, for giving a high-level view of the system needed for PostgreSQL to be a transaction manager by itself. Agreed completely.
> (I'm also concerned that an external transaction manager might need
> the PostgreSQL client to be aware of it, whereas what we'd really like
> here is for the client to just speak PostgreSQL and be happy that its
> commits no longer end up half-done.)
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> I don't see why my patch would cause inconsistencies. It can
> cause dangling PREPAREd transactions and I have already
> acknowledged that fact.
>
> Am I missing something?

To me that is the big problem. Where I have run into ad hoc distributed
transaction managers it has usually been because a crash left prepared
transactions dangling, without cleaning them up when the transaction
manager was restarted. This tends to wreak havoc one way or another.

If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the transaction ID
and where it is used in a way that will survive a subsequent crash before
beginning the PREPARE on any of the systems. After all nodes are
PREPAREd it must flag that persisted data to indicate that it is now at a
point where ROLLBACK is no longer an option. Only then can it start
committing the prepared transactions. After the last node is committed
it can clear this information. On start-up the distributed transaction
manager must check for any distributed transactions left "in progress"
and commit or rollback based on the preceding; doing retries indefinitely
until it succeeds or is told to stop.

Doing this incompletely (i.e., not identifying and correctly handling the
various failure modes) is IMO far worse than not attempting it. If we
could build in something that did this completely and well, that would be
a cool selling point; but let's not gloss over the difficulties. We must
recognize how big a problem it would be to include a low-quality
implementation.

Also, as previously mentioned, it must behave in some reasonable way if a
database is not configured to support 2PC, especially since 2PC is off by
default in PostgreSQL.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
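The persist/flag/commit sequence described above amounts to a small
durable state machine; a hedged C sketch follows (all names are
illustrative assumptions, and the record would have to be fsync'd on
every state transition):

typedef enum DtxState
{
    DTX_INTENT_LOGGED,   /* participant list persisted; nothing prepared */
    DTX_PREPARING,       /* PREPARE TRANSACTION in flight on the nodes */
    DTX_DECIDED_COMMIT,  /* all nodes prepared; ROLLBACK no longer an option */
    DTX_COMMITTING,      /* COMMIT PREPARED being issued and retried */
    DTX_DONE             /* every node committed; record may be cleared */
} DtxState;

typedef struct DtxRecord
{
    char        gid[200];       /* global transaction identifier */
    DtxState    state;          /* current position in the sequence */
    int         nparticipants;  /* how many per-node entries follow */
    /* per-participant connection info and a resolved flag would follow;
     * on start-up, anything not DTX_DONE is retried: roll back if the
     * state never reached DTX_DECIDED_COMMIT, commit otherwise. */
} DtxRecord;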
On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> I don't see why my patch would cause inconsistencies. It can
>> cause dangling PREPAREd transactions and I have already
>> acknowledged that fact.
>>
>> Am I missing something?
> To me that is the big problem. Where I have run into ad hoc
> distributed transaction managers it has usually been because a
> crash left prepared transactions dangling, without cleaning them up
> when the transaction manager was restarted. This tends to wreak
> havoc one way or another.
>
> If we are going to include a distributed transaction manager with
> PostgreSQL, it *must* persist enough information about the
> transaction ID and where it is used in a way that will survive a
> subsequent crash before beginning the PREPARE on any of the
> systems.
Thanks a lot. I hadn't thought of this.
> After all nodes are PREPAREd it must flag that persisted
> data to indicate that it is now at a point where ROLLBACK is no
> longer an option. Only then can it start committing the prepared
> transactions. After the last node is committed it can clear this
> information. On start-up the distributed transaction manager must
> check for any distributed transactions left "in progress" and
> commit or rollback based on the preceding; doing retries
> indefinitely until it succeeds or is told to stop.
Agreed.
> Doing this incompletely (i.e., not identifying and correctly
> handling the various failure modes) is IMO far worse than not
> attempting it. If we could build in something that did this
> completely and well, that would be a cool selling point; but let's
> not gloss over the difficulties. We must recognize how big a
> problem it would be to include a low-quality implementation.
>
> Also, as previously mentioned, it must behave in some reasonable
> way if a database is not configured to support 2PC, especially
> since 2PC is off by default in PostgreSQL.
I described one possibility in my reply to Tom's mail. Let me repeat it here.
We can have a per-foreign-server option, which says whether the corresponding server is able to participate in 2PC. A transaction spanning multiple foreign servers, with at least one of them not capable of participating in 2PC, will be aborted.
Will that work?
In case a user flags a foreign server as capable of 2PC incorrectly, I expect the corresponding FDW would raise an error (either because PREPARE fails or because the FDW doesn't handle that case) and the transaction will be aborted anyway.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Also, as previously mentioned, it must behave in some reasonable
>> way if a database is not configured to support 2PC, especially
>> since 2PC is off by default in PostgreSQL.

> We can have a per-foreign-server option, which says whether the
> corresponding server is able to participate in 2PC. A transaction
> spanning multiple foreign servers with at least one of them not
> capable of participating in 2PC will be aborted.
>
> Will that work?
>
> In case a user flags a foreign server as capable of 2PC
> incorrectly, I expect the corresponding FDW would raise an error
> (either because PREPARE fails or the FDW doesn't handle that case)
> and the transaction will be aborted anyway.

That sounds like one way to handle it. I'm not clear on how you plan to
determine whether 2PC is required for a transaction. (Apologies if it
was previously mentioned and I've forgotten it.)

I don't mean to suggest that these problems are insurmountable; I just
think that people often underestimate the difficulty of writing a
distributed transaction manager and don't always recognize the problems
that it will cause if all of the failure modes are not considered and
handled.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> If we are going to include a distributed transaction manager with
> PostgreSQL, it *must* persist enough information about the
> transaction ID and where it is used in a way that will survive a
> subsequent crash before beginning the PREPARE on any of the
> systems. After all nodes are PREPAREd it must flag that persisted
> data to indicate that it is now at a point where ROLLBACK is no
> longer an option. Only then can it start committing the prepared
> transactions. After the last node is committed it can clear this
> information. On start-up the distributed transaction manager must
> check for any distributed transactions left "in progress" and
> commit or rollback based on the preceding; doing retries
> indefinitely until it succeeds or is told to stop.

I think one key question here is whether all of this should be handled in
PostgreSQL core or whether some of it should be handled in other ways.
Is the goal to make postgres_fdw (and FDWs for other databases that
support 2PC) persist enough information that someone *could* write a
transaction manager for PostgreSQL, or is the goal to actually write that
transaction manager?

Just figuring out how to persist the necessary information is a
non-trivial problem by itself. You might think that you could just
insert a row into a local table saying, hey, I'm about to prepare a
transaction remotely, but of course that doesn't work: if you then go on
to PREPARE before writing and flushing the local commit record, then a
crash before that's done leaves a dangling prepared transaction on the
remote node. You might think to write the record, and then do the
PREPARE only after writing and flushing the local commit record. But you
can't do that either, because now if the PREPARE fails you've already
committed locally.

I guess what you need to do is something like:

1. Write and flush a WAL record indicating an intent to prepare, with a
list of foreign server OIDs and GUIDs.
2. Prepare the remote transaction on each node. If any of those
operations fail, roll back any prepared nodes and error out.
3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
4. Try to commit the remote transactions.
5. Write a WAL record indicating that you committed the remote
transactions OK.

If you fail after step 1, you can straighten things out by looking at the
status of the transaction: if the transaction committed, any transactions
we intended-to-prepare need to be checked. If they are still prepared,
we need to commit them or roll them back according to what happened to
our XID.

(Andres is talking in my other ear suggesting that we ought to reuse the
2PC infrastructure to do all this. I'm not convinced that's a good idea,
but I'll let him present his own ideas here if he wants to rather than
trying to explain them myself.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
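Rendered as a schematic C sequence, the five steps above might look like
this (the helper functions are placeholders invented for illustration,
not existing backend functions; only elog and RecordTransactionCommit
are real backend names):

static void
commit_distributed(void)
{
    /* 1. Durably log intent-to-prepare, with foreign server OIDs and
     *    transaction identifiers (a hypothetical WAL record + flush). */
    log_prepare_intent_record();

    /* 2. PREPARE TRANSACTION on each remote node; on any failure, roll
     *    back whatever was already prepared and error out. */
    if (!prepare_all_remotes())
    {
        rollback_prepared_remotes();
        elog(ERROR, "could not prepare remote transactions");
    }

    /* 3. Commit locally: RecordTransactionCommit writes and flushes the
     *    commit WAL record.  From here on, the only valid outcome for
     *    the remote transactions is commit. */
    commit_local_transaction();

    /* 4. COMMIT PREPARED on each remote node, retrying until it works. */
    commit_all_remotes();

    /* 5. Durably log that the remote transactions are resolved, so crash
     *    recovery knows there is nothing left to clean up. */
    log_remote_commit_done_record();
}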
Robert Haas <robertmhaas@gmail.com> wrote:
> Andres is talking in my other ear suggesting that we ought to
> reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the transaction
must use 2PC, that is what I was saying, although apparently not clearly
enough. All nodes *including the local one* must be prepared and
committed with data about the nodes saved safely off somewhere that it
can be read in the event of a failure of any of the nodes *including the
local one*. Without that, I see this whole approach as a train wreck
just waiting to happen.

I'm not really clear on the mechanism that is being proposed for doing
this, but one way would be to have the PREPARE of the local transaction
be requested explicitly and to have that cause all FDWs participating in
the transaction to also be prepared. (That might be what Andres meant; I
don't know.) That doesn't strike me as the only possible mechanism to
drive this, but it might well be the simplest and cleanest. The
trickiest bit might be to find a good way to persist the distributed
transaction information in a way that survives the failure of the main
transaction -- or even the abrupt loss of the machine it's running on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> Andres is talking in my other ear suggesting that we ought to
>> reuse the 2PC infrastructure to do all this.
>
> If you mean that the primary transaction and all FDWs in the
> transaction must use 2PC, that is what I was saying, although
> apparently not clearly enough. All nodes *including the local one*
> must be prepared and committed with data about the nodes saved
> safely off somewhere that it can be read in the event of a failure
> of any of the nodes *including the local one*. Without that, I see
> this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to turn
around and remove it again almost immediately thereafter.

> I'm not really clear on the mechanism that is being proposed for
> doing this, but one way would be to have the PREPARE of the local
> transaction be requested explicitly and to have that cause all FDWs
> participating in the transaction to also be prepared. (That might
> be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.

> That doesn't strike me as the
> only possible mechanism to drive this, but it might well be the
> simplest and cleanest. The trickiest bit might be to find a good
> way to persist the distributed transaction information in a way
> that survives the failure of the main transaction -- or even the
> abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> Andres is talking in my other ear suggesting that we ought to
>>> reuse the 2PC infrastructure to do all this.
>>
>> If you mean that the primary transaction and all FDWs in the
>> transaction must use 2PC, that is what I was saying, although
>> apparently not clearly enough. All nodes *including the local one*
>> must be prepared and committed with data about the nodes saved
>> safely off somewhere that it can be read in the event of a failure
>> of any of the nodes *including the local one*. Without that, I see
>> this whole approach as a train wreck just waiting to happen.
>
> Clearly, all the nodes other than the local one need to use 2PC. I am
> unconvinced that the local node must write a 2PC state file only to
> turn around and remove it again almost immediately thereafter.

The key point is that the distributed transaction data must be flagged as
needing to commit rather than roll back between the prepare phase and the
final commit. If you try to avoid the PREPARE, flagging, COMMIT PREPARED
sequence by building the flagging of the distributed transaction metadata
into the COMMIT process, you still have the problem of what to do on
crash recovery. You really need to use 2PC to keep that clean, I think.

>> I'm not really clear on the mechanism that is being proposed for
>> doing this, but one way would be to have the PREPARE of the local
>> transaction be requested explicitly and to have that cause all FDWs
>> participating in the transaction to also be prepared. (That might
>> be what Andres meant; I don't know.)
>
> We want this to be client-transparent, so that the client just says
> COMMIT and everything Just Works.

What about the case where one or more nodes doesn't support 2PC? Do we
silently make the choice, without the client really knowing?

>> That doesn't strike me as the
>> only possible mechanism to drive this, but it might well be the
>> simplest and cleanest. The trickiest bit might be to find a good
>> way to persist the distributed transaction information in a way
>> that survives the failure of the main transaction -- or even the
>> abrupt loss of the machine it's running on.
>
> I'd be willing to punt on surviving a loss of the entire machine. But
> I'd like to be able to survive an abrupt reboot.

As long as people are aware that there is an urgent need to find and fix
all data stores to which clusters on the failed machine were connected
via FDW when there is a hard machine failure, I guess it is OK. In
essence we just document it and declare it to be somebody else's problem.
In general I would expect a distributed transaction manager to behave
well in the face of any single-machine failure, but if there is one
aspect of a full-featured distributed transaction manager we could give
up, I guess that would be it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jan 8, 2015 at 7:02 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>>> Also, as previously mentioned, it must behave in some reasonable
>>> way if a database is not configured to support 2PC, especially
>>> since 2PC is off by default in PostgreSQL.
>> We can have a per-foreign-server option, which says whether the
>> corresponding server is able to participate in 2PC. A transaction
>> spanning multiple foreign servers with at least one of them not
>> capable of participating in 2PC will be aborted.
>>
>> Will that work?
>>
>> In case a user flags a foreign server as capable of 2PC
>> incorrectly, I expect the corresponding FDW would raise an error
>> (either because PREPARE fails or the FDW doesn't handle that case)
>> and the transaction will be aborted anyway.
> That sounds like one way to handle it. I'm not clear on how you
> plan to determine whether 2PC is required for a transaction.
> (Apologies if it was previously mentioned and I've forgotten it.)
Any transaction involving more than one server (including the local one, I guess) will require 2PC. A transaction may modify and access a remote database but not the local one. In such a case, the state of the local transaction doesn't matter once the remote transaction is committed or rolled back.
> I don't mean to suggest that these problems are insurmountable; I
> just think that people often underestimate the difficulty of
> writing a distributed transaction manager and don't always
> recognize the problems that it will cause if all of the failure
> modes are not considered and handled.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Jan 8, 2015 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> If we are going to include a distributed transaction manager with
>> PostgreSQL, it *must* persist enough information about the
>> transaction ID and where it is used in a way that will survive a
>> subsequent crash before beginning the PREPARE on any of the
>> systems. After all nodes are PREPAREd it must flag that persisted
>> data to indicate that it is now at a point where ROLLBACK is no
>> longer an option. Only then can it start committing the prepared
>> transactions. After the last node is committed it can clear this
>> information. On start-up the distributed transaction manager must
>> check for any distributed transactions left "in progress" and
>> commit or rollback based on the preceding; doing retries
>> indefinitely until it succeeds or is told to stop.
> I think one key question here is whether all of this should be handled
> in PostgreSQL core or whether some of it should be handled in other
> ways. Is the goal to make postgres_fdw (and FDWs for other databases
> that support 2PC) persist enough information that someone *could*
> write a transaction manager for PostgreSQL, or is the goal to actually
> write that transaction manager?
>
> Just figuring out how to persist the necessary information is a
> non-trivial problem by itself. You might think that you could just
> insert a row into a local table saying, hey, I'm about to prepare a
> transaction remotely, but of course that doesn't work: if you then go
> on to PREPARE before writing and flushing the local commit record,
> then a crash before that's done leaves a dangling prepared transaction
> on the remote node. You might think to write the record, and then do
> the PREPARE only after writing and flushing the local commit record.
> But you can't do that either, because now if the PREPARE fails you've
> already committed locally.
>
> I guess what you need to do is something like:
>
> 1. Write and flush a WAL record indicating an intent to prepare, with
> a list of foreign server OIDs and GUIDs.
> 2. Prepare the remote transaction on each node. If any of those
> operations fail, roll back any prepared nodes and error out.
> 3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
> 4. Try to commit the remote transactions.
> 5. Write a WAL record indicating that you committed the remote
> transactions OK.
>
> If you fail after step 1, you can straighten things out by looking at
> the status of the transaction: if the transaction committed, any
> transactions we intended-to-prepare need to be checked. If they are
> still prepared, we need to commit them or roll them back according to
> what happened to our XID.
When you want to straighten things out and commit, the foreign server may not be available to do that. As Kevin pointed out above, we need to keep retrying to resolve (commit or roll back, based on the status of the local transaction) the PREPAREd transactions on the foreign server till they are resolved. So we will have to persist the information somewhere other than the WAL, OR keep the WAL around even after the corresponding local transaction has been committed or aborted, which I don't think is a good idea, since that will have an impact on replication and VACUUM, especially because it's going to affect the oldest transaction in the WAL.
That's where Andres's suggestion might help.
> (Andres is talking in my other ear suggesting that we ought to reuse
> the 2PC infrastructure to do all this. I'm not convinced that's a
> good idea, but I'll let him present his own ideas here if he wants to
> rather than trying to explain them myself.)
We can persist the information about a distributed transaction (which especially requires 2PC) similarly to the way the 2PC infrastructure does in the pg_twophase directory. I am still investigating whether we can re-use the existing 2PC infrastructure or not. My initial reaction is no, since 2PC persists information about the local transaction, including locked objects and WAL (?), in the pg_twophase directory, which is not required for a distributed transaction. But the rest of the mechanism, like the manner of processing the records during normal operation and recovery, looks very useful.
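For comparison, a pg_twophase-style state file for a foreign prepared transaction would need to carry much less than a local 2PC state file: essentially just enough to find and resolve the remote transaction after a crash. A hedged sketch (the field names are assumptions; the typedefs stand in for the backend's own types):

typedef unsigned int Oid;           /* stand-in for the backend typedef */
typedef unsigned int TransactionId; /* stand-in for the backend typedef */

typedef struct FdwPreparedXact
{
    TransactionId local_xid;    /* local transaction this belongs to */
    Oid           serverid;     /* foreign server to reconnect to */
    Oid           userid;       /* user mapping to connect with */
    char          gid[200];     /* identifier used in PREPARE TRANSACTION */
} FdwPreparedXact;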
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 1/8/15, 12:00 PM, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>>> Robert Haas <robertmhaas@gmail.com> wrote:
>>>> Andres is talking in my other ear suggesting that we ought to
>>>> reuse the 2PC infrastructure to do all this.
>>>
>>> If you mean that the primary transaction and all FDWs in the
>>> transaction must use 2PC, that is what I was saying, although
>>> apparently not clearly enough. All nodes *including the local one*
>>> must be prepared and committed with data about the nodes saved
>>> safely off somewhere that it can be read in the event of a failure
>>> of any of the nodes *including the local one*. Without that, I see
>>> this whole approach as a train wreck just waiting to happen.
>>
>> Clearly, all the nodes other than the local one need to use 2PC. I am
>> unconvinced that the local node must write a 2PC state file only to
>> turn around and remove it again almost immediately thereafter.
>
> The key point is that the distributed transaction data must be
> flagged as needing to commit rather than roll back between the
> prepare phase and the final commit. If you try to avoid the
> PREPARE, flagging, COMMIT PREPARED sequence by building the
> flagging of the distributed transaction metadata into the COMMIT
> process, you still have the problem of what to do on crash
> recovery. You really need to use 2PC to keep that clean, I think.

If we had an independent transaction coordinator then I agree with you
Kevin. I think Robert is proposing that if we are controlling one of the
nodes that's participating as well as coordinating the overall
transaction that we can take some shortcuts. AIUI a PREPARE means you
are completely ready to commit. In essence you're just waiting to write
and fsync the commit message. That is in fact the state that a
coordinating PG node would be in by the time everyone else has done
their prepare. So from that standpoint we're OK.

Now, as soon as ANY of the nodes commit, our coordinating node MUST be
able to commit as well! That would require it to have a real prepared
transaction of its own created. However, as long as there is zero chance
of any other prepared transactions committing before our local
transaction, that step isn't actually needed. Our local transaction will
either commit or abort, and that will determine what needs to happen on
all other nodes.

I'm ignoring the question of how the local node needs to store info
about the other nodes in case of a crash, but AFAICT you could reliably
recover manually from what I just described.

I think the question is: are we OK with "going under the skirt" in this
fashion? Presumably it would provide better performance, whereas forcing
ourselves to eat our own 2PC dogfood would presumably make it easier for
someone to plug in an external coordinator instead of using our own. I
think there's also a lot to be said for getting a partial implementation
of this available today (requiring manual recovery), so long as it's not
in core.

BTW, I found
https://www.cs.rutgers.edu/~pxk/417/notes/content/transactions.html a
useful read, specifically the 2PC portion.

>>> I'm not really clear on the mechanism that is being proposed for
>>> doing this, but one way would be to have the PREPARE of the local
>>> transaction be requested explicitly and to have that cause all FDWs
>>> participating in the transaction to also be prepared. (That might
>>> be what Andres meant; I don't know.)
>>
>> We want this to be client-transparent, so that the client just says
>> COMMIT and everything Just Works.
>
> What about the case where one or more nodes doesn't support 2PC?
> Do we silently make the choice, without the client really knowing?

We abort. (Unless we want to have a running_with_scissors GUC...)

>>> That doesn't strike me as the
>>> only possible mechanism to drive this, but it might well be the
>>> simplest and cleanest. The trickiest bit might be to find a good
>>> way to persist the distributed transaction information in a way
>>> that survives the failure of the main transaction -- or even the
>>> abrupt loss of the machine it's running on.
>>
>> I'd be willing to punt on surviving a loss of the entire machine. But
>> I'd like to be able to survive an abrupt reboot.
>
> As long as people are aware that there is an urgent need to find
> and fix all data stores to which clusters on the failed machine
> were connected via FDW when there is a hard machine failure, I
> guess it is OK. In essence we just document it and declare it to
> be somebody else's problem. In general I would expect a
> distributed transaction manager to behave well in the face of any
> single-machine failure, but if there is one aspect of a
> full-featured distributed transaction manager we could give up, I
> guess that would be it.

ISTM that one option here would be to "simply" write and sync WAL
record(s) of all externally prepared transactions. That would be enough
for a hot standby to find all the other servers and tell them to either
commit or abort, based on whether our local transaction committed or
aborted. If you wanted, you could even have the standby be responsible
for telling all the other participants to commit...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sat, Jan 10, 2015 at 9:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 1/8/15, 12:00 PM, Kevin Grittner wrote:
>> The key point is that the distributed transaction data must be
>> flagged as needing to commit rather than roll back between the
>> prepare phase and the final commit. If you try to avoid the
>> PREPARE, flagging, COMMIT PREPARED sequence by building the
>> flagging of the distributed transaction metadata into the COMMIT
>> process, you still have the problem of what to do on crash
>> recovery. You really need to use 2PC to keep that clean, I think.

Yes, 2PC is needed as long as more than 2 nodes perform write operations
within a transaction.

> If we had an independent transaction coordinator then I agree with you
> Kevin. I think Robert is proposing that if we are controlling one of the
> nodes that's participating as well as coordinating the overall transaction
> that we can take some shortcuts. AIUI a PREPARE means you are completely
> ready to commit. In essence you're just waiting to write and fsync the
> commit message. That is in fact the state that a coordinating PG node would
> be in by the time everyone else has done their prepare. So from that
> standpoint we're OK.
>
> Now, as soon as ANY of the nodes commit, our coordinating node MUST be able
> to commit as well! That would require it to have a real prepared transaction
> of its own created. However, as long as there is zero chance of any other
> prepared transactions committing before our local transaction, that step
> isn't actually needed. Our local transaction will either commit or abort,
> and that will determine what needs to happen on all other nodes.

It is a property of 2PC to ensure that a prepared transaction will
commit. Now, once it is confirmed on the coordinator that all the remote
nodes have successfully PREPAREd, the coordinator issues COMMIT PREPARED
to each node. What do you do if some nodes report ABORT PREPARED while
other nodes report COMMIT PREPARED? Do you abort the transaction on the
coordinator, commit it, or FATAL? This leaves the cluster in an
inconsistent state, meaning that some consistent cluster-wide recovery
point is needed as well (Postgres-XC and XL have introduced the concept
of barriers for such problems, stuff created first by Pavan Deolassee).

--
Michael
On 1/10/15, 7:11 AM, Michael Paquier wrote:
>> If we had an independent transaction coordinator then I agree with you
>> Kevin. I think Robert is proposing that if we are controlling one of the
>> nodes that's participating as well as coordinating the overall transaction
>> that we can take some shortcuts. AIUI a PREPARE means you are completely
>> ready to commit. In essence you're just waiting to write and fsync the
>> commit message. That is in fact the state that a coordinating PG node would
>> be in by the time everyone else has done their prepare. So from that
>> standpoint we're OK.
>>
>> Now, as soon as ANY of the nodes commit, our coordinating node MUST be able
>> to commit as well! That would require it to have a real prepared transaction
>> of its own created. However, as long as there is zero chance of any other
>> prepared transactions committing before our local transaction, that step
>> isn't actually needed. Our local transaction will either commit or abort,
>> and that will determine what needs to happen on all other nodes.
>
> It is a property of 2PC to ensure that a prepared transaction will
> commit. Now, once it is confirmed on the coordinator that all the
> remote nodes have successfully PREPAREd, the coordinator issues COMMIT
> PREPARED to each node. What do you do if some nodes report ABORT
> PREPARED while other nodes report COMMIT PREPARED? Do you abort the
> transaction on the coordinator, commit it, or FATAL? This leaves the
> cluster in an inconsistent state, meaning that some consistent
> cluster-wide recovery point is needed as well (Postgres-XC and XL have
> introduced the concept of barriers for such problems, stuff created
> first by Pavan Deolassee).

My understanding is that once you get a successful PREPARE that should
mean that it's basically impossible for the transaction to fail to
commit. If that's not the case, I fail to see how you can get any decent
level of sanity out of this...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Jan 11, 2015 at 10:37 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> My understanding is that once you get a successful PREPARE that should mean that it's basically impossible for the transaction to fail to commit. If that's not the case, I fail to see how you can get any decent level of sanity out of this...

When giving the responsibility of a group of COMMIT PREPARED to a set of nodes in a network, there could be a couple of problems showing up, of the type split-brain for example. There could be as well failures at hardware level, so you would need a mechanism ensuring that WAL is consistent among all the nodes, for example by adding a common restore point (XLOG_RESTORE_POINT) on all the nodes once PREPARE is successfully done. That's a reason why I think that the local Coordinator should use 2PC as well, to ensure a consistency point once all the remote nodes have successfully PREPAREd, and a reason why things can get complicated for either the DBA or the upper application in charge of ensuring the DB consistency even in case of critical failures.
--
Michael
On Thu, Jan 8, 2015 at 1:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> Clearly, all the nodes other than the local one need to use 2PC. I am >> unconvinced that the local node must write a 2PC state file only to >> turn around and remove it again almost immediately thereafter. > > The key point is that the distributed transaction data must be > flagged as needing to commit rather than roll back between the > prepare phase and the final commit. If you try to avoid the > PREPARE, flagging, COMMIT PREPARED sequence by building the > flagging of the distributed transaction metadata into the COMMIT > process, you still have the problem of what to do on crash > recovery. You really need to use 2PC to keep that clean, I think. I don't really follow this. You need to write a list of the transactions that you're going to prepare to stable storage before preparing any of them. And then you need to write something to stable storage when you've definitively determined that you're going to commit. But we have no current mechanism for the first thing (so reusing 2PC doesn't help) and we already have the second thing (it's the commit record itself). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 11, 2015 at 3:36 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> My understanding is that once you get a successful PREPARE that should mean that it's basically impossible for the transaction to fail to commit. If that's not the case, I fail to see how you can get any decent level of sanity out of this...
> When giving the responsibility of a group of COMMIT PREPARED to a set of nodes in a network, there could be a couple of problems showing up, of the type split-brain for example.

I think this is just confusing the issue. When a machine reports that a transaction is successfully prepared, any future COMMIT PREPARED operation *must* succeed. If it doesn't, the machine has broken its promises, and that's not OK. Period. It doesn't matter whether that's due to split-brain or sunspots or Oscar Wilde having bad breath. If you say that it's prepared, then you're not allowed to change your mind later and say that it can't be committed. If you do, then you have a broken 2PC implementation and, as Jim says, all bets are off.

Now of course nothing is certain in life except death and taxes. If you PREPARE a transaction, and then go into the data directory and corrupt the 2PC state file using dd, and then try to commit it, it might fail. But no system can survive that sort of thing, whether 2PC is involved or not; in such extraordinary situations, of course operator intervention will be required. But in a more normal situation where you just have a failover, if the failover causes your prepared transaction to come unprepared, that means your failover mechanism is broken. If you're using synchronous replication, this shouldn't happen.

> There could be as well failures at hardware level, so you would need a mechanism ensuring that WAL is consistent among all the nodes, for example by adding a common restore point (XLOG_RESTORE_POINT) on all the nodes once PREPARE is successfully done. That's a reason why I think that the local Coordinator should use 2PC as well, to ensure a consistency point once all the remote nodes have successfully PREPAREd, and a reason why things can get complicated for either the DBA or the upper application in charge of ensuring the DB consistency even in case of critical failures.

It's up to the DBA to decide whether they care about surviving complete loss of a node while having 2PC still work. If they do, they should use sync rep, and they should be fine -- the machine on which the transaction is prepared shouldn't acknowledge the PREPARE as having succeeded until the WAL is safely on disk on the standby. Most probably don't, though; that's a big performance penalty.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi All,
Here are the steps and infrastructure for achieving atomic commits across multiple foreign servers. I have tried to address most of the concerns raised in this mail thread before. Let me know, if I have left something. Attached is a WIP patch implementing the same for postgres_fdw. I have tried to make it FDW-independent.
A. Steps during transaction processing
------------------------------------------------
1. When an FDW connects to a foreign server and starts a transaction, it registers that server with a boolean flag indicating whether that server is capable of participating in a two phase commit. In the patch this is implemented using the function RegisterXactForeignServer(), which raises an error, thus aborting the transaction, if there is at least one foreign server incapable of 2PC in a multi-server transaction. This error is thrown as early as possible. If all the foreign servers involved in the transaction are capable of 2PC, the function just updates the information. As of now, in the patch the function is in the form of a stub. (A sketch of this registration follows the list below.)
Whether a foreign server is capable of 2PC can be
a. an FDW-level decision, e.g. file_fdw as of now is incapable of 2PC, but it could build the capability, which would then apply to all the servers using file_fdw
b. a decision based on server version, type etc.; the FDW can decide that by looking at the server properties for each server
c. a user decision, where the FDW can allow a user to specify it in the form of a CREATE/ALTER SERVER option. Implemented in the patch.
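For illustration, here is a minimal sketch (C, against libpq) of how an FDW might implement the registration in step 1, deciding the capability per option (c). RegisterXactForeignServer() is the patch's function, but its exact signature and the "two_phase_commit" option name are assumptions on my part, not the patch's actual definitions:

    /*
     * Sketch only: register a foreign server for transaction management when
     * the first remote transaction starts on its connection.  The signature
     * of RegisterXactForeignServer() and the "two_phase_commit" server option
     * are assumed here for illustration.
     */
    static void
    begin_remote_xact(PGconn *conn, Oid serverid, Oid userid)
    {
        ForeignServer *server = GetForeignServer(serverid);
        bool        supports_2pc = false;
        ListCell   *lc;
        PGresult   *res;

        /* Capability decided as in option (c): a CREATE/ALTER SERVER option. */
        foreach(lc, server->options)
        {
            DefElem    *def = (DefElem *) lfirst(lc);

            if (strcmp(def->defname, "two_phase_commit") == 0)
                supports_2pc = defGetBoolean(def);
        }

        /* Errors out early if a 2PC-incapable server joins a multi-server xact. */
        RegisterXactForeignServer(serverid, userid, supports_2pc);

        res = PQexec(conn, "START TRANSACTION");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            elog(ERROR, "could not start remote transaction: %s",
                 PQerrorMessage(conn));
        PQclear(res);
    }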
For a transaction involving only a single foreign server, the current code remains unaltered as two phase commit is not needed. The rest of the discussion pertains to a transaction involving more than one foreign server.
At commit or abort time, the FDW receives callbacks with the appropriate events. The FDW then takes the following actions on each event.
2. On the XACT_EVENT_PRE_COMMIT event, the FDW coins one prepared transaction id per foreign server involved and saves it in memory along with the xid, dbid, foreign server id, user mapping and foreign transaction status = PREPARING. The prepared transaction id can be anything representable as a byte string. The same information is flushed to disk to survive crashes. This is implemented in the patch as prepare_foreign_xact(). Persistent and in-memory storage and their usage are discussed later in the mail. The FDW then prepares the transaction on the foreign server (a sketch of this step follows step 4 below). If this step is successful, the foreign transaction status is changed to PREPARED. If the step is unsuccessful, the local transaction is aborted and each FDW will receive XACT_EVENT_ABORT (discussed later). The updates to the foreign transaction status need not be flushed to the disk, as they can be inferred from the status of the local transaction.
3. If the local transaction is committed, the FDW callback will get the XACT_EVENT_COMMIT event. The foreign transaction status is changed to COMMITTING. The FDW tries to commit the foreign transaction with the prepared transaction id. If the commit is successful, the foreign transaction entry is removed. If the commit is unsuccessful because of a local/foreign server crash or network failure, the foreign prepared transaction resolver takes care of committing it at a later point in time.
4. If the local transaction is aborted, the FDW callback will get the XACT_EVENT_ABORT event. At this point, the FDW may or may not have prepared a transaction on the foreign server as per step 2 above. If it has not prepared the transaction, it simply aborts the transaction on the foreign server; a server crash or network failure doesn't alter the ultimate result in this case. If the FDW has prepared the foreign transaction, it updates the foreign transaction status to ABORTING and tries to roll back the prepared transaction. If the rollback is successful, the foreign transaction entry is removed. If the rollback is not successful, the foreign prepared transaction resolver takes care of aborting it at a later point in time.
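To make step 2 concrete, here is a rough sketch of the pre-commit leg. prepare_foreign_xact() is the patch's persistence function, but the gid format and this function's signature are my own invention, not the patch's code:

    /*
     * Sketch of the XACT_EVENT_PRE_COMMIT handling in step 2.
     * prepare_foreign_xact() is the patch's function; the gid format and
     * this function's signature are assumptions.
     */
    static void
    prepare_remote_xact(PGconn *conn, Oid serverid, Oid userid, TransactionId xid)
    {
        char        gid[64];
        char        sql[128];
        PGresult   *res;

        /* Coin a prepared transaction id; any unique byte string will do. */
        snprintf(gid, sizeof(gid), "fdw_%u_%u_%u", xid, serverid, userid);

        /* Persist (xid, dbid, server, user mapping, PREPARING) first ... */
        prepare_foreign_xact(serverid, userid, gid);

        /* ... then actually prepare the transaction on the foreign server. */
        snprintf(sql, sizeof(sql), "PREPARE TRANSACTION '%s'", gid);
        res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            elog(ERROR, "could not prepare transaction on server %u: %s",
                 serverid, PQerrorMessage(conn));
        PQclear(res);
    }

Note how the entry is persisted before the PREPARE is sent, so a crash between the two steps leaves at worst an entry for a never-prepared transaction, which the resolver can detect.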
B. Foreign prepared transaction resolver
---------------------------------------------------
In the patch this is implemented as a built-in function pg_fdw_resolve(). Ideally this functionality would be run frequently by a background worker process.
The resolver looks at each entry and invokes the FDW routine to resolve the transaction. The FDW routine returns boolean status: true if the prepared transaction was resolved (committed/aborted), false otherwise.
The resolution is as follows (a rough sketch in C follows this list) -
1. If the foreign transaction status is COMMITTING or ABORTING, it commits or aborts the prepared transaction respectively through the FDW routine. If the transaction is successfully resolved, it removes the foreign transaction entry.
2. Otherwise, it checks whether the local transaction committed or aborted, updates the foreign transaction status accordingly, and takes the action described in step 1.
3. The resolver doesn't touch entries created by in-progress local transactions.
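A sketch of that decision logic, as I read it: FdwXactEntry, its status values and fdw_resolve_one() are hypothetical names, while TransactionIdIsInProgress() and TransactionIdDidCommit() are existing backend functions:

    /* Sketch of resolving one entry, following rules 1-3 above. */
    static bool
    resolve_fdw_xact(FdwXactEntry *entry)
    {
        /* Rule 3: skip entries of in-progress local transactions. */
        if (TransactionIdIsInProgress(entry->xid))
            return false;

        /* Rule 2: infer the outcome from the local transaction's fate. */
        if (entry->status != FDW_XACT_COMMITTING &&
            entry->status != FDW_XACT_ABORTING)
            entry->status = TransactionIdDidCommit(entry->xid)
                ? FDW_XACT_COMMITTING
                : FDW_XACT_ABORTING;

        /* Rule 1: the FDW routine commits/aborts the prepared transaction;
         * the caller removes the entry on success. */
        return fdw_resolve_one(entry);
    }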
If the server/backend crashes after it has registered the foreign transaction entry (during step A.2) but before preparing the transaction, we will be left with a prepared transaction id which was never prepared on the foreign server. Similarly, if the server/backend crashes after it has resolved the foreign prepared transaction but before removing the entry, the same situation can arise. The FDW should detect these situations, when the foreign server complains about non-existent prepared transaction ids, and consider such foreign transactions resolved.
After looking at all the entries the resolver flushes the entries to the disk, so as to retain the latest status across shutdown and crash.
C. Other methods and infrastructure
------------------------------------------------
1. Method to show the current foreign transaction entries (in progress or waiting to be resolved). Implemented as function pg_fdw_xact() in the patch.
2. Method to drop foreign transaction entries in case they are resolved by user/DBA themselves. Not implemented in the patch.
3. Method to prevent altering or dropping a foreign server and user mapping used to prepare a foreign transaction till the latter gets resolved. Not implemented in the patch. While altering or dropping the foreign server or user mapping, that portion of the code needs to check if there exists a foreign transaction entry depending upon the foreign server or user mapping, and should error out.
4. The information about the xid needs to be available till it is decided whether to commit or abort the foreign transaction, and that decision is persisted. That should put some constraint on xid wraparound or the oldest active transaction. Not implemented in the patch.
5. Method to propagate the foreign transaction information to the slave.
D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered the following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved and manipulated in shared memory. They are written to file whenever persistence is necessary e.g. while registering the foreign transaction in step A.2. Requirements C.1, C.2 need some SQL interface in the form of built-in functions or SQL commands.
The patch implements the in-memory foreign transaction table as a fixed size array of foreign transaction entries (similar to prepared transaction entries in twophase.c). This puts a restriction on the number of foreign prepared transactions that need to be maintained at a time. We need separate locks to synchronize the access to the shared memory; the patch uses only a single LWLock. There is a restriction on the length of the prepared transaction id (or prepared transaction information saved by the FDW, to be general), since everything is being saved in fixed size memory. We may be able to overcome that restriction by writing this information to separate files (one file per foreign prepared transaction). We need to take the same route as 2PC for C.5.
2. New catalog - This method takes out the need to have a separate method for C1, C5 and even C2; also the synchronization will be taken care of by row locks, and there will be no limit on the number of foreign transactions or the size of foreign prepared transaction information. But the big problem with this approach is that the changes to the catalogs are atomic with the local transaction. If a foreign prepared transaction can not be aborted while the local transaction is rolled back, that entry needs to be retained. But since the local transaction is aborting, the corresponding catalog entry would become invisible and thus unavailable to the resolver (alas! we do not have autonomous transaction support). We may be able to overcome this by simulating an autonomous transaction through a background worker (which can also act as a resolver). But the amount of communication and synchronization might affect the performance.
A mixed approach, where the backend shifts the entries from the storage in approach 1 to the catalog, thus lifting the constraints on size, is possible but very complicated.
Any other ideas for using a catalog table as the persistent storage here? Does anybody think a catalog table is a viable option?
3. WAL records - Since the algorithm follows "write ahead of action", WAL seems to be a possible way to persist the foreign transaction entries. But WAL records can not be used for repeated scans as is required by the foreign transaction resolver. Also, replaying WAL is controlled by checkpoints, so not all WAL records are replayed. If a checkpoint happens while a foreign prepared transaction remains unresolved, the corresponding WAL records will never be replayed, thus causing the foreign prepared transaction to remain unresolved forever without a clue. So, WALs alone don't seem to be a fit here.
The algorithm relies on the FDWs to take the right steps to a large extent, rather than controlling each step explicitly. It expects the FDWs to take the right steps for each event and call the right functions to manipulate the foreign transaction entries. It does not ensure the correctness of these steps by, say, examining the foreign transaction entries in response to each event, or by making the callback return the information and manipulating the entries within the core. I am willing to go the stricter but more intrusive route if others also think that way. Otherwise, the current approach is less intrusive and I am fine with that too.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
Added to 2015-06 commitfest to attract some reviews and comments.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
> Hi All,
>
> Here are the steps and infrastructure for achieving atomic commits across multiple foreign servers. I have tried to address most of the concerns raised in this mail thread before. Let me know, if I have left something. Attached is a WIP patch implementing the same for postgres_fdw. I have tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. This is going to need good documentation, explaining how two-phase commit works in general, how it's implemented, how to monitor it etc. It's important to explain all the possible failure scenarios where you're left with in-doubt transactions, and how the DBA can resolve them.

Since we're building a Transaction Manager into PostgreSQL, please put a lot of thought on what kind of APIs it provides to the rest of the system. APIs for monitoring it, configuring it, etc. And how an extension could participate in a transaction, without necessarily being an FDW.

Regarding the configuration, there are many different behaviours that an FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.

2. Transactions are not supported. All updates happen immediately regardless of the local transaction.

3. Transactions are supported, but two-phase commit is not. There are three different ways we can use the remote transactions in that case:

3.1. Commit the remote transaction before local transaction.
3.2. Commit the remote transaction after local transaction.
3.3. As long as there is only one such FDW involved, we can still do safe two-phase commit using so-called Last Resource Optimization.

4. Full two-phase commit support

We don't necessarily have to support all of that, but let's keep all these cases in mind when we design how to configure FDWs. There's more to it than "does it support 2PC".

> A. Steps during transaction processing
> ------------------------------------------------
> [...]
> For a transaction involving only a single foreign server, the current code remains unaltered as two phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction updated anything in the local server and in even one foreign server.

> D. Persistent and in-memory storage considerations
> --------------------------------------------------------------------
> I considered following options for persistent storage
> 1. in-memory table and file(s) - The foreign transaction entries are saved and manipulated in shared memory. They are written to file whenever persistence is necessary e.g. while registering the foreign transaction in step A.2. [...]
>
> The patch implements the in-memory foreign transaction table as a fixed size array of foreign transaction entries (similar to prepared transaction entries in twophase.c). [...] We may be able to overcome that restriction by writing this information to separate files (one file per foreign prepared transaction). We need to take the same route as 2PC for C.5.

Your current approach with a file that's flushed to disk on every update has a few problems. Firstly, it's not crash safe. Secondly, if you make it crash-safe with fsync(), performance will suffer. You're going to need several fsyncs per commit with 2PC anyway, there's no way around that, but the scalable way to do that is to use the WAL so that one fsync() can flush more than one update in one operation.

So I think you'll need to do something similar to the pg_twophase files. WAL-log each update, and only flush the file/files to disk on a checkpoint. Perhaps you could use the pg_twophase infrastructure for this directly, by essentially treating every local transaction as a two-phase transaction, with some extra flag to indicate that it's an internally-created one.

> 2. New catalog - This method takes out the need to have a separate method for C1, C5 and even C2; also the synchronization will be taken care of by row locks [...] We may be able to overcome this by simulating an autonomous transaction through a background worker (which can also act as a resolver). But the amount of communication and synchronization might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid, ignoring MVCC. Not sure how well that would work.

> 3. WAL records - Since the algorithm follows "write ahead of action", WAL seems to be a possible way to persist the foreign transaction entries. [...] So, WALs alone don't seem to be a fit here.

Right. The pg_twophase files solve that exact same issue.

There is clearly a lot of work to do here. I'm marking this as Returned with Feedback in the commitfest, I don't think more review is going to be helpful at this point.

- Heikki
Hi All,
I have been working on improving the previous implementation and addressing the TODOs in my previous mail. Let me explain the approach first; I will get to Heikki's comments later in the same mail.

Hooks and GUCs
==============
The patch introduces a GUC atomic_foreign_transaction, which when ON ensures atomic commit for foreign transactions, and when OFF does not. The value of this GUC at the time of committing or preparing the local transaction is used. This gives applications the flexibility to choose the behaviour as late in the transaction as possible. This GUC has no effect if there are no foreign servers involved in the transaction.
Another GUC max_fdw_transactions sets the maximum number of transactions that can be simultaneously prepared on all the foreign servers. This limits the memory required for remembering the prepared foreign transactions.
Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given foreign server connection. An FDW which doesn't want to support this feature can keep this hook undefined (NULL). When defined the hook should return a unique identifier for the transaction prepared on the foreign server. The identifier should be unique enough not to conflict with currently prepared or future transactions. This point will be clear when discussing phase 2 of 2PC.
2. HandleForeignTransaction: to end a transaction in a specified way. The hook should be able to prepare/commit/rollback the current running transaction on a given connection, or commit/rollback a previously prepared transaction. This is described in detail while describing phase two of two-phase commit. The function is required to return a boolean status of whether the requested operation was successful or not. The function or its minions should not raise any error on failure, so as not to interfere with the distributed transaction processing. This point will be clarified more in the description below; a rough sketch of both hooks follows.
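To make the two hooks concrete, here is a rough sketch of how postgres_fdw might implement them. The hook signatures, the FdwXactAction values and the gid format are assumptions on my part; the one firm contract taken from the description above is that HandleForeignTransaction returns a boolean and never throws:

    /* Sketch of GetPrepareId: coin an id unique across current and future
     * transactions.  psprintf() is an existing backend helper. */
    static char *
    postgresGetPrepareId(Oid serverid, Oid userid, TransactionId xid, int *idlen)
    {
        char       *gid = psprintf("px_%u_%u_%u", xid, serverid, userid);

        *idlen = strlen(gid);
        return gid;
    }

    /* Sketch of HandleForeignTransaction: end the transaction as requested,
     * reporting success or failure instead of raising an error. */
    static bool
    postgresHandleForeignTransaction(PGconn *conn, const char *gid,
                                     FdwXactAction action)
    {
        char        sql[256];
        PGresult   *res;
        bool        ok;

        switch (action)
        {
            case FDW_XACT_PREPARE:
                snprintf(sql, sizeof(sql), "PREPARE TRANSACTION '%s'", gid);
                break;
            case FDW_XACT_COMMIT_PREPARED:
                snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
                break;
            case FDW_XACT_ROLLBACK_PREPARED:
                snprintf(sql, sizeof(sql), "ROLLBACK PREPARED '%s'", gid);
                break;
            case FDW_XACT_COMMIT:
                snprintf(sql, sizeof(sql), "COMMIT TRANSACTION");
                break;
            default:
                snprintf(sql, sizeof(sql), "ROLLBACK TRANSACTION");
                break;
        }

        res = PQexec(conn, sql);
        ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        PQclear(res);
        return ok;              /* no ereport(ERROR): failures are retried later */
    }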
Achieving atomic commit
===================
If atomic_foreign_transaction is enabled, the two-phase commit protocol is used to achieve atomic commit for a transaction involving foreign servers. All the foreign servers participating in such a transaction should be capable of participating in the two-phase commit protocol. If not, the local and foreign transactions are aborted, as atomic commit can not be guaranteed.
Phase 1
-----------
Every FDW needs to register the connection while starting a new transaction on a foreign connection (RegisterXactForeignServer()). A foreign server connection is identified by the foreign server oid and the local user oid (similar to the entry cached by postgres_fdw). While registering, the FDW also tells whether the foreign server is capable of participating in the two-phase commit protocol. How to decide that is left entirely to the FDW. An FDW like file_fdw may not have 2PC support at all, so none of its foreign servers comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW like postgres_fdw can have some of its servers compliant and some not, depending upon server version, configuration (max_prepared_transactions = 0) etc. An FDW can also decide not to register its connections at all, and the foreign servers belonging to that FDW will not be considered by the core at all.
During pre-commit processing, the following steps are executed:
1. The GetPrepareId hook is called on each of the registered connections to get the identifier that will be used to prepare the transaction.
2. For each connection, the prepared transaction id along with the connection information, database id and local transaction id (xid) is recorded in memory.
3. This is logged in XLOG. If a standby is configured, it is replayed on the standby. In case of master failover, the standby is able to resolve in-doubt prepared transactions created by the master.
4. The information is written to an on-disk file in pg_fdw_xact/ directory. This directory contains one file per prepared transaction on foreign connection. The file is fsynced during checkpoint similar to pg_twophase files. The file management in this directory is similar to the way, files are managed in pg_twophase.
5. HandleForeignTransaction is called to prepare the transaction on the given connection with the identifier provided by GetPrepareId().
If the server crashes after step 5, we will remember the transaction prepared on the foreign server and will try to abort it after recovery. If it crashes after step 3 but before completing step 5, we will remember a transaction that was never prepared and try to resolve it later. This scenario is described in the discussion of phase 2.
If any of the steps fail, including the PREPARE on the foreign server itself, the local transaction will be aborted. All the transactions already prepared on foreign servers will be aborted as described in the phase 2 discussion below. Transactions not yet prepared are rolled back using the same hook. If step 5 fails, the prepared foreign transaction entry is removed from memory and disk following steps 2, 3 and 4 of phase 2. HandleForeignTransaction throwing an error would interfere with this, so it is not expected to throw an error.
If the transactions are prepared on all the foreign servers successfully, we enter phase 2 of 2PC.
The local transaction is not required to be prepared per se.
Phase 2
-----------
After the local transaction has committed or aborted, the foreign transactions prepared as part of it are committed or aborted respectively. Committing or aborting a prepared foreign transaction is henceforth collectively termed "resolving", for simplicity. The following steps are executed while resolving a foreign prepared transaction (a sketch follows the list).
1. Resolve the foreign prepared transaction on the corresponding foreign server, using the user mapping of the local user used at the time of preparing the transaction. This is done through the hook HandleForeignTransaction().
2. If the resolution is successful, remove the prepared foreign transaction entry from memory.
3. Log about removal of entry in XLOG. When this log is replayed during recovery or in standby mode, it executes step 4 below.
4. Remove the corresponding file from pg_fdw_xact directory.
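Put together, one resolution attempt might look like the sketch below. remove_fdw_xact_entry(), XLogFdwXactRemove() and unlink_fdw_xact_file() are hypothetical names for steps 2-4; only HandleForeignTransaction comes from the proposal above:

    /* Sketch of steps 1-4 for a single prepared foreign transaction. */
    if (HandleForeignTransaction(conn, entry->gid,
                                 local_xact_committed
                                 ? FDW_XACT_COMMIT_PREPARED
                                 : FDW_XACT_ROLLBACK_PREPARED))
    {
        remove_fdw_xact_entry(entry);   /* step 2: drop the in-memory entry */
        XLogFdwXactRemove(entry);       /* step 3: WAL-log the removal */
        unlink_fdw_xact_file(entry);    /* step 4: remove the pg_fdw_xact file */
    }
    /* On failure the entry is left untouched for a later retry. */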
If the resolution is unsuccessful, the entry is left untouched. Since this phase is carried out when no transaction exists, HandleForeignTransaction should not throw an error and should be designed not to access the database while performing this operation.
In case the server crashes after step 1 and before step 3, a resolved foreign transaction will be considered unresolved when the local server recovers or the standby takes over the master. It will try to resolve the prepared transaction again and should get an error from the foreign server. The HandleForeignTransaction hook should treat this as normal and return true, since the prepared transaction is resolved (or rather, there is nothing that can be done). For such cases it is important that GetPrepareId returns a transaction identifier which does not conflict with a future transaction id, lest we resolve (possibly with the wrong outcome) a prepared transaction which shouldn't be resolved.
Any crash or connection failure in phase 2 leaves the prepared transaction in unresolved state.
Resolving unresolved foreign transactions
================================
A local/foreign server crash or connection failure after a transaction is prepared on the foreign server leaves that transaction in an unresolved state. The patch provides a built-in function pg_fdw_resolve() to resolve those after recovering from the failure. This built-in scans all the prepared transactions in memory and decides the fate (commit/rollback) of each based on the fate of the local transaction that prepared it on the foreign server. It does not touch entries corresponding to in-progress local transactions. It then executes the same steps as phase 2 to resolve the prepared foreign transactions. Since foreign server information is contained within a database, the function only touches the entries corresponding to the database from which it is invoked. A user can configure a daemon or cron job to execute this function frequently from various databases. Alternatively, the user can use the contrib module pg_fdw_xact_resolver, which does the same using the background worker mechanism. This module needs to be installed and listed in shared_preload_libraries to start the daemon automatically at startup.
A foreign server or user mapping corresponding to an unresolved foreign transaction is not allowed to be dropped or altered until the foreign transaction is resolved. This is required to retain the connection properties needed to resolve the prepared transaction on the foreign server.
Crash recovery
============
During crash recovery, the files in pg_fdw_xact/ are created or removed when the corresponding WAL records are replayed. After the redo is done, the pg_fdw_xact directory is scanned for unresolved foreign prepared transactions. The files in this directory are named as a triplet (xid, foreign server oid, user oid) to create a unique name for each file. This scan also emits the oldest transaction id with an unresolved prepared foreign transaction. This affects the oldest active transaction id, since the status of that transaction id is required to decide the fate of the unresolved prepared foreign transaction.
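For instance, the unique file name could be built from that triplet along these lines (a sketch; the actual format in the patch may differ):

    /* Sketch: pg_fdw_xact file name from the (xid, server, user) triplet. */
    static void
    FdwXactFilePath(char *path, TransactionId xid, Oid serverid, Oid userid)
    {
        snprintf(path, MAXPGPATH, "pg_fdw_xact/%u_%u_%u", xid, serverid, userid);
    }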
On the standby, during WAL replay, the files are just created or removed. If the standby is required to finish recovery and take over as the master, pg_fdw_xact is scanned to read unresolved foreign prepared transactions into shared memory.
Preparing a transaction involving foreign server(s) on the local server
=================================================
While PREPARE-ing a local transaction that involves foreign servers, the transactions are prepared on the foreign servers (as described in phase 1 above), if atomic_foreign_transaction is enabled. If the GUC is disabled, such local transactions can not be prepared (as of this patch, at least). This also means that all the foreign servers participating in the transaction to be prepared are required to support 2PC. While committing/rolling back the prepared transaction, the corresponding foreign prepared transactions are committed or rolled back (as described in phase 2) respectively. Any unresolved foreign transactions are resolved the same way as above.
View for checking the current foreign prepared transactions
=============================================
A built-in function pg_fdw_xact() lists all the currently prepared foreign transactions. This function does not list anything on a standby while it is replaying WAL, since the standby doesn't have any entries in memory. A convenience view pg_fdw_xacts lists the same with the oids converted to names.
Handling non-atomic foreign transactions
===============================
When atomic_foreign_transaction is disabled, the one-phase commit protocol is used to commit/rollback the foreign transactions. After the local transaction has committed/aborted, all the foreign transactions on the registered foreign connections are committed or aborted respectively using the hook HandleForeignTransaction. Failing to commit a foreign transaction does not affect the other foreign transactions; they are still attempted (if the local transaction commits); see the sketch below.
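A sketch of that one-phase path (the connection list and its element type are hypothetical; HandleForeignTransaction is the hook described earlier):

    /*
     * Sketch: after the local COMMIT, try to commit every registered remote
     * transaction; a failure on one server must not stop the others.
     */
    ListCell   *lc;

    foreach(lc, registered_fdw_connections)
    {
        FdwConnection *fdwconn = (FdwConnection *) lfirst(lc);

        if (!HandleForeignTransaction(fdwconn->conn, NULL, FDW_XACT_COMMIT))
            elog(WARNING, "could not commit remote transaction on server %u",
                 fdwconn->serverid);
    }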
PITR
====
PITR may rewind the database to a point before an xid associated with an unresolved foreign transaction. There are two approaches to deal with the situation.
1. Just forget about the unresolved foreign transaction and remove the file just like we do for a prepared local transaction. But then the prepared transaction on the foreign server might be left unresolved forever and will keep holding the resources.
2. Do not allow PITR to such a point. We can not get rid of the transaction id without getting rid of the prepared foreign transaction. If we did, we might create conflicting files in the future and might resolve the transaction with the wrong outcome.
Rest of the mail contains replies to Heikki's comments.
On Tue, Jul 7, 2015 at 2:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
>> Hi All,
>> Here are the steps and infrastructure for achieving atomic commits across
>> multiple foreign servers. I have tried to address most of the concerns
>> raised in this mail thread before. Let me know, if I have left something.
>> Attached is a WIP patch implementing the same for postgres_fdw. I have
>> tried to make it FDW-independent.
>
> Wow, this is going to be a lot of new infrastructure. This is going to need good documentation, explaining how two-phase commit works in general, how it's implemented, how to monitor it etc. It's important to explain all the possible failure scenarios where you're left with in-doubt transactions, and how the DBA can resolve them.
I have included some documentation in the patch. Once we agree on the functionality and design, I will improve the documentation further.
> Since we're building a Transaction Manager into PostgreSQL, please put a lot of thought on what kind of APIs it provides to the rest of the system. APIs for monitoring it, configuring it, etc. And how an extension could participate in a transaction, without necessarily being an FDW.
The patch has added all of it except the extension thing. Let me know if anything is missing.
Even today an extension can participate in a transaction by registering transaction and subtransaction callbacks. So, as long as an extension (and so some FDW) does things such that failures in those do not affect atomicity, it can use these callbacks. However, these callbacks are not enough to handle unresolved prepared transactions or connectivity failures in phase 2. The patch adds infrastructure to do that.
dblink might be something on your mind, but to support dblink here we would need a very liberal format for storing information about the prepared transactions on other servers. This format would vary from extension to extension, and may not be very useful as above. What we might be able to do is expose the functions for creating files for prepared transactions and logging about them, and let the extension use them. BTW, dblink_plus already supports 2PC for dblink.
> Regarding the configuration, there are many different behaviours that an FDW could implement:
> 1. The FDW is read-only. Commit/abort behaviour is moot.
I can think of two flavours of a read-only FDW: 1. the underlying data is read-only; 2. the FDW is read-only but the underlying data is not.
In the first case, the FDW may choose not to participate in transaction management at all, and so doesn't register the foreign connections. The rest of the transaction will still be atomic.
In the second case however, the writes to other foreign servers may depend upon what has been read from the read-only FDW, esp. in repeatable read and higher isolation levels. So it's important that the data once read remains intact till the transaction commits or at least is prepared, implying we have to start a transaction on the read-only foreign server. Once the other foreign transactions get prepared, we might be able to commit the transaction on the read-only foreign server. That optimization is not yet implemented by my patch. But it should be possible to do in the approach taken by the patch. Can we leave that as a future enhancement?
Does that solve your concern?
> 2. Transactions are not supported. All updates happen immediately regardless of the local transaction.
An FDW can choose not to register its server and local PostgreSQL won't know about it. Is that acceptable behaviour?
> 3. Transactions are supported, but two-phase commit is not. There are three different ways we can use the remote transactions in that case:
This case is supported by using the GUC atomic_foreign_transaction. The patch implements the 3.2 approach.
> 3.1. Commit the remote transaction before local transaction.
> 3.2. Commit the remote transaction after local transaction.
> 3.3. As long as there is only one such FDW involved, we can still do safe two-phase commit using so-called Last Resource Optimization.
IIUC LRO, the patch uses the local transaction as the last resource, which is always present. The fate of the foreign transactions is decided by the fate of the local transaction, which is not required to be prepared per se. There is a more relevant note later.
> 4. Full two-phase commit support
>
> We don't necessarily have to support all of that, but let's keep all these cases in mind when we design how to configure FDWs. There's more to it than "does it support 2PC".
>> A. Steps during transaction processing
>> ------------------------------------------------
>> [...]
>> For a transaction involving only a single foreign server, the current code
>> remains unaltered as two phase commit is not needed.
> Just to be clear: you also need two-phase commit if the transaction updated anything in the local server and in even one foreign server.
Any local transaction involving a foreign server transaction uses two-phase commit for the foreign transaction. The local transaction is not prepared per se. However, we should be able to optimize the case when there are no local changes. I am not able to find a way to deduce that there was no local change, so I have left that case as it is in this patch. Is there a way to know whether a local transaction changed something locally or not?
D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to a file whenever
persistence is necessary, e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.
The patch implements the in-memory foreign transaction table as a fixed-size
array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on the number of foreign
prepared transactions that can be maintained at a time. We need
separate locks to synchronize access to the shared memory; the patch
uses only a single LWLock. There is a restriction on the length of the
prepared transaction id (or, more generally, the prepared transaction
information saved by the FDW), since everything is saved in fixed-size
memory. We may be able to overcome that restriction by writing this
information to separate files (one file per foreign prepared transaction).
We need to take the same route as 2PC for C.5.
Your current approach with a file that's flushed to disk on every update has a few problems. Firstly, it's not crash safe. Secondly, if you make it crash-safe with fsync(), performance will suffer. You're going to need several fsyncs per commit with 2PC anyway, there's no way around that, but the scalable way to do that is to use the WAL, so that one fsync() can flush more than one update in one operation.
So I think you'll need to do something similar to the pg_twophase files. WAL-log each update, and only flush the file/files to disk on a checkpoint. Perhaps you could use the pg_twophase infrastructure for this directly, by essentially treating every local transaction as a two-phase transaction, with some extra flag to indicate that it's an internally-created one.
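A minimal sketch of that checkpoint-time flush, modeled on CheckPointTwoPhase() in twophase.c; all the FdwXact names here are hypothetical:

void
CheckPointFdwXacts(XLogRecPtr redo_horizon)
{
    int         i;

    for (i = 0; i < num_fdw_xacts; i++)
    {
        FdwXact     fdw_xact = fdw_xacts[i];

        /*
         * Skip entries already flushed to their own file, and entries
         * whose WAL record lies beyond the redo horizon -- the latter
         * will be replayed from WAL after a crash anyway.
         */
        if (fdw_xact->ondisk || fdw_xact->insert_end_lsn > redo_horizon)
            continue;

        RecreateFdwXactFile(fdw_xact);  /* write + fsync the state file */
        fdw_xact->ondisk = true;
    }
}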
I have used an approach similar to pg_twophase, but implemented it as separate code, as the requirements differ. But I would like to minimize code by unifying both if we finalise this design. Suggestions in this regard will be very helpful.
2. New catalog - This method takes out the need to have separate methods for
C1, C5 and even C2; synchronization will be taken care of by row locks, and
there will be no limit on the number of foreign transactions or on the size
of the foreign prepared transaction information. But a big problem with this
approach is that changes to the catalogs are atomic with the local
transaction. If a foreign prepared transaction can not be aborted while the
local transaction is rolled back, that entry needs to be retained. But since
the local transaction is aborting, the corresponding catalog entry would
become invisible and thus unavailable to the resolver (alas! we do not have
autonomous transaction support). We may be able to overcome this by
simulating an autonomous transaction through a background worker (which
could also act as a resolver). But the amount of communication and
synchronization might affect performance.
Or you could insert/update the rows in the catalog with xmin=FrozenXid, ignoring MVCC. Not sure how well that would work.
I am not aware how to do that. Do we have any precedent in the code? Something like a reference implementation that I can follow. It will help to lift two restrictions:
1. Restriction on the number of simultaneously prepared foreign transactions.
2. Restriction on the prepared transaction identifier length.
Obviously, we may also be able to shed a lot of code related to file management, lookup etc.
3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records can not be used for the repeated scans required by the foreign
transaction resolver. Also, replaying WAL is controlled by checkpoints, so
not all WAL records are replayed. If a checkpoint happens while a foreign
prepared transaction remains unresolved, the corresponding WAL records will
never be replayed, causing the foreign prepared transaction to remain
unresolved forever without a clue. So, WAL alone doesn't seem to be a fit here.
Right. The pg_twophase files solve that exact same issue.
There is clearly a lot of work to do here.
I'm marking this as Returned with Feedback in the commitfest; I don't think more review is going to be helpful at this point.
That's sad. I hope people will review the patch and help it improve, even if it's out of the commitfest.
- Heikki
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
Overall, you seem to have made some significant progress on the design since the last version of this patch. There's probably a lot left to do, but the design seems more mature now. I haven't read the code, but here are some comments based on the email.

On Thu, Jul 9, 2015 at 6:18 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> The patch introduces a GUC atomic_foreign_transaction, which when ON ensures
> atomic commit for foreign transactions, otherwise not. The value of this GUC
> at the time of committing or preparing a local transaction is used. This
> gives applications the flexibility to choose the behaviour as late in the
> transaction as possible. This GUC has no effect if there are no foreign
> servers involved in the transaction.

Hmm. I'm not crazy about that name, but I don't have a better idea either.

One thing about this design is that it makes atomicity a property of the transaction rather than the server. That is, any given transaction is either atomic with respect to all servers or atomic with respect to none. You could also design this the other way: each server is either configured to do atomic commit, or not. When a transaction is committed, it prepares on those servers which are configured for it, and then commits the others. So then you can have a "partially atomic" transaction where, for example, you transfer money from account A to account B (using one or more FDW connections that support atomic commit) and also use twitter_fdw to tweet about it (using an FDW connection that does NOT support atomic commit). The tweet will survive even if the local commit fails, but that's OK. You could even do this at table granularity: we'll prepare the transaction if at least one foreign table involved in the transaction has atomic_commit = true.

In some sense I think this might be a nicer design, because suppose you connect to a foreign server and mostly just log stuff but occasionally do important things there. In your design, you can do this, but you'll need to make sure atomic_foreign_transaction is set for the correct set of transactions. But in what I'm proposing here we might be able to derive the correct value mostly automatically.

We should consider other possible designs as well; the choices we make here may have a significant impact on usability.

> Another GUC max_fdw_transactions sets the maximum number of transactions
> that can be simultaneously prepared on all the foreign servers. This limits
> the memory required for remembering the prepared foreign transactions.

How about max_prepared_foreign_transactions?

> Two new FDW hooks are introduced for transaction management.
> 1. GetPrepareId: to get the prepared transaction identifier for a given
> foreign server connection. An FDW which doesn't want to support this feature
> can keep this hook undefined (NULL). When defined the hook should return a
> unique identifier for the transaction prepared on the foreign server. The
> identifier should be unique enough not to conflict with currently prepared
> or future transactions. This point will be clear when discussing phase 2 of
> 2PC.
>
> 2. HandleForeignTransaction: to end a transaction in specified way. The hook
> should be able to prepare/commit/rollback current running transaction on
> given connection or commit/rollback a previously prepared transaction. This
> is described in detail while describing phase two of two-phase commit. The
> function is required to return a boolean status of whether the requested
> operation was successful or not. The function or its minions should not
> raise any error on failure so as not to interfere with the distributed
> transaction processing. This point will be clarified more in the description
> below.

HandleForeignTransaction is not very descriptive, and I think you're jamming together things that ought to be separated. Let's have a PrepareForeignTransaction and a ResolvePreparedForeignTransaction.

> A foreign server, user mapping corresponding to an unresolved foreign
> transaction is not allowed to be dropped or altered until the foreign
> transaction is resolved. This is required to retain the connection
> properties which need to resolve the prepared transaction on the foreign
> server.

I agree with not letting it be dropped, but I think not letting it be altered is a serious mistake. Suppose the foreign server dies in a fire, its replica is promoted, and we need to re-point the master at the replica's hostname or IP.

> Handling non-atomic foreign transactions
> ===============================
> When atomic_foreign_transaction is disabled, one-phase commit protocol is
> used to commit/rollback the foreign transactions. After the local
> transaction has committed/aborted, all the foreign transactions on the
> registered foreign connections are committed or aborted resp. using hook
> HandleForeignTransaction. Failing to commit a foreign transaction does not
> affect the other foreign transactions; they are still tried to be committed
> (if the local transaction commits).

Is this a change from the current behavior? What if we call the first commit handler and it throws an ERROR? Presumably then nothing else gets committed, and the transaction overall aborts.

> PITR
> ====
> PITR may rewind the database to a point before an xid associated with an
> unresolved foreign transaction. There are two approaches to deal with the
> situation.
> 1. Just forget about the unresolved foreign transaction and remove the file
> just like we do for a prepared local transaction. But then the prepared
> transaction on the foreign server might be left unresolved forever and will
> keep holding the resources.
> 2. Do not allow PITR to such point. We can not get rid of the transaction id
> without getting rid of prepared foreign transaction. If we do so, we might
> create conflicting files in future and might resolve the transaction with
> wrong outcome.

I don't think either of these is correct. The database shouldn't behave differently when PITR is used than when it isn't. Otherwise you are not doing what it says on the tin: recovering to the chosen point in time. I recommend adding a function that forgets about a foreign prepared transaction and making it the DBA's job to figure out whether to call it in a particular scenario. After all, the remote machine might have been subjected to PITR, too. Or maybe not. We can't know, so we should give the DBA the tools to clean things up and leave it at that.

> IIUC LRO, the patch uses the local transaction as last resource, which is
> always present. The fate of foreign transaction is decided by the fate of
> the local transaction, which is not required to be prepared per se. There
> is more relevant note later.

Personally, I think that's perfectly fine. We could do more later if we wanted to, but there's plenty to like here without that.

>> Just to be clear: you also need two-phase commit if the transaction
>> updated anything in the local server and in even one foreign server.
>
> Any local transaction involving a foreign server transaction uses two-phase
> commit for the foreign transaction. The local transaction is not prepared
> per se. However, we should be able to optimize a case, when there are no
> local changes. I am not able to find a way to deduce that there was no local
> change, so I have left that case in this patch. Is there a way to know
> whether a local transaction changed something locally or not?

You might check whether it wrote any WAL. There's a global variable for that somewhere; RecordTransactionCommit() uses it. But I don't think this is an essential optimization for v1, either.

> I have used approach similar to pg_twophase, but implemented it as a
> separate code, as the requirements differ. But, I would like to minimize
> code by unifying both, if we finalise this design. Suggestions in this
> regard will be very helpful.

-1 for trying to unify those unless it's really clear that it's a good idea. I bet it's not.

>> Or you could insert/update the rows in the catalog with xmin=FrozenXid,
>> ignoring MVCC. Not sure how well that would work.
>
> I am not aware how to do that. Do we have any precedent in the code?

No. I bet that's also a bad idea. A non-transactional table is a good idea that has been proposed before, but let's not try to invent it in this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jul 17, 2015 at 10:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Overall, you seem to have made some significant progress on the design
since the last version of this patch. There's probably a lot left to
do, but the design seems more mature now. I haven't read the code,
but here are some comments based on the email.
Thanks for your comments.
I have incorporated most of your suggestions (marked as Done) in the attached patch.
On Thu, Jul 9, 2015 at 6:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The patch introduces a GUC atomic_foreign_transaction, which when ON ensures
> atomic commit for foreign transactions, otherwise not. The value of this GUC
> at the time of committing or preparing a local transaction is used. This
> gives applications the flexibility to choose the behaviour as late in the
> transaction as possible. This GUC has no effect if there are no foreign
> servers involved in the transaction.
Hmm. I'm not crazy about that name, but I don't have a better idea either.
One thing about this design is that it makes atomicity a property of
the transaction rather than the server. That is, any given
transaction is either atomic with respect to all servers or atomic
with respect to none. You could also design this the other way: each
server is either configured to do atomic commit, or not. When a
transaction is committed, it prepares on those servers which are
configured for it, and then commits the others. So then you can have
a "partially atomic" transaction where, for example, you transfer
money from account A to account B (using one or more FDW connections
that support atomic commit) and also use twitter_fdw to tweet about it
(using an FDW connection that does NOT support atomic commit). The
tweet will survive even if the local commit fails, but that's OK. You
could even do this at table granularity: we'll prepare the transaction
if at least one foreign table involved in the transaction has
atomic_commit = true.
In some sense I think this might be a nicer design, because suppose
you connect to a foreign server and mostly just log stuff but
occasionally do important things there. In your design, you can do
this, but you'll need to make sure atomic_foreign_transaction is set
for the correct set of transactions. But in what I'm proposing here
we might be able to derive the correct value mostly automatically.
A user may set atomic_foreign_transaction to ON to guarantee atomicity; IOW, it throws an error when atomicity can not be guaranteed. Thus if an application accidentally does something to a foreign server which doesn't support 2PC, the transaction would abort. A user may set it to OFF (consciously, taking responsibility for the result) so as not to use 2PC (probably to reduce the overheads) even if the foreign server is 2PC compliant. So, I thought a GUC would be necessary. We can incorporate the behaviour you are suggesting by having atomic_foreign_transaction accept three values: "full" (ON behaviour), "partial" (the behaviour you are describing), and "none" (OFF behaviour). The default value of this GUC would be "partial". Will that be fine?
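If we go that way, the GUC could be declared like any other enum GUC; a sketch with illustrative names only:

typedef enum
{
    ATOMIC_FOREIGN_XACT_NONE,       /* "none": never use 2PC (OFF) */
    ATOMIC_FOREIGN_XACT_PARTIAL,    /* "partial": prepare only on 2PC-capable servers */
    ATOMIC_FOREIGN_XACT_FULL        /* "full": error if any server lacks 2PC (ON) */
} AtomicForeignXactLevel;

static const struct config_enum_entry atomic_foreign_transaction_options[] = {
    {"none", ATOMIC_FOREIGN_XACT_NONE, false},
    {"partial", ATOMIC_FOREIGN_XACT_PARTIAL, false},
    {"full", ATOMIC_FOREIGN_XACT_FULL, false},
    {NULL, 0, false}
};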
About the table-level atomic commit attribute: I agree that some foreign tables might hold "more critical" data than others from the same server, but I am not sure that attribute alone should dictate atomicity. A transaction collectively might need to be "atomic" even if the individual tables it modified do not have the atomic_commit attribute set. So, we need a transaction-level attribute for atomicity, which may be overridden by a table-level attribute. Should we add support for the table-level atomicity setting in version 2+?
We should consider other possible designs as well; the choices we make
here may have a significant impact on usability.
I looked at other RDBMSes like IBM's federated database or Oracle. They support only the "full" behaviour as described above, with some optimizations like LRO. But I would like to hear about other options.
> Another GUC max_fdw_transactions sets the maximum number of transactions
> that can be simultaneously prepared on all the foreign servers. This limits
> the memory required for remembering the prepared foreign transactions.
How about max_prepared_foreign_transactions?
Done.
> Two new FDW hooks are introduced for transaction management.
> 1. GetPrepareId: to get the prepared transaction identifier for a given
> foreign server connection. An FDW which doesn't want to support this feature
> can keep this hook undefined (NULL). When defined the hook should return a
> unique identifier for the transaction prepared on the foreign server. The
> identifier should be unique enough not to conflict with currently prepared
> or future transactions. This point will be clear when discussing phase 2 of
> 2PC.
>
> 2. HandleForeignTransaction: to end a transaction in specified way. The hook
> should be able to prepare/commit/rollback current running transaction on
> given connection or commit/rollback a previously prepared transaction. This
> is described in detail while describing phase two of two-phase commit. The
> function is required to return a boolean status of whether the requested
> operation was successful or not. The function or its minions should not
> raise any error on failure so as not to interfere with the distributed
> transaction processing. This point will be clarified more in the description
> below.
HandleForeignTransaction is not very descriptive, and I think you're
jamming together things that ought to be separated. Let's have a
PrepareForeignTransaction and a ResolvePreparedForeignTransaction.
Done; there are three hooks now:
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation later)
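Roughly, the hook signatures could look like this; the names follow the description above, but the argument lists are my assumptions, not the patch's exact definitions:

typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
                                                    const char *prep_id);
typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
                                                            Oid userid,
                                                            const char *prep_id,
                                                            bool is_commit);
typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
                                                bool is_commit);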
> A foreign server, user mapping corresponding to an unresolved foreign
> transaction is not allowed to be dropped or altered until the foreign
> transaction is resolved. This is required to retain the connection
> properties which need to resolve the prepared transaction on the foreign
> server.
I agree with not letting it be dropped, but I think not letting it be
altered is a serious mistake. Suppose the foreign server dies in a
fire, its replica is promoted, and we need to re-point the master at
the replica's hostname or IP.
Done
IP might be fine, but consider altering the dbname option or dropping it; we won't find the prepared foreign transaction in the new database. I think we should at least warn the user that there exists a prepared foreign transaction on the given foreign server or user mapping; better even if we let the FDW decide which options are allowed to be altered while a foreign prepared transaction exists. The latter requires some surgery in the way we handle the options.
> Handling non-atomic foreign transactions
> ===============================
> When atomic_foreign_transaction is disabled, one-phase commit protocol is
> used to commit/rollback the foreign transactions. After the local
> transaction has committed/aborted, all the foreign transactions on the
> registered foreign connections are committed or aborted resp. using hook
> HandleForeignTransaction. Failing to commit a foreign transaction does not
> affect the other foreign transactions; they are still tried to be committed
> (if the local transaction commits).
Is this a change from the current behavior?
There is no current behaviour defined per se. Each FDW is free to add its own transaction callbacks, which can commit/rollback their respective transactions at pre-commit time or after the commit. postgres_fdw's callback tries to commit the foreign transactions on the PRE_COMMIT event and throws an error if that fails.
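A heavily simplified sketch of that existing callback (see pgfdw_xact_callback() in contrib/postgres_fdw/connection.c); subtransaction handling, cursor cleanup and connection resets are omitted:

static void
pgfdw_xact_callback_sketch(XactEvent event, void *arg)
{
    HASH_SEQ_STATUS scan;
    ConnCacheEntry *entry;

    hash_seq_init(&scan, ConnectionHash);
    while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)) != NULL)
    {
        if (entry->xact_depth == 0)
            continue;           /* no open remote transaction */

        switch (event)
        {
            case XACT_EVENT_PRE_COMMIT:
                /* an error here aborts the whole local transaction */
                do_sql_command(entry->conn, "COMMIT TRANSACTION");
                break;
            case XACT_EVENT_ABORT:
                do_sql_command(entry->conn, "ABORT TRANSACTION");
                break;
            default:
                break;
        }
        entry->xact_depth = 0;
    }
}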
What if we call the first
commit handler and it throws an ERROR? Presumably then nothing else
gets committed, and the transaction overall aborts.
In this case, the fate of the transaction depends upon the order in which the foreign transactions are committed, which in turn is the order in which they were started; this can give non-deterministic results. The patch tries to give it deterministic behaviour: commit whatever can be committed and abort the rest. This requires the EndForeignTransaction (HandleForeignTransaction in the earlier patch) hook not to raise an error, although I do not know how to prevent it from throwing one. We may try catching the error and not rethrowing it, but I haven't tried that.
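One possible shape for that, using the backend's PG_TRY() machinery; whether swallowing the error like this is safe in every context is exactly what I'm unsure about:

static bool
end_foreign_xact_guarded(void (*endfunc) (void *arg), void *arg)
{
    bool        success = true;

    PG_TRY();
    {
        endfunc(arg);
    }
    PG_CATCH();
    {
        /*
         * Log the failure but do not re-throw, so the remaining foreign
         * transactions still get a chance to be ended.
         */
        EmitErrorReport();
        FlushErrorState();
        success = false;
    }
    PG_END_TRY();

    return success;
}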
The same requirement goes for ResolvePreparedForeignTransaction(). If that hook throws an error, we end up with unresolved prepared transactions, which will be resolved only when the resolver kicks in.
> PITR
> ====
> PITR may rewind the database to a point before an xid associated with an
> unresolved foreign transaction. There are two approaches to deal with the
> situation.
> 1. Just forget about the unresolved foreign transaction and remove the file
> just like we do for a prepared local transaction. But then the prepared
> transaction on the foreign server might be left unresolved forever and will
> keep holding the resources.
> 2. Do not allow PITR to such point. We can not get rid of the transaction id
> without getting rid of prepared foreign transaction. If we do so, we might
> create conflicting files in future and might resolve the transaction with
> wrong outcome.
I don't think either of these is correct. The database shouldn't
behave differently when PITR is used than when it isn't. Otherwise
you are not doing what it says on the tin: recovering to the chosen
point in time. I recommend adding a function that forgets about a
foreign prepared transaction and making it the DBA's job to figure out
whether to call it in a particular scenario. After all, the remote
machine might have been subjected to PITR, too. Or maybe not. We
can't know, so we should give the DBA the tools to clean things up and
leave it at that.
I have added a built-in pg_fdw_remove() (or any suitable name), which removes the prepared foreign transaction entry from memory and disk. The function needs to be called before attempting PITR. If the recovery points to a past time without the file having been removed, we abort the recovery. In that case, a DBA can remove the foreign prepared transaction file manually before recovery. I have added a hint to that effect in the error message. Is that enough?
I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which resolve or remove an unresolved prepared foreign transaction respectively, effect changes which can not be rolled back if the transaction which ran these functions rolls back. These need to be converted into SQL commands like ROLLBACK PREPARED, which can't be run within a transaction.
> IIUC LRO, the patch uses the local transaction as last resource, which is
> always present. The fate of foreign transaction is decided by the fate of
> the local transaction, which is not required to be prepared per se. There
> is more relevant note later.
Personally, I think that's perfectly fine. We could do more later if
we wanted to, but there's plenty to like here without that.
Agreed.
>> Just to be clear: you also need two-phase commit if the transaction
>> updated anything in the local server and in even one foreign server.
>
> Any local transaction involving a foreign server transaction uses two-phase
> commit for the foreign transaction. The local transaction is not prepared
> per se. However, we should be able to optimize a case, when there are no
> local changes. I am not able to find a way to deduce that there was no local
> change, so I have left that case in this patch. Is there a way to know
> whether a local transaction changed something locally or not?
You might check whether it wrote any WAL. There's a global variable
for that somewhere; RecordTransactionCommit() uses it. But I don't
think this is an essential optimization for v1, either.
Agreed.
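For the record, a minimal sketch of such a test, assuming the global Robert refers to is XactLastRecEnd (which RecordTransactionCommit() consults):

/* true if the current transaction has written any WAL locally */
static bool
xact_wrote_locally(void)
{
    return XactLastRecEnd != 0;
}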
> I have used approach similar to pg_twophase, but implemented it as a
> separate code, as the requirements differ. But, I would like to minimize
> code by unifying both, if we finalise this design. Suggestions in this
> regard will be very helpful.
-1 for trying to unify those unless it's really clear that it's a good
idea. I bet it's not.
Fine.
>> Or you could insert/update the rows in the catalog with xmin=FrozenXid,
>> ignoring MVCC. Not sure how well that would work.
>
> I am not aware how to do that. Do we have any precedent in the code?
No. I bet that's also a bad idea. A non-transactional table is a
good idea that has been proposed before, but let's not try to invent
it in this patch.
Agreed.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
On Wed, Jul 29, 2015 at 6:58 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> A user may set atomic_foreign_transaction to ON to guarantee atomicity; IOW,
> it throws an error when atomicity can not be guaranteed. Thus if an
> application accidentally does something to a foreign server which doesn't
> support 2PC, the transaction would abort. A user may set it to OFF
> (consciously, taking responsibility for the result) so as not to use 2PC
> (probably to reduce the overheads) even if the foreign server is 2PC
> compliant. So, I thought a GUC would be necessary. We can incorporate the
> behaviour you are suggesting by having atomic_foreign_transaction accept
> three values: "full" (ON behaviour), "partial" (the behaviour you are
> describing), and "none" (OFF behaviour). The default value of this GUC would
> be "partial". Will that be fine?

I don't really see the point. If the user attempts a distributed transaction involving FDWs that can't support atomic foreign transactions, then I think it's reasonable to assume that they want that to work rather than arbitrarily fail. The only situation in which it's desirable for that to fail is when the user doesn't realize that the FDW in question doesn't support atomic foreign commit, and the error message warns them that their assumptions are unfounded. But can't the user find that out easily enough by reading the documentation? So I think that in practice the "full" value of this GUC would get almost zero use; I think that nearly everyone will be happy with what you are here calling "partial" or "none". I'll defer to any other consensus that emerges, but that's my view.

I think that we should not change the default behavior. Currently, the only behavior is not to attempt 2PC. Let's stick with that.

> About the table-level atomic commit attribute: I agree that some foreign
> tables might hold "more critical" data than others from the same server, but
> I am not sure that attribute alone should dictate atomicity. A transaction
> collectively might need to be "atomic" even if the individual tables it
> modified do not have the atomic_commit attribute set. So, we need a
> transaction-level attribute for atomicity, which may be overridden by a
> table-level attribute. Should we add support for the table-level atomicity
> setting in version 2+?

I'm not hung up on the table-level attribute, but I think having a server-level attribute rather than a global GUC is a good idea. However, I welcome other thoughts on that.

>> We should consider other possible designs as well; the choices we make
>> here may have a significant impact on usability.
>
> I looked at other RDBMSes like IBM's federated database or Oracle. They
> support only the "full" behaviour as described above, with some
> optimizations like LRO. But I would like to hear about other options.

Yes, I hope others will weigh in.

>> HandleForeignTransaction is not very descriptive, and I think you're
>> jamming together things that ought to be separated. Let's have a
>> PrepareForeignTransaction and a ResolvePreparedForeignTransaction.
>
> Done, there are three hooks now
> 1. For preparing a foreign transaction
> 2. For resolving a prepared foreign transaction
> 3. For committing/aborting a running foreign transaction (more explanation
> later)

(2) and (3) seem like the same thing. I don't see any further explanation later in your email; what am I missing?

> IP might be fine, but consider altering the dbname option or dropping it; we
> won't find the prepared foreign transaction in the new database.

Probably not, but I think that's the DBA's problem, not ours.

> I think we should at least warn the user that there exists a prepared
> foreign transaction on the given foreign server or user mapping; better even
> if we let the FDW decide which options are allowed to be altered while a
> foreign prepared transaction exists. The latter requires some surgery in the
> way we handle the options.

We can consider that, but I don't think it's an essential part of the patch, and I'd punt it for now in the interest of keeping this as simple as possible.

>> Is this a change from the current behavior?
>
> There is no current behaviour defined per se.

My point is that you had some language in the email describing what happens if the GUC is turned off. You shouldn't have to describe that, because there should be absolutely zero difference. If there isn't, that's a problem for this patch, and probably a subject for a different one.

> I have added a built-in pg_fdw_remove() (or any suitable name), which
> removes the prepared foreign transaction entry from memory and disk. The
> function needs to be called before attempting PITR. If the recovery points
> to a past time without the file having been removed, we abort the recovery.
> In that case, a DBA can remove the foreign prepared transaction file
> manually before recovery. I have added a hint to that effect in the error
> message. Is that enough?

That seems totally broken. Before PITR, the database might be inconsistent, in which case you can't call any functions at all. Also, you shouldn't be trying to resolve any transactions until the end of recovery, because you don't know when you see that the transaction was prepared whether, at some subsequent time, you will see it resolved. You need to finish recovery and, only after entering normal running, decide whether to resolve any transactions that are still sitting around. There should be no situation (short of e.g. OS errors writing the state files) where this stuff makes recovery fail.

> I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which
> resolve or remove an unresolved prepared foreign transaction respectively,
> effect changes which can not be rolled back if the transaction which ran
> these functions rolls back. These need to be converted into SQL commands
> like ROLLBACK PREPARED, which can't be run within a transaction.

Yeah, maybe. I'm not sure using a functional interface is all that bad, but we could think about changing it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 30, 2015 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Jul 29, 2015 at 6:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> A user may set atomic_foreign_transaction to ON to guarantee atomicity; IOW,
> it throws an error when atomicity can not be guaranteed. Thus if an
> application accidentally does something to a foreign server which doesn't
> support 2PC, the transaction would abort. A user may set it to OFF
> (consciously, taking responsibility for the result) so as not to use 2PC
> (probably to reduce the overheads) even if the foreign server is 2PC
> compliant. So, I thought a GUC would be necessary. We can incorporate the
> behaviour you are suggesting by having atomic_foreign_transaction accept
> three values: "full" (ON behaviour), "partial" (the behaviour you are
> describing), and "none" (OFF behaviour). The default value of this GUC
> would be "partial". Will that be fine?
I don't really see the point. If the user attempts a distributed
transaction involving FDWs that can't support atomic foreign
transactions, then I think it's reasonable to assume that they want
that to work rather than arbitrarily fail. The only situation in
which it's desirable for that to fail is when the user doesn't realize
that the FDW in question doesn't support atomic foreign commit, and
the error message warns them that their assumptions are unfounded.
But can't the user find that out easily enough by reading the
documentation? So I think that in practice the "full" value of this
GUC would get almost zero use; I think that nearly everyone will be
happy with what you are here calling "partial" or "none". I'll defer
to any other consensus that emerges, but that's my view.
I think that we should not change the default behavior. Currently,
the only behavior is not to attempt 2PC. Let's stick with that.
Ok. I will remove the GUC and have the "partial atomic" behaviour you suggested in the previous mail.
> About the table-level atomic commit attribute: I agree that some foreign
> tables might hold "more critical" data than others from the same server,
> but I am not sure that attribute alone should dictate atomicity. A
> transaction collectively might need to be "atomic" even if the individual
> tables it modified do not have the atomic_commit attribute set. So, we need
> a transaction-level attribute for atomicity, which may be overridden by a
> table-level attribute. Should we add support for the table-level atomicity
> setting in version 2+?
I'm not hung up on the table-level attribute, but I think having a
server-level attribute rather than a global GUC is a good idea.
However, I welcome other thoughts on that.
The patch supports a server-level attribute. Let me repeat the relevant description from my earlier mail:
--
Every FDW needs to register the connection while starting new transaction on a foreign connection (RegisterXactForeignServer()). A foreign server connection is identified by foreign server oid and the local user oid (similar to the entry cached by postgres_fdw). While registering, FDW also tells whether the foreign server is capable of participating in two-phase commit protocol. How to decide that is left entirely to the FDW. An FDW like file_fdw may not have 2PC support at all, so all its foreign servers do not comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW like postgres_fdw can have some of its servers compliant and some not, depending upon server version, configuration (max_prepared_transactions = 0) etc.
--
Does that look good?
>> We should consider other possible designs as well; the choices we make
>> here may have a significant impact on usability.
>
> I looked at other RDBMSes like IBM's federated database or Oracle. They
> support only the "full" behaviour as described above, with some
> optimizations like LRO. But I would like to hear about other options.
Yes, I hope others will weigh in.
>> HandleForeignTransaction is not very descriptive, and I think you're
>> jamming together things that ought to be separated. Let's have a
>> PrepareForeignTransaction and a ResolvePreparedForeignTransaction.
>
> Done, there are three hooks now
> 1. For preparing a foreign transaction
> 2. For resolving a prepared foreign transaction
> 3. For committing/aborting a running foreign transaction (more explanation
> later)
(2) and (3) seem like the same thing. I don't see any further
explanation later in your email; what am I missing?
In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (filling in the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION (notice the absence of PREPARED and 'xyz'). We might want to combine them into a single hook, but there are slight differences depending upon the FDW. For postgres_fdw, 2 should get a connection which does not have a running transaction, whereas for 3 there has to be a running transaction on that connection. Hook 2 should get the prepared foreign transaction identifier as one of its arguments; hook 3 will not have that argument. Hook 2 is relevant for the two-phase commit protocol, whereas 3 is used for connections not using two-phase commit.
The differences are much more visible in the code.
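To illustrate, roughly what postgres_fdw would send in each case; the command construction below is illustrative, not the patch's actual code:

static void
resolve_or_end_remote_xact(ConnCacheEntry *entry, bool is_prepared,
                           bool is_commit, const char *prep_id)
{
    char        sql[256];

    if (is_prepared)            /* hook 2: resolve a prepared transaction */
        snprintf(sql, sizeof(sql), "%s PREPARED '%s'",
                 is_commit ? "COMMIT" : "ROLLBACK", prep_id);
    else                        /* hook 3: end a still-running transaction */
        snprintf(sql, sizeof(sql), "%s TRANSACTION",
                 is_commit ? "COMMIT" : "ABORT");

    do_sql_command(entry->conn, sql);
}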
> IP might be fine, but consider altering the dbname option or dropping it; we
> won't find the prepared foreign transaction in the new database.
Probably not, but I think that's the DBA's problem, not ours.
Fine.
> I think we
> should at least warn the user that there exists a prepared foreign
> transaction on given foreign server or user mapping; better even if we let
> FDW decide which options are allowed to be altered when there exists a
> foreign prepared transaction. The latter requires some surgery in the way we
> handle the options.
We can consider that, but I don't think it's an essential part of the
patch, and I'd punt it for now in the interest of keeping this as
simple as possible.
Fine.
>> Is this a change from the current behavior?
>
> There is no current behaviour defined per se.
My point is that you had some language in the email describing what
happens if the GUC is turned off. You shouldn't have to describe
that, because there should be absolutely zero difference. If there
isn't, that's a problem for this patch, and probably a subject for a
different one.
Ok got it.
> I have added a built-in pg_fdw_remove() (or any suitable name), which
> removes the prepared foreign transaction entry from the memory and disk. The
> function needs to be called before attempting PITR. If the recovery points
> to a past time without removing the file, we abort the recovery. In such a case, a
> DBA can remove the foreign prepared transaction file manually before
> recovery. I have added a hint with that effect in the error message. Is that
> enough?
That seems totally broken. Before PITR, the database might be
inconsistent, in which case you can't call any functions at all.
Also, you shouldn't be trying to resolve any transactions until the
end of recovery, because you don't know when you see that the
transaction was prepared whether, at some subsequent time, you will
see it resolved. You need to finish recovery and, only after entering
normal running, decide whether to resolve any transactions that are
still sitting around.
That's how it works in the patch for unresolved prepared foreign transactions belonging to xids within the known range. For those belonging to xids in the future (beyond the known range of xids after PITR), we can not determine the status of the local transaction (as it does not appear in the xlog) and hence can not decide the fate of the prepared foreign transaction. You seem to be suggesting that we should let the recovery finish and mark those prepared foreign transactions as "can not be resolved" or something like that. A DBA can remove those entries once s/he has dealt with them on the foreign server.
There's a problem with that approach. The triplet (xid, serverid, userid) is used to identify a foreign prepared transaction entry in memory and to create a unique file name for storing it on disk. If we allow a future xid after PITR, it might conflict with the xid of a transaction that takes place after PITR. That will cause a problem if exactly the same foreign server and user participate in the transaction with the conflicting xid (rare but possible).
The other problem is that the foreign server on which the transaction was prepared (or the user whose mapping was used to prepare the transaction) might have been added at a future time wrt PITR, in which case we can not even know which foreign server this transaction was prepared on.
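To spell out the collision risk with a hypothetical naming scheme along the lines described (the directory and format here are made up for illustration):

static void
fdw_xact_file_path(char *path, TransactionId xid, Oid serverid, Oid userid)
{
    /*
     * A "future" xid surviving PITR can later collide with a new
     * transaction that reuses the same (xid, serverid, userid) triplet.
     */
    snprintf(path, MAXPGPATH, "pg_fdw_xact/%08X_%u_%u",
             xid, serverid, userid);
}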
There should be no situation (short of e.g. OS
errors writing the state files) where this stuff makes recovery fail.
During PITR, if we encounter a prepared (local) transaction with a future xid, we just forget that prepared transaction (instead of failing recovery). Maybe we should do the same for unresolved foreign prepared transactions as well (at least for version 1): forget the unresolved prepared foreign transactions which belong to a future xid. Anyway, as per the timeline after PITR, those never existed.
Other DBMSes solve this problem by using markers. Markers are allowed to be set at times when there were no unresolved foreign transactions, and PITR is allowed up to those markers and not to any arbitrary point in time. But this looks out of scope for this patch.
> I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which
> resolve or remove an unresolved prepared foreign transaction respectively,
> effect changes which can not be rolled back if the transaction which ran
> these functions rolls back. These need to be converted into SQL commands
> like ROLLBACK PREPARED, which can't be run within a transaction.
Yeah, maybe. I'm not sure using a functional interface is all that
bad, but we could think about changing it.
Fine.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>> I'm not hung up on the table-level attribute, but I think having a
>> server-level attribute rather than a global GUC is a good idea.
>> However, I welcome other thoughts on that.
>
> The patch supports a server-level attribute. Let me repeat the relevant
> description from my earlier mail
> --
> Every FDW needs to register the connection while starting new transaction on
> a foreign connection (RegisterXactForeignServer()). A foreign server
> connection is identified by foreign server oid and the local user oid
> (similar to the entry cached by postgres_fdw). While registering, FDW also
> tells whether the foreign server is capable of participating in two-phase
> commit protocol. How to decide that is left entirely to the FDW. An FDW like
> file_fdw may not have 2PC support at all, so all its foreign servers do not
> comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW
> like postgres_fdw can have some of its servers compliant and some not,
> depending upon server version, configuration (max_prepared_transactions = 0)
> etc.
> --
>
> Does that look good?

OK, sure. But let's make sure postgres_fdw gets a server-level option to control this.

>> > Done, there are three hooks now
>> > 1. For preparing a foreign transaction
>> > 2. For resolving a prepared foreign transaction
>> > 3. For committing/aborting a running foreign transaction (more
>> > explanation
>> > later)
>>
>> (2) and (3) seem like the same thing. I don't see any further
>> explanation later in your email; what am I missing?
>
> In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill
> the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION
> (notice absence of PREPARED and 'xyz').

Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback?

>> That seems totally broken. Before PITR, the database might be
>> inconsistent, in which case you can't call any functions at all.
>> Also, you shouldn't be trying to resolve any transactions until the
>> end of recovery, because you don't know when you see that the
>> transaction was prepared whether, at some subsequent time, you will
>> see it resolved. You need to finish recovery and, only after entering
>> normal running, decide whether to resolve any transactions that are
>> still sitting around.
>
> That's how it works in the patch for unresolved prepared foreign
> transactions belonging to xids within the known range. For those belonging
> to xids in the future (beyond the known range of xids after PITR), we can
> not determine the status of the local transaction (as it does not appear in
> the xlog) and hence can not decide the fate of the prepared foreign
> transaction. You seem to be suggesting that we should let the recovery
> finish and mark those prepared foreign transactions as "can not be resolved"
> or something like that. A DBA can remove those entries once s/he has dealt
> with them on the foreign server.
>
> There's a problem with that approach. The triplet (xid, serverid, userid)
> is used to identify a foreign prepared transaction entry in memory and to
> create a unique file name for storing it on disk. If we allow a future xid
> after PITR, it might conflict with the xid of a transaction that takes
> place after PITR. That will cause a problem if exactly the same foreign
> server and user participate in the transaction with the conflicting xid
> (rare but possible).
>
> The other problem is that the foreign server on which the transaction was
> prepared (or the user whose mapping was used to prepare the transaction)
> might have been added at a future time wrt PITR, in which case we can not
> even know which foreign server this transaction was prepared on.
>
>> There should be no situation (short of e.g. OS
>> errors writing the state files) where this stuff makes recovery fail.
>
> During PITR, if we encounter a prepared (local) transaction with a future
> xid, we just forget that prepared transaction (instead of failing recovery).
> Maybe we should do the same for unresolved foreign prepared transactions as
> well (at least for version 1): forget the unresolved prepared foreign
> transactions which belong to a future xid. Anyway, as per the timeline after
> PITR, those never existed.

This last sentence seems to me to be exactly on point. Note the comment in twophase.c:

 * We throw away any prepared xacts with main XID beyond nextXid --- if any
 * are present, it suggests that the DBA has done a PITR recovery to an
 * earlier point in time without cleaning out pg_twophase. We dare not
 * try to recover such prepared xacts since they likely depend on database
 * state that doesn't exist now.

In other words, normally there should never be any XIDs "from the future" with prepared transactions; but in certain PITR scenarios it might be possible. We might as well be consistent with what the existing 2PC code does in this case - i.e. just warn and then remove the files.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 17, 2015 at 2:56 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
2. New catalog - This method takes out the need to have separate methods for C1, C5 and even C2; synchronization will be taken care of by row locks, and there will be no limit on the number of foreign transactions or on the size of the foreign prepared transaction information. But a big problem with this approach is that changes to the catalogs are atomic with the local transaction. If a foreign prepared transaction can not be aborted while the local transaction is rolled back, that entry needs to be retained. But since the local transaction is aborting, the corresponding catalog entry would become invisible and thus unavailable to the resolver (alas! we do not have autonomous transaction support). We may be able to overcome this by simulating an autonomous transaction through a background worker (which could also act as a resolver). But the amount of communication and synchronization might affect performance.
For Rollback, why can't we do it the reverse way: first roll back the
transactions on the foreign servers and then roll back the local transaction?
I think for Commit, it is essential that we first commit in the local
server, so that we can resolve the transaction status of prepared
transactions on foreign servers after crash recovery. However, for the
Abort case, even if we don't roll back in the local server first, the
outcome can be deduced during crash recovery (any transaction which is
not committed should be rolled back) for the purpose of resolving the
status of prepared transactions.
On Thu, Jul 9, 2015 at 3:48 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
2. New catalog - This method takes out the need to have separate method for
C1, C5 and even C2, also the synchronization will be taken care of by row
locks, there will be no limit on the number of foreign transactions as well
as the size of foreign prepared transaction information. But big problem
with this approach is that, the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction can not be aborted
while the local transaction is rolled back, that entry needs to be retained.
But since the local transaction is aborting the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this,
by simulating autonomous transaction through a background worker (which can
also act as a resolver). But the amount of communication and
synchronization, might affect the performance.
Or you could insert/update the rows in the catalog with xmin=FrozenXid, ignoring MVCC. Not sure how well that would work.
I am not aware how to do that. Do we have any precedent in the code? Something like a reference implementation that I can follow.
Can something along the lines of COPY FREEZE help here?
However, if you are going to follow this method, then I think you
also need to ensure when and how to clear those rows after the
rollback is complete, or once the resolver has resolved those prepared
foreign transactions.
On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>> I'm not hung up on the table-level attribute, but I think having a
>> server-level attribute rather than a global GUC is a good idea.
>> However, I welcome other thoughts on that.
>
> The patch supports server level attribute. Let me repeat the relevant
> description from my earlier mail
> --
> Every FDW needs to register the connection while starting new transaction on
> a foreign connection (RegisterXactForeignServer()). A foreign server
> connection is identified by foreign server oid and the local user oid
> (similar to the entry cached by postgres_fdw). While registering, FDW also
> tells whether the foreign server is capable of participating in two-phase
> commit protocol. How to decide that is left entirely to the FDW. An FDW like
> file_fdw may not have 2PC support at all, so all its foreign servers do not
> comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW
> like postgres_fdw can have some of its servers compliant and some not,
> depending upon server version, configuration (max_prepared_transactions = 0)
> etc.
> --
>
> Does that look good?
OK, sure. But let's make sure postgres_fdw gets a server-level option
to control this.
For postgres_fdw it's a boolean server-level option, 'twophase_compliant' (suggestions for the name welcome).
>> > Done, there are three hooks now
>> > 1. For preparing a foreign transaction
>> > 2. For resolving a prepared foreign transaction
>> > 3. For committing/aborting a running foreign transaction (more
>> > explanation
>> > later)
>>
>> (2) and (3) seem like the same thing. I don't see any further
>> explanation later in your email; what am I missing?
>
> In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill
> the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION
> (notice absence of PREPARED and 'xyz').
Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback?
While transactions are being prepared on the foreign connections, if any prepare fails, we have to abort the transactions on the rest of the connections (and abort the already-prepared transactions). pgfdw_xact_callback wouldn't know which connections have prepared transactions and which do not. So, even in the case of two-phase commit, we need all three hooks. Since we have to define these three hooks, we might as well centralize all the transaction processing and let the foreign transaction manager decide which of the hooks to invoke. So, the patch moves most of the code in pgfdw_xact_callback into the relevant hooks, and the foreign transaction manager invokes the appropriate hook. The only thing that remains in pgfdw_xact_callback now is end-of-transaction handling like resetting the cursor numbering.
>> That seems totally broken. Before PITR, the database might be
>> inconsistent, in which case you can't call any functions at all.
>> Also, you shouldn't be trying to resolve any transactions until the
>> end of recovery, because you don't know when you see that the
>> transaction was prepared whether, at some subsequent time, you will
>> see it resolved. You need to finish recovery and, only after entering
>> normal running, decide whether to resolve any transactions that are
>> still sitting around.
>
> That's how it works in the patch for unresolved prepared foreign
> transactions belonging to xids within the known range. For those belonging
> to xids in the future (beyond the known range of xids after PITR), we can not
> determine the status of that local transaction (as those do not appear in
> the xlog) and hence can not decide the fate of prepared foreign transaction.
> You seem to be suggesting that we should let the recovery finish and mark
> those prepared foreign transaction as "can not be resolved" or something
> like that. A DBA can remove those entries once s/he has dealt with them on
> the foreign server.
>
> There's a problem with that approach. The triplet (xid, serverid, userid)
> is used to identify a foreign prepared transaction entry in memory and
> is used to create unique file name for storing it on the disk. If we allow a
> future xid after PITR, it might conflict with an xid of a transaction that
> might take place after PITR. It will cause a problem if exactly the same foreign
> server and user participate in the transaction with conflicting xid (rare
> but possible).
>
> The other problem is that the foreign server on which the transaction was
> prepared (or the user whose mapping was used to prepare the transaction),
> might have got added in a future time wrt PITR, in which case, we can not
> even know which foreign server this transaction was prepared on.
>
>> There should be no situation (short of e.g. OS
>> errors writing the state files) where this stuff makes recovery fail.
>
> During PITR, if we encounter a prepared (local) transaction with a future
> xid, we just forget that prepared transaction (instead of failing recovery).
> Maybe we should do the same for unresolved foreign prepared transactions as
> well (at least for version 1); forget the unresolved prepared foreign
> transactions which belong to a future xid. Anyway, as per the timeline after
> PITR those never existed.
This last sentence seems to me to be exactly on point. Note the comment in twophase.c:
* We throw away any prepared xacts with main XID beyond nextXid --- if any
* are present, it suggests that the DBA has done a PITR recovery to an
* earlier point in time without cleaning out pg_twophase. We dare not
* try to recover such prepared xacts since they likely depend on database
* state that doesn't exist now.
In other words, normally there should never be any XIDs "from the
future" with prepared transactions; but in certain PITR scenarios it
might be possible. We might as well be consistent with what the
existing 2PC code does in this case - i.e. just warn and then remove
the files.
Ok. Done.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
> On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> OK, sure. But let's make sure postgres_fdw gets a server-level option
>> to control this.
>>
> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
> (suggestions for name welcome).

How about just 'twophase'?

Thanks,
Amit
On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>> On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> OK, sure. But let's make sure postgres_fdw gets a server-level option
>>> to control this.
>>>
>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>> (suggestions for name welcome).
>
> How about just 'twophase'?

How about two_phase_commit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-08-05 AM 06:11, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>>> (suggestions for name welcome).
>>
>> How about just 'twophase'?
>
> How about two_phase_commit?

Much cleaner, +1

Thanks,
Amit
On Wed, Aug 5, 2015 at 6:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-05 AM 06:11, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>>> (suggestions for name welcome).
>>>
>>
>> How about just 'twophase'?
>
> How about two_phase_commit?
>
Much cleaner, +1
I was more inclined to use an adjective, since it's a property of the server, rather than a noun. But two_phase_commit looks fine as well; I have included it in the attached patch.
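For the archives, with a boolean server-level option the setup would look roughly like this (a sketch; the option name follows the naming discussed above, and the exact syntax accepted by the patch may differ):

-- Enable two-phase commit when defining the foreign server
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'postgres',
             two_phase_commit 'true');

-- Or switch it on for an existing server
ALTER SERVER shard1 OPTIONS (ADD two_phase_commit 'true');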
The attached patch addresses all the concerns and suggestions from previous mails in this thread.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
The previous patch would not compile on the latest HEAD. Here's an updated patch.
On Tue, Aug 11, 2015 at 1:55 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Aug 5, 2015 at 6:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
On 2015-08-05 AM 06:11, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>>> (suggestions for name welcome).
>>>
>>
>> How about just 'twophase'?
>
> How about two_phase_commit?
>
Much cleaner, +1

I was more inclined to use an adjective, since it's a property of the server, rather than a noun. But two_phase_commit looks fine as well; I have included it in the attached patch.

The attached patch addresses all the concerns and suggestions from previous mails in this thread.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> The previous patch would not compile on the latest HEAD. Here's updated
> patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have bigger things to worry about.

The recent eXtensible Transaction Manager and the slides shared at the Vienna sharding summit, now posted at https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make me think that some careful thought is needed here about what we want and how it should work. Slide 10 proposes a method for the extensible transaction manager API to interact with FDWs. The FDW would do this:

select dtm_join_transaction(xid);
begin transaction;
update...;
commit;

I think the idea here is that the commit command doesn't really commit; it just escapes the distributed transaction while leaving it marked not-committed. When the transaction subsequently commits on the local server, the XID is marked committed and the effects of the transaction become visible on all nodes.

I think that this API is intended to provide not only consistent cross-node decisions about whether a particular transaction has committed, but also consistent visibility. If the API is sufficient for that and if it can be made sufficiently performant, that's a strictly stronger guarantee than what this proposal would provide.

On the other hand, I see a couple of problems:

1. The extensible transaction manager API is meant to be pluggable. Depending on which XTM module you choose to load, the SQL that needs to be executed by postgres_fdw on the remote node will vary. postgres_fdw shouldn't have knowledge of all the possible XTMs out there, so it would need some way to know what SQL to send.

2. If the remote server isn't running the same XTM as the local server, or if it is running the same XTM but is not part of the same group of cooperating nodes as the local server, then we can't send a command to join the distributed transaction at all. In that case, the 2PC for FDW approach is still, maybe, useful.

On the whole, I'm inclined to think that the XTM-based approach is probably more useful and more general, if we can work out the problems with it. I'm not sure that I'm right, though, nor am I sure how hard it will be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
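Spelling out my reading of that slide-10 flow end to end (a sketch only: dtm_join_transaction() is from the slides, while txid_current(), the XID value and the table are just illustrative):

-- On the local (coordinating) server:
BEGIN;
SELECT txid_current();                     -- suppose this returns 12345
UPDATE accounts SET balance = balance - 100 WHERE id = 1;

-- On the remote server, over the FDW connection:
SELECT dtm_join_transaction(12345);        -- join the distributed transaction
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;                                    -- "escapes", but the XID stays in doubt

-- Back on the local server:
COMMIT;  -- marks XID 12345 committed; the remote changes become visible too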
On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
> > The previous patch would not compile on the latest HEAD. Here's updated
> > patch.
>
> Perhaps unsurprisingly, this doesn't apply any more. But we have
> bigger things to worry about.
>
> The recent eXtensible Transaction Manager and the slides shared at the
> Vienna sharding summit, now posted at
> https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
> me think that some careful thought is needed here about what we want
> and how it should work. Slide 10 proposes a method for the extensible
> transaction manager API to interact with FDWs. The FDW would do this:
>
> select dtm_join_transaction(xid);
> begin transaction;
> update...;
> commit;
>
> I think the idea here is that the commit command doesn't really
> commit; it just escapes the distributed transaction while leaving it
> marked not-committed. When the transaction subsequently commits on
> the local server, the XID is marked committed and the effects of the
> transaction become visible on all nodes.
>
As per my reading of the slides shared by you, the commit in the above
context would send a message to the Arbiter indicating its vote for
being ready to commit, and when the Arbiter gets the votes from all
nodes participating in the transaction, it sends back an OK message
(this is what I could understand from slides 12 and 13). I think on
receiving the OK message each node will mark the transaction as
committed.
> I think that this API is intended to provide not only consistent
> cross-node decisions about whether a particular transaction has
> committed, but also consistent visibility. If the API is sufficient
> for that and if it can be made sufficiently performant, that's a
> strictly stronger guarantee than what this proposal would provide.
>
>
>
> On the whole, I'm inclined to think that the XTM-based approach is
> probably more useful and more general, if we can work out the problems
> with it. I'm not sure that I'm right, though, nor am I sure how hard
> it will be.
>
If I understood correctly, then the main difference between the 2PC idea
used in this patch (considering we find some way of sharing snapshots in
this approach) and what is shared in the slides is that the XTM-based
approach relies on an external entity, which it refers to as the Arbiter,
for performing consistent transaction commit/abort and sharing snapshots
across all the nodes, whereas in the approach in this patch, the
transaction originator (or we can call it the coordinator) is responsible
for consistent transaction commit/abort.

I think the plus point of the XTM-based approach is that it provides a way
of sharing snapshots, but I think we still need to evaluate the
communication overhead of these methods. As far as I can see, in the
Arbiter-based approach the Arbiter could become a single point of
contention for coordinating messages for all the transactions in the
system, whereas if we extend this approach such contention could be
avoided. Now it is very well possible that the number of messages shared
between nodes in the Arbiter-based approach is smaller, but contention
could still play a major role.

Another important point which needs some more thought before concluding on
any approach is the detection of deadlocks between different nodes. In the
slides shared by you there is no discussion of deadlocks, so it is not
clear whether it will work as-is without any modification, or whether we
need modifications and a deadlock detection system, and if so, how that
will be achieved.
On Sat, Nov 7, 2015 at 12:52 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On the whole, I'm inclined to think that the XTM-based approach is
> > probably more useful and more general, if we can work out the problems
> > with it. I'm not sure that I'm right, though, nor am I sure how hard
> > it will be.
> >
>
> If I understood correctly, then the main difference between 2PC idea
> used in this patch (considering we find some way of sharing snapshots
> in this approach) and what is shared in slides is that XTM-based
> approach
>
Read it as DTM-based approach.
On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The previous patch would not compile on the latest HEAD. Here's updated
> patch.
Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.
The recent eXtensible Transaction Manager and the slides shared at the
Vienna sharding summit, now posted at
https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
me think that some careful thought is needed here about what we want
and how it should work. Slide 10 proposes a method for the extensible
transaction manager API to interact with FDWs. The FDW would do this:
select dtm_join_transaction(xid);
begin transaction;
update...;
commit;
I think the idea here is that the commit command doesn't really
commit; it just escapes the distributed transaction while leaving it
marked not-committed. When the transaction subsequently commits on
the local server, the XID is marked committed and the effects of the
transaction become visible on all nodes.
Since the foreign server (referred to in the slides as the secondary server) is required to run "create extension pg_dtm" and "select dtm_join_transaction(xid);", I assume that the foreign server has to be a PostgreSQL server, one which has this extension installed and is of a version that can support this extension. So we cannot use the extension for all FDWs, and even for postgres_fdw it can be used only for a foreign server with the above capabilities. The slides mention just FDW, but I think they mean postgres_fdw and not all FDWs.
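So, as a minimal sketch of the prerequisites (assuming the extension is indeed packaged as pg_dtm, which is my reading of the slides), every participating foreign server would need:

-- On each participating foreign server (must be PostgreSQL with the
-- DTM extension available):
CREATE EXTENSION pg_dtm;

-- After which, per distributed transaction, the coordinator-side FDW
-- connection would run:
SELECT dtm_join_transaction(12345);   -- 12345: the coordinator's XID

A server behind file_fdw or mysql_fdw has nowhere to run either statement, which is why this cannot be a generic FDW mechanism.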
I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.
On the other hand, I see a couple of problems:
1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.
2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.
Elaborating more on this: Slide 11 shows the arbiter protocol to start a transaction, and the next slide shows the same for commit. Slide 15 shows the transaction flow diagram for tsDTM. The DTM approach doesn't specify how XIDs are communicated between nodes, but it's implicit in the protocol that the XID space is shared by the nodes. Similarly, tsDTM assumes that the CSN space is shared by all the nodes (see the synchronization for max(CSN)). This cannot be assumed for FDWs (not even postgres_fdw), where foreign servers are independent entities with independent XID spaces.
On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.
2PC for FDW and XTM are trying to solve different problems with some commonality. 2PC for FDW is trying to solve the problem of atomic commit (I am borrowing the terminology you used at PGCon 2015) for FDWs in general (although limited to FDWs which can support two-phase commit), while XTM tries to solve the problems of atomic visibility, atomic commit and consistency for postgres_fdw where the foreign servers support XTM. The only thing common between these two is atomic commit.
If we accept XTM and discard 2PC for FDW, we will not be able to support atomic commit for FDWs in general. That, I think, would be a serious limitation for Postgres FDW, especially now that DMLs are allowed. If we accept only 2PC for FDW and discard XTM, we won't get atomic visibility and consistency for postgres_fdw with foreign servers supporting XTM. That again would be a serious limitation for solutions implementing sharding, multi-master clusters etc.
There are approaches like [1] by which a cluster of heterogeneous servers (with some level of snapshot isolation) can be constructed. Ideally that will enable PostgreSQL users to maximize their utilization of FDWs.
Any distributed transaction management requires 2PC in some form or other. So we should implement 2PC for FDW keeping in mind the various forms of 2PC used practically, and use that infrastructure to build XTM-like capabilities for restricted postgres_fdw uses. Previously, I requested the authors of XTM to look at my patch and provide feedback about their requirements for implementing the 2PC part of XTM, but I have not heard anything from them.
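To spell out what "2PC in some form or other" means on the wire, here is the canonical happy path the coordinator would drive (a sketch; the GID format is illustrative):

-- Phase 1: ask every participating foreign server to prepare
PREPARE TRANSACTION 'fdw_xact_100_1_10';    -- on foreign server 1
PREPARE TRANSACTION 'fdw_xact_100_2_10';    -- on foreign server 2

-- The coordinator commits locally, durably recording the decision.

-- Phase 2: propagate the decision to every foreign server
COMMIT PREPARED 'fdw_xact_100_1_10';        -- on foreign server 1
COMMIT PREPARED 'fdw_xact_100_2_10';        -- on foreign server 2

Whatever XTM-like layer is built on top, someone has to issue these commands (or their equivalents), and that is the infrastructure this patch provides.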
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 06.11.2015 21:37, Robert Haas wrote:
> On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> The previous patch would not compile on the latest HEAD. Here's updated
>> patch.
>
> Perhaps unsurprisingly, this doesn't apply any more. But we have
> bigger things to worry about.
>
> The recent eXtensible Transaction Manager and the slides shared at the
> Vienna sharding summit, now posted at
> https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
> me think that some careful thought is needed here about what we want
> and how it should work. Slide 10 proposes a method for the extensible
> transaction manager API to interact with FDWs. The FDW would do this:
>
> select dtm_join_transaction(xid);
> begin transaction;
> update...;
> commit;
>
> I think the idea here is that the commit command doesn't really
> commit; it just escapes the distributed transaction while leaving it
> marked not-committed. When the transaction subsequently commits on
> the local server, the XID is marked committed and the effects of the
> transaction become visible on all nodes.
>
> I think that this API is intended to provide not only consistent
> cross-node decisions about whether a particular transaction has
> committed, but also consistent visibility. If the API is sufficient
> for that and if it can be made sufficiently performant, that's a
> strictly stronger guarantee than what this proposal would provide.
>
> On the other hand, I see a couple of problems:
>
> 1. The extensible transaction manager API is meant to be pluggable.
> Depending on which XTM module you choose to load, the SQL that needs
> to be executed by postgres_fdw on the remote node will vary.
> postgres_fdw shouldn't have knowledge of all the possible XTMs out
> there, so it would need some way to know what SQL to send.
>
> 2. If the remote server isn't running the same XTM as the local
> server, or if it is running the same XTM but is not part of the same
> group of cooperating nodes as the local server, then we can't send a
> command to join the distributed transaction at all. In that case, the
> 2PC for FDW approach is still, maybe, useful.
>
> On the whole, I'm inclined to think that the XTM-based approach is
> probably more useful and more general, if we can work out the problems
> with it. I'm not sure that I'm right, though, nor am I sure how hard
> it will be.

Sorry, but we have so far considered only the case of a homogeneous environment: all cluster instances running PostgreSQL with the same XTM implementation. I can imagine situations where it may be useful to coordinate transaction processing in a heterogeneous cluster, but it seems to be quite an exotic use case. Combining several different databases in one cluster can be explained by historical reasons or the specifics of a particular system architecture, but I cannot imagine any reason for using different XTM implementations, and especially mixing them in one transaction.
On 09.11.2015 09:59, Ashutosh Bapat wrote:
Since the foreign server (referred to in the slides as the secondary server) is required to run "create extension pg_dtm" and "select dtm_join_transaction(xid);", I assume that the foreign server has to be a PostgreSQL server, one which has this extension installed and is of a version that can support this extension. So we cannot use the extension for all FDWs, and even for postgres_fdw it can be used only for a foreign server with the above capabilities. The slides mention just FDW, but I think they mean postgres_fdw and not all FDWs.
The DTM approach is based on sharing XIDs and snapshots between different cluster nodes, so it really can be implemented easily only for PostgreSQL. So I really have postgres_fdw in mind rather than an abstract FDW.
The approach with timestamps is more universal and in principle can be used for any DBMS where visibility is based on CSNs.
I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.
On the other hand, I see a couple of problems:
1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.
2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.

Elaborating more on this: Slide 11 shows the arbiter protocol to start a transaction, and the next slide shows the same for commit. Slide 15 shows the transaction flow diagram for tsDTM. The DTM approach doesn't specify how XIDs are communicated between nodes, but it's implicit in the protocol that the XID space is shared by the nodes. Similarly, tsDTM assumes that the CSN space is shared by all the nodes (see the synchronization for max(CSN)). This cannot be assumed for FDWs (not even postgres_fdw), where foreign servers are independent entities with independent XID spaces.
The proposed architecture of DTM includes a "coordinator". The coordinator is a process responsible for managing the logic of a distributed transaction. It can be just a normal client application, or it can be an intermediate master node (as in the case of pg_shard).
It can also be a PostgreSQL instance (as in the case of postgres_fdw) or not. We try to put as few restrictions on the "coordinator" as possible.
It should just communicate with PostgreSQL backends using any communication protocol it likes (i.e. libpq) and invoke some special stored procedures which are part of the particular DTM extension. Such functions also impose some protocol for exchanging data between the different nodes involved in a distributed transaction. In this way we propagate XIDs/CSNs between different nodes which may not even know about each other.
In the DTM approach nodes only know the location of the "arbiter". In the tsDTM approach there is not even an arbiter...
On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

2PC for FDW and XTM are trying to solve different problems with some commonality. 2PC for FDW is trying to solve the problem of atomic commit (I am borrowing the terminology you used at PGCon 2015) for FDWs in general (although limited to FDWs which can support two-phase commit), while XTM tries to solve the problems of atomic visibility, atomic commit and consistency for postgres_fdw where the foreign servers support XTM. The only thing common between these two is atomic commit.

If we accept XTM and discard 2PC for FDW, we will not be able to support atomic commit for FDWs in general. That, I think, would be a serious limitation for Postgres FDW, especially now that DMLs are allowed. If we accept only 2PC for FDW and discard XTM, we won't get atomic visibility and consistency for postgres_fdw with foreign servers supporting XTM. That again would be a serious limitation for solutions implementing sharding, multi-master clusters etc.

There are approaches like [1] by which a cluster of heterogeneous servers (with some level of snapshot isolation) can be constructed. Ideally that will enable PostgreSQL users to maximize their utilization of FDWs.
Any distributed transaction management requires 2PC in some form or other. So we should implement 2PC for FDW keeping in mind the various forms of 2PC used practically, and use that infrastructure to build XTM-like capabilities for restricted postgres_fdw uses. Previously, I requested the authors of XTM to look at my patch and provide feedback about their requirements for implementing the 2PC part of XTM, but I have not heard anything from them.
Sorry, maybe I missed some message, but I have not received a request from you to provide feedback concerning your patch.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Any distributed transaction management requires 2PC in some form or other. So we should implement 2PC for FDW keeping in mind the various forms of 2PC used practically, and use that infrastructure to build XTM-like capabilities for restricted postgres_fdw uses. Previously, I requested the authors of XTM to look at my patch and provide feedback about their requirements for implementing the 2PC part of XTM, but I have not heard anything from them.
Sorry, maybe I missed some message, but I have not received a request from you to provide feedback concerning your patch.
See my mail on 31st August on hackers in the thread with subject "Horizontal scalability/sharding".
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The previous patch would not compile on the latest HEAD. Here's updated
> patch.
Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.
Here's an updated patch. I didn't use version numbers in the file names of my previous patches. I am starting from this one onwards.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Attachment
On Mon, Nov 9, 2015 at 8:55 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>
> On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>> > The previous patch would not compile on the latest HEAD. Here's updated
>> > patch.
>>
>> Perhaps unsurprisingly, this doesn't apply any more. But we have
>> bigger things to worry about.
>
> Here's updated patch. I didn't use version numbers in file names in my
> previous patches. I am starting from this onwards.

Ashutosh, others, this thread has been stalling for more than a month and a half. There is a new patch that still applies (be careful of whitespace, btw), but no reviews came in. So what should we do? I would tend to move this patch to the next CF because of the lack of reviews.

--
Michael
On Thu, Dec 24, 2015 at 8:32 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
Yes, that would help. Thanks.
On Mon, Nov 9, 2015 at 8:55 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
>
> On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>> > The previous patch would not compile on the latest HEAD. Here's updated
>> > patch.
>>
>> Perhaps unsurprisingly, this doesn't apply any more. But we have
>> bigger things to worry about.
>>
>
> Here's updated patch. I didn't use version numbers in file names in my
> previous patches. I am starting from this onwards.
Ashutosh, others, this thread has been stalling for more than a month
and a half. There is a new patch that still applies (be careful of
whitespace, btw), but no reviews came in. So what should we do? I
would tend to move this patch to the next CF because of the lack of
reviews.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Thu, Dec 24, 2015 at 7:03 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Thu, Dec 24, 2015 at 8:32 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Ashutosh, others, this thread has been stalling for more than a month
>> and a half. There is a new patch that still applies (be careful of
>> whitespace, btw), but no reviews came in. So what should we do? I
>> would tend to move this patch to the next CF because of the lack of
>> reviews.
>
> Yes, that would help. Thanks.

Done.

--
Michael
Ashutosh Bapat wrote:

> Here's updated patch. I didn't use version numbers in file names in my
> previous patches. I am starting from this onwards.

Um, I tried this patch and it doesn't apply at all. There's a large number of conflicts. Please update it and resubmit to the next commitfest.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:
> Ashutosh Bapat wrote:
>
>> Here's updated patch. I didn't use version numbers in file names in my
>> previous patches. I am starting from this onwards.
>
> Um, I tried this patch and it doesn't apply at all. There's a large
> number of conflicts. Please update it and resubmit to the next
> commitfest.

Also, please run "git show --check" or "git diff origin/master --check" and fix the whitespace problems that it shows. It's an easy thing but there's a lot of red squares on my screen.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
<div dir="ltr"><p>Hi All,<p>Ashutosh proposed the feature 2PC for FDW for achieving atomic commits across multiple foreignservers. <br /> If a transaction make changes to more than two foreign servers the current implementation in postgres_fdwdoesn't make sure that either all of them commit or all of them rollback their changes. <br /><br /> We (MasahikoSawada and me) reopen this thread and trying to contribute in it. <br /><br /> 2PC for FDW <br /> ============ <br/> The patch provides support for atomic commit for transactions involving foreign servers. when the transaction makeschanges to foreign servers, <br /> either all the changes to all the foreign servers commit or rollback. <br /><br />The new patch 2PC for FDW include the following things: <br /> 1. The patch 0001 introduces a generic feature. All kindsof FDW that support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can involve in the transaction. <br /><br />Currentlywe can push some conditions down to shard nodes, especially in 9.6 the directly modify feature has <br />beenintroduced. But such a transaction modifying data on shard node is not executed surely. <br />Using 0002 patch, thatmodify is executed with 2PC. It means that we almost can provide sharding solution using <br />multiple PostgreSQL server(one parent node and several shared node). <br /><br />For multi master, we definitely need transaction manager buttransaction manager probably can use this 2PC for FDW feature to manage distributed transaction. <br /><br /> 2. 0002patch makes postgres_fdw possible to use 2PC.<br /><p> 0002 patch makes postgres_fdw to use below APIs. These APIs aregeneric features which can be used by all kinds of FDWs.<p> a. Execute PREAPRE TRANSACTION and COMMIT/ABORT PREAPREDinstead of COMMIT/ABORT on foreign server which supports 2PC. <br /> b. Manage information of foreign preparedtransactions resolver <br /><p>Masahiko Sawada will post the patch. <br /><br /> Suggestions and comments are helpfulto implement this feature.<br /><br /> Regards, <br /><br /> Vinayak Pokale </div><div class="gmail_extra"><br /><divclass="gmail_quote">On Mon, Feb 1, 2016 at 11:14 PM, Alvaro Herrera <span dir="ltr"><<a href="mailto:alvherre@2ndquadrant.com"target="_blank">alvherre@2ndquadrant.com</a>></span> wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Alvaro Herrera wrote:<br/> > Ashutosh Bapat wrote:<br /> ><br /> > > Here's updated patch. I didn't use version numbers in filenames in my<br /> > > previous patches. I am starting from this onwards.<br /> ><br /> > Um, I tried thispatch and it doesn't apply at all. There's a large<br /> > number of conflicts. Please update it and resubmit tothe next<br /> > commitfest.<br /><br /></span>Also, please run "git show --check" of "git diff origin/master --check"<br/> and fix the whitespace problems that it shows. 
It's an easy thing but<br /> there's a lot of red squares inmy screen.<br /><div class="HOEnZb"><div class="h5"><br /> --<br /> Álvaro Herrera <a href="http://www.2ndQuadrant.com/"rel="noreferrer" target="_blank">http://www.2ndQuadrant.com/</a><br /> PostgreSQL Development,24x7 Support, Remote DBA, Training & Services<br /><br /><br /> --<br /> Sent via pgsql-hackers mailing list(<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br /> To make changes to your subscription:<br/><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer" target="_blank">http://www.postgresql.org/<wbr/>mailpref/pgsql-hackers</a><br /></div></div></blockquote></div><br /></div>
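To make the user-visible effect of the 0002 patch concrete, a sketch (accounts_shard1 and accounts_shard2 are hypothetical postgres_fdw foreign tables pointing at two different foreign servers):

-- On the parent node:
BEGIN;
UPDATE accounts_shard1 SET balance = balance - 100 WHERE id = 1;
UPDATE accounts_shard2 SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- Without the patch, postgres_fdw sends a plain COMMIT to each shard, so one
-- shard can commit while the other fails. With the patch, both shards first
-- get PREPARE TRANSACTION, and COMMIT PREPARED is sent only once every
-- prepare has succeeded.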
On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com> wrote:
> Hi All,
>
> Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
> across multiple foreign servers.
> If a transaction makes changes to more than two foreign servers, the
> current implementation in postgres_fdw doesn't make sure that either all
> of them commit or all of them rollback their changes.
>
> We (Masahiko Sawada and me) reopen this thread and are trying to
> contribute to it.
>
> 2PC for FDW
> ============
> The patch provides support for atomic commit for transactions involving
> foreign servers. When the transaction makes changes to foreign servers,
> either all the changes to all the foreign servers commit or all rollback.
>
> The new patch 2PC for FDW includes the following things:
> 1. The patch 0001 introduces a generic feature. All kinds of FDW that
> support 2PC, such as oracle_fdw, mysql_fdw, postgres_fdw etc., can be
> involved in the transaction.
>
> Currently we can push some conditions down to shard nodes; especially in
> 9.6 the direct modify feature has been introduced. But such a transaction
> modifying data on a shard node is not executed reliably.
> Using the 0002 patch, that modification is executed with 2PC. It means
> that we almost can provide a sharding solution using multiple PostgreSQL
> servers (one parent node and several shard nodes).
>
> For multi-master, we definitely need a transaction manager, but the
> transaction manager probably can use this 2PC for FDW feature to manage
> distributed transactions.
>
> 2. The 0002 patch makes it possible for postgres_fdw to use 2PC.
>
> The 0002 patch makes postgres_fdw use the below APIs. These APIs are
> generic features which can be used by all kinds of FDWs.
>
> a. Execute PREPARE TRANSACTION and COMMIT/ABORT PREPARED instead of
> COMMIT/ABORT on a foreign server which supports 2PC.
> b. Manage information of foreign prepared transactions for the resolver.
>
> Masahiko Sawada will post the patch.

Still lots of work to do, but attached are the latest patches.
These are based on the patch Ashutosh posted before; I revised it and divided it into two patches.
Compared with the original patch, the pg_fdw_xact_resolver patch and documentation are lacking.

Feedback and suggestions are very welcome.

Regards,

--
Masahiko Sawada
Attachment
On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com> wrote:
> Hi All,
>
> Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
> across multiple foreign servers.
> If a transaction make changes to more than two foreign servers the current
> implementation in postgres_fdw doesn't make sure that either all of them
> commit or all of them rollback their changes.
>
> We (Masahiko Sawada and me) reopen this thread and trying to contribute in
> it.
>
> 2PC for FDW
> ============
> The patch provides support for atomic commit for transactions involving
> foreign servers. when the transaction makes changes to foreign servers,
> either all the changes to all the foreign servers commit or rollback.
>
> The new patch 2PC for FDW include the following things:
> 1. The patch 0001 introduces a generic feature. All kinds of FDW that
> support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can involve in
> the transaction.
>
> Currently we can push some conditions down to shard nodes, especially in 9.6
> the directly modify feature has
> been introduced. But such a transaction modifying data on shard node is not
> executed surely.
> Using 0002 patch, that modify is executed with 2PC. It means that we almost
> can provide sharding solution using
> multiple PostgreSQL server (one parent node and several shared node).
>
> For multi master, we definitely need transaction manager but transaction
> manager probably can use this 2PC for FDW feature to manage distributed
> transaction.
>
> 2. 0002 patch makes postgres_fdw possible to use 2PC.
>
> 0002 patch makes postgres_fdw to use below APIs. These APIs are generic
> features which can be used by all kinds of FDWs.
>
> a. Execute PREPARE TRANSACTION and COMMIT/ABORT PREPARED instead of
> COMMIT/ABORT on foreign server which supports 2PC.
> b. Manage information of foreign prepared transactions resolver
>
> Masahiko Sawada will post the patch.
>
>
Thanks Vinayak and Sawada-san for taking this forward and basing your work on my patch.
Still lots of work to do, but attached are the latest patches.
These are based on the patch Ashutosh posted before; I revised it and
divided it into two patches.
Compared with the original patch, the pg_fdw_xact_resolver patch and
documentation are lacking.
I am not able to understand the last statement.
Do you mean to say that your patches do not have pg_fdw_xact_resolver() and the documentation that my patches had?
OR
do you mean to say that my patches did not have (lacked) pg_fdw_xact_resolver() and documentation?
OR some combination of those?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Thanks Vinayak and Sawada-san for taking this forward and basing your work
> on my patch.
>
>> Still lots of work to do, but attached are the latest patches.
>> These are based on the patch Ashutosh posted before; I revised it and
>> divided it into two patches.
>> Compared with the original patch, the pg_fdw_xact_resolver patch and
>> documentation are lacking.
>
> I am not able to understand the last statement.

Sorry to confuse you.

> Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
> the documentation that my patches had?

Yes. I'm confirming that your patches had them.

Regards,

--
Masahiko Sawada
On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Sorry to confuse you.

On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
>
> On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
>> > across multiple foreign servers.
>> > If a transaction make changes to more than two foreign servers the
>> > current
>> > implementation in postgres_fdw doesn't make sure that either all of them
>> > commit or all of them rollback their changes.
>> >
>> > We (Masahiko Sawada and me) reopen this thread and trying to contribute
>> > in
>> > it.
>> >
>> > 2PC for FDW
>> > ============
>> > The patch provides support for atomic commit for transactions involving
>> > foreign servers. when the transaction makes changes to foreign servers,
>> > either all the changes to all the foreign servers commit or rollback.
>> >
>> > The new patch 2PC for FDW include the following things:
>> > 1. The patch 0001 introduces a generic feature. All kinds of FDW that
>> > support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can involve
>> > in
>> > the transaction.
>> >
>> > Currently we can push some conditions down to shard nodes, especially in
>> > 9.6
>> > the directly modify feature has
>> > been introduced. But such a transaction modifying data on shard node is
>> > not
>> > executed surely.
>> > Using 0002 patch, that modify is executed with 2PC. It means that we
>> > almost
>> > can provide sharding solution using
>> > multiple PostgreSQL server (one parent node and several shared node).
>> >
>> > For multi master, we definitely need transaction manager but transaction
>> > manager probably can use this 2PC for FDW feature to manage distributed
>> > transaction.
>> >
>> > 2. 0002 patch makes postgres_fdw possible to use 2PC.
>> >
>> > 0002 patch makes postgres_fdw to use below APIs. These APIs are generic
>> > features which can be used by all kinds of FDWs.
>> >
>> > a. Execute PREPARE TRANSACTION and COMMIT/ABORT PREPARED instead of
>> > COMMIT/ABORT on foreign server which supports 2PC.
>> > b. Manage information of foreign prepared transactions resolver
>> >
>> > Masahiko Sawada will post the patch.
>> >
>> >
>>
>
> Thanks Vinayak and Sawada-san for taking this forward and basing your work
> on my patch.
>
>>
>> Still lot of work to do but attached latest patches.
>> These are based on the patch Ashutosh posted before, I revised it and
>> divided into two patches.
>> Compare with original patch, patch of pg_fdw_xact_resolver and
>> documentation are lacked.
>
>
> I am not able to understand the last statement.
> Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
> documentation that my patches had?
Yes.
I'm confirming that your patches had them.
Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve any transactions which cannot be resolved immediately after they were prepared. There was a comment from Kevin (IIRC) that leaving transactions unresolved on the foreign server keeps the resources locked on those servers. That's not a very good situation, and nobody but the initiating server can resolve those. That functionality is important to make it a complete 2PC solution. So, please consider it to be included in your first set of patches.
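For the archives, this is what the problem looks like without a resolver (a sketch; the GID is illustrative): the prepared transaction sits on the foreign server holding its locks, and only the initiating server knows the right outcome:

-- On the foreign server: in-doubt transactions left behind, e.g. by a
-- crashed coordinator
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- Until resolved they keep holding locks (and hold back vacuum). Manual
-- cleanup means the DBA has to reconstruct the coordinator's decision:
COMMIT PREPARED 'fdw_xact_100_1_10';
-- or
ROLLBACK PREPARED 'fdw_xact_100_1_10';

pg_fdw_xact_resolver() automates exactly this step, driven by the initiating server's record of the outcome instead of guesswork.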
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Aug 26, 2016 at 3:13 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Thanks for the clarification. I had added pg_fdw_xact_resolver() to
> resolve any transactions which cannot be resolved immediately after they
> were prepared. There was a comment from Kevin (IIRC) that leaving
> transactions unresolved on the foreign server keeps the resources locked
> on those servers. That's not a very good situation. And nobody but the
> initiating server can resolve those. That functionality is important to
> make it a complete 2PC solution. So, please consider it to be included
> in your first set of patches.

Yeah, I know the reason why pg_fdw_xact_resolver is required. I will add it as a separate patch.

Regards,

--
Masahiko Sawada
On 2016/08/26 15:13, Ashutosh Bapat wrote:
The attached patch includes pg_fdw_xact_resolver.

On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Sorry to confuse you.
On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
>
>
> On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com>
>> wrote:
>> > Hi All,
>> >
>> > Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
>> > across multiple foreign servers.
>> > If a transaction make changes to more than two foreign servers the
>> > current
>> > implementation in postgres_fdw doesn't make sure that either all of them
>> > commit or all of them rollback their changes.
>> >
>> > We (Masahiko Sawada and me) reopen this thread and trying to contribute
>> > in
>> > it.
>> >
>> > 2PC for FDW
>> > ============
>> > The patch provides support for atomic commit for transactions involving
>> > foreign servers. when the transaction makes changes to foreign servers,
>> > either all the changes to all the foreign servers commit or rollback.
>> >
>> > The new patch 2PC for FDW include the following things:
>> > 1. The patch 0001 introduces a generic feature. All kinds of FDW that
>> > support 2PC such as oracle_fdw, mysql_fdw, postgres_fdw etc. can involve
>> > in
>> > the transaction.
>> >
>> > Currently we can push some conditions down to shard nodes, especially in
>> > 9.6
>> > the directly modify feature has
>> > been introduced. But such a transaction modifying data on shard node is
>> > not
>> > executed surely.
>> > Using 0002 patch, that modify is executed with 2PC. It means that we
>> > almost
>> > can provide sharding solution using
>> > multiple PostgreSQL server (one parent node and several shared node).
>> >
>> > For multi master, we definitely need transaction manager but transaction
>> > manager probably can use this 2PC for FDW feature to manage distributed
>> > transaction.
>> >
>> > 2. 0002 patch makes postgres_fdw possible to use 2PC.
>> >
>> > 0002 patch makes postgres_fdw to use below APIs. These APIs are generic
>> > features which can be used by all kinds of FDWs.
>> >
>> > a. Execute PREPARE TRANSACTION and COMMIT/ABORT PREPARED instead of
>> > COMMIT/ABORT on foreign server which supports 2PC.
>> > b. Manage information of foreign prepared transactions resolver
>> >
>> > Masahiko Sawada will post the patch.
>> >
>> >
>>
>
> Thanks Vinayak and Sawada-san for taking this forward and basing your work
> on my patch.
>
>>
>> Still lot of work to do but attached latest patches.
>> These are based on the patch Ashutosh posted before, I revised it and
>> divided into two patches.
>> Compare with original patch, patch of pg_fdw_xact_resolver and
>> documentation are lacked.
>
>
> I am not able to understand the last statement.
> Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
> documentation that my patches had?
Yes.
I'm confirming that your patches had them.

Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve any transactions which cannot be resolved immediately after they were prepared. There was a comment from Kevin (IIRC) that leaving transactions unresolved on the foreign server keeps the resources locked on those servers. That's not a very good situation. And nobody but the initiating server can resolve those. That functionality is important to make it a complete 2PC solution. So, please consider it to be included in your first set of patches.
Regards,
Vinayak Pokale
NTT Open Source Software Center
Attachment
On 2016/09/07 10:54, vinayak wrote:
The attached patch includes the documentation.

The attached patch included pg_fdw_xact_resolver.

Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve any transactions which cannot be resolved immediately after they were prepared. There was a comment from Kevin (IIRC) that leaving transactions unresolved on the foreign server keeps the resources locked on those servers. That's not a very good situation. And nobody but the initiating server can resolve those. That functionality is important to make it a complete 2PC solution. So, please consider it to be included in your first set of patches.
Regards,
Vinayak Pokale
NTT Open Source Software Center
Attachment
My original patch added code to manage the files for two-phase transactions opened by the local server on the remote servers. This code was mostly inspired by the code in twophase.c which manages the files for prepared transactions. The logic to manage 2PC files has changed since [1] and has been optimized. One of the things I wanted to do is see if those optimizations are applicable here as well. Have you considered that?

[1] https://www.postgresql.org/message-id/74355FCF-AADC-4E51-850B-47AF59E0B215%40postgrespro.ru

On Fri, Aug 26, 2016 at 11:43 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> Thanks for the clarification. I had added pg_fdw_xact_resolver() to
> resolve any transactions which cannot be resolved immediately after they
> were prepared. There was a comment from Kevin (IIRC) that leaving
> transactions unresolved on the foreign server keeps the resources locked
> on those servers. That's not a very good situation. And nobody but the
> initiating server can resolve those. That functionality is important to
> make it a complete 2PC solution. So, please consider it to be included
> in your first set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> My original patch added code to manage the files for 2 phase
> transactions opened by the local server on the remote servers. This
> code was mostly inspired from the code in twophase.c which manages the
> file for prepared transactions. The logic to manage 2PC files has
> changed since [1] and has been optimized. One of the things I wanted
> to do is see, if those optimizations are applicable here as well. Have
> you considered that?

Yeah, we're considering it. After those changes are committed, we will post a patch incorporating them.

But what we need to do first is discuss the design in order to get consensus. Since the current design of this patch transparently executes the DCL of 2PC on the foreign server, it changes a lot of code and is complicated.

Another approach I have is to push down the DCL only to foreign servers that support the 2PC protocol, which is similar to DML push-down. This approach would be simpler than the current idea and easy to use by a distributed transaction manager. I think that would be a good place to start.

I'd like to discuss what the best approach is for transactions involving foreign servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> My original patch added code to manage the files for 2 phase >> transactions opened by the local server on the remote servers. This >> code was mostly inspired from the code in twophase.c which manages the >> file for prepared transactions. The logic to manage 2PC files has >> changed since [1] and has been optimized. One of the things I wanted >> to do is see, if those optimizations are applicable here as well. Have >> you considered that? >> >> > > Yeah, we're considering it. > After these changes are committed, we will post the patch incorporated > these changes. > > But what we need to do first is the discussion in order to get consensus. > Since current design of this patch is to transparently execute DCL of > 2PC on foreign server, this code changes lot of code and is > complicated. Can you please elaborate. I am not able to understand what DCL is involved here. According to [1], examples of DCL are GRANT and REVOKE command. > Another approach I have is to push down DCL to only foreign servers > that support 2PC protocol, which is similar to DML push down. > This approach would be more simpler than current idea and is easy to > use by distributed transaction manager. Again, can you please elaborate, how that would be different from the current approach and how does it simplify the code. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
>> <ashutosh.bapat@enterprisedb.com> wrote:
>>> My original patch added code to manage the files for 2 phase
>>> transactions opened by the local server on the remote servers. This
>>> code was mostly inspired from the code in twophase.c which manages the
>>> file for prepared transactions. The logic to manage 2PC files has
>>> changed since [1] and has been optimized. One of the things I wanted
>>> to do is see, if those optimizations are applicable here as well. Have
>>> you considered that?
>>
>> Yeah, we're considering it.
>> After these changes are committed, we will post the patch incorporated
>> these changes.
>>
>> But what we need to do first is the discussion in order to get consensus.
>> Since current design of this patch is to transparently execute DCL of
>> 2PC on foreign server, this code changes lot of code and is
>> complicated.
>
> Can you please elaborate. I am not able to understand what DCL is
> involved here. According to [1], examples of DCL are GRANT and REVOKE
> command.

I meant transaction management commands such as PREPARE TRANSACTION and COMMIT/ABORT PREPARED. The web page I referred to might be wrong, sorry.

>> Another approach I have is to push down DCL to only foreign servers
>> that support 2PC protocol, which is similar to DML push down.
>> This approach would be more simpler than current idea and is easy to
>> use by distributed transaction manager.
>
> Again, can you please elaborate, how that would be different from the
> current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED to foreign servers that support 2PC. With this idea, the client needs to do the following when a foreign server is involved in the transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED commands are pushed down to the foreign server. That is, the client needs to execute PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED explicitly.

With this idea, I think we no longer need the following:

* Providing the prepare id of 2PC. The current patch adds a new API prepare_id_provider(), but we can use the prepare id of 2PC that is used on the parent server.

* Keeping track of the status of foreign servers. The current patch keeps track of the status of foreign servers involved in the transaction, but this idea just pushes the transaction management commands down to the foreign server, so I think we no longer need to do that.

* Adding the max_prepared_foreign_transactions parameter. It means that the number of transactions involving foreign servers is the same as max_prepared_transactions.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> My original patch added code to manage the files for 2 phase >>>> transactions opened by the local server on the remote servers. This >>>> code was mostly inspired from the code in twophase.c which manages the >>>> file for prepared transactions. The logic to manage 2PC files has >>>> changed since [1] and has been optimized. One of the things I wanted >>>> to do is see, if those optimizations are applicable here as well. Have >>>> you considered that? >>>> >>>> >>> >>> Yeah, we're considering it. >>> After these changes are committed, we will post the patch incorporated >>> these changes. >>> >>> But what we need to do first is the discussion in order to get consensus. >>> Since current design of this patch is to transparently execute DCL of >>> 2PC on foreign server, this code changes lot of code and is >>> complicated. >> >> Can you please elaborate. I am not able to understand what DCL is >> involved here. According to [1], examples of DCL are GRANT and REVOKE >> command. > > I meant transaction management command such as PREPARE TRANSACTION and > COMMIT/ABORT PREPARED command. > The web page I refered might be wrong, sorry. > >>> Another approach I have is to push down DCL to only foreign servers >>> that support 2PC protocol, which is similar to DML push down. >>> This approach would be more simpler than current idea and is easy to >>> use by distributed transaction manager. >> >> Again, can you please elaborate, how that would be different from the >> current approach and how does it simplify the code. >> > > The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK > PREPARED to foreign servers that support 2PC. > With this idea, the client need to do following operation when foreign > server is involved with transaction. > > BEGIN; > UPDATE parent_table SET ...; -- update including foreign server > PREPARE TRANSACTION 'xact_id'; > COMMIT PREPARED 'xact_id'; > > The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed > down to foreign server. > That is, the client needs to execute PREPARE TRANSACTION and > > In this idea, I think that we don't need to do followings, > > * Providing the prepare id of 2PC. > Current patch adds new API prepare_id_provider() but we can use the > prepare id of 2PC that is used on parent server. > > * Keeping track of status of foreign servers. > Current patch keeps track of status of foreign servers involved with > transaction but this idea is just to push down transaction management > command to foreign server. > So I think that we no longer need to do that. > COMMIT/ROLLBACK PREPARED explicitly. The problem with this approach is same as one previously stated. If the connection between local and foreign server is lost between PREPARE and COMMIT the prepared transaction on the foreign server remains dangling, none other than the local server knows what to do with it and the local server has lost track of the prepared transaction on the foreign server. So, just pushing down those commands doesn't work. > > * Adding max_prepared_foreign_transactions parameter. > It means that the number of transaction involving foreign server is > the same as max_prepared_transactions. 
> That isn't true exactly. max_prepared_foreign_transactions indicates how many transactions can be prepared on the foreign server, which in the method you propose should have a cap of max_prepared_transactions * number of foreign servers. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
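For concreteness — and only as a sketch, since none of this appears in the posted patches — a prepared transaction left dangling this way remains visible on the foreign server and, absent a resolver, has to be cleaned up by hand (using the example gid 'xact_id' from above):

    -- on the foreign server: list transactions stuck in the prepared state
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

    -- manual resolution, once the correct global outcome is known:
    COMMIT PREPARED 'xact_id';
    -- or: ROLLBACK PREPARED 'xact_id';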
On Tue, Sep 27, 2016 at 6:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > * Providing the prepare id of 2PC. > Current patch adds new API prepare_id_provider() but we can use the > prepare id of 2PC that is used on parent server. And we assume that when this is used across many servers there will be no GID conflict because each server is careful enough to generate unique strings, say with UUIDs? -- Michael
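As one hedged illustration of collision-resistant gids, a UUID could be embedded; this assumes the pgcrypto extension is available on the parent server, and the 'px_' prefix is purely illustrative:

    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    -- a gid unique across servers with overwhelming probability
    SELECT 'px_' || gen_random_uuid()::text AS gid;

Embedding the database system identifier, discussed later in the thread, would serve the same purpose without an extension.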
On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat >>>> <ashutosh.bapat@enterprisedb.com> wrote: >>>>> My original patch added code to manage the files for 2 phase >>>>> transactions opened by the local server on the remote servers. This >>>>> code was mostly inspired from the code in twophase.c which manages the >>>>> file for prepared transactions. The logic to manage 2PC files has >>>>> changed since [1] and has been optimized. One of the things I wanted >>>>> to do is see, if those optimizations are applicable here as well. Have >>>>> you considered that? >>>>> >>>>> >>>> >>>> Yeah, we're considering it. >>>> After these changes are committed, we will post the patch incorporated >>>> these changes. >>>> >>>> But what we need to do first is the discussion in order to get consensus. >>>> Since current design of this patch is to transparently execute DCL of >>>> 2PC on foreign server, this code changes lot of code and is >>>> complicated. >>> >>> Can you please elaborate. I am not able to understand what DCL is >>> involved here. According to [1], examples of DCL are GRANT and REVOKE >>> command. >> >> I meant transaction management command such as PREPARE TRANSACTION and >> COMMIT/ABORT PREPARED command. >> The web page I refered might be wrong, sorry. >> >>>> Another approach I have is to push down DCL to only foreign servers >>>> that support 2PC protocol, which is similar to DML push down. >>>> This approach would be more simpler than current idea and is easy to >>>> use by distributed transaction manager. >>> >>> Again, can you please elaborate, how that would be different from the >>> current approach and how does it simplify the code. >>> >> >> The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK >> PREPARED to foreign servers that support 2PC. >> With this idea, the client need to do following operation when foreign >> server is involved with transaction. >> >> BEGIN; >> UPDATE parent_table SET ...; -- update including foreign server >> PREPARE TRANSACTION 'xact_id'; >> COMMIT PREPARED 'xact_id'; >> >> The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed >> down to foreign server. >> That is, the client needs to execute PREPARE TRANSACTION and >> >> In this idea, I think that we don't need to do followings, >> >> * Providing the prepare id of 2PC. >> Current patch adds new API prepare_id_provider() but we can use the >> prepare id of 2PC that is used on parent server. >> >> * Keeping track of status of foreign servers. >> Current patch keeps track of status of foreign servers involved with >> transaction but this idea is just to push down transaction management >> command to foreign server. >> So I think that we no longer need to do that. > >> COMMIT/ROLLBACK PREPARED explicitly. > > The problem with this approach is same as one previously stated. If > the connection between local and foreign server is lost between > PREPARE and COMMIT the prepared transaction on the foreign server > remains dangling, none other than the local server knows what to do > with it and the local server has lost track of the prepared > transaction on the foreign server. So, just pushing down those > commands doesn't work. 
Yeah, my idea is just a first step. A mechanism that resolves dangling foreign transactions and a resolver worker process are necessary.

>>
>> * Adding max_prepared_foreign_transactions parameter.
>> It means that the number of transaction involving foreign server is
>> the same as max_prepared_transactions.
>>
>
> That isn't true exactly. max_prepared_foreign_transactions indicates
> how many transactions can be prepared on the foreign server, which in
> the method you propose should have a cap of max_prepared_transactions
> * number of foreign servers.

Oh, I understand, thanks.

Consider a sharding solution using postgres_fdw (that is, the parent postgres server has multiple shard postgres servers): we need to increase max_prepared_foreign_transactions whenever a new shard server is added to the cluster, or allocate enough capacity in advance. But estimating a sufficient max_prepared_foreign_transactions would not be easy; for example, can we estimate it as (max throughput of the system) * (the number of foreign servers)?

One new idea I came up with is to embed the transaction id of the parent server in the global transaction id (gid) that is prepared on the shard server. The pg_fdw_resolver worker process then periodically resolves dangling transactions on the foreign server by comparing the lowest active XID on the parent server with the XID in the gid used by PREPARE TRANSACTION.

For example, suppose there are one parent server and one shard server, and the client executes an update transaction (XID = 100) involving the foreign server. In the commit phase, the parent server executes the PREPARE TRANSACTION command on the foreign server with a gid containing 100, say 'px_<random number>_100_<serverid>_<userid>'. If the shard server crashes before COMMIT PREPARED, transaction 100 becomes a dangling transaction.

But the resolver worker process on the parent server can resolve it with the following steps (a sketch of step 3 appears just after this message):
1. Get the lowest active XID on the parent server (XID = 110).
2. Connect to the foreign server. (Get the foreign server information from the pg_foreign_server system catalog.)
3. Check whether there is a prepared transaction with an XID less than 110.
4. Roll back the dangling transaction found at step #3. The gid 'px_<random number>_100_<serverid>_<userid>' was prepared on the foreign server by transaction 100, so roll it back.

In this idea, we need a gid provider API, but the parent server doesn't need to keep persistent foreign transaction data. Also, we could remove max_prepared_foreign_transactions, and fdw_xact.c would become a simpler implementation.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
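A sketch of step 3 as the resolver might run it against the foreign server, assuming the gid layout 'px_<random number>_<XID>_<serverid>_<userid>' described above and a lowest active parent XID of 110; the query text is illustrative only (it is not from any posted patch) and ignores XID wraparound:

    -- on the foreign server: prepared transactions whose embedded XID
    -- is older than the parent's lowest active XID (110)
    SELECT gid
    FROM pg_prepared_xacts
    WHERE gid LIKE 'px\_%'
      AND split_part(gid, '_', 3)::bigint < 110;

    -- each gid found is then rolled back, e.g. for a hypothetical gid:
    ROLLBACK PREPARED 'px_123456_100_1_10';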
On Wed, Sep 28, 2016 at 10:43 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat >>>>> <ashutosh.bapat@enterprisedb.com> wrote: >>>>>> My original patch added code to manage the files for 2 phase >>>>>> transactions opened by the local server on the remote servers. This >>>>>> code was mostly inspired from the code in twophase.c which manages the >>>>>> file for prepared transactions. The logic to manage 2PC files has >>>>>> changed since [1] and has been optimized. One of the things I wanted >>>>>> to do is see, if those optimizations are applicable here as well. Have >>>>>> you considered that? >>>>>> >>>>>> >>>>> >>>>> Yeah, we're considering it. >>>>> After these changes are committed, we will post the patch incorporated >>>>> these changes. >>>>> >>>>> But what we need to do first is the discussion in order to get consensus. >>>>> Since current design of this patch is to transparently execute DCL of >>>>> 2PC on foreign server, this code changes lot of code and is >>>>> complicated. >>>> >>>> Can you please elaborate. I am not able to understand what DCL is >>>> involved here. According to [1], examples of DCL are GRANT and REVOKE >>>> command. >>> >>> I meant transaction management command such as PREPARE TRANSACTION and >>> COMMIT/ABORT PREPARED command. >>> The web page I refered might be wrong, sorry. >>> >>>>> Another approach I have is to push down DCL to only foreign servers >>>>> that support 2PC protocol, which is similar to DML push down. >>>>> This approach would be more simpler than current idea and is easy to >>>>> use by distributed transaction manager. >>>> >>>> Again, can you please elaborate, how that would be different from the >>>> current approach and how does it simplify the code. >>>> >>> >>> The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK >>> PREPARED to foreign servers that support 2PC. >>> With this idea, the client need to do following operation when foreign >>> server is involved with transaction. >>> >>> BEGIN; >>> UPDATE parent_table SET ...; -- update including foreign server >>> PREPARE TRANSACTION 'xact_id'; >>> COMMIT PREPARED 'xact_id'; >>> >>> The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed >>> down to foreign server. >>> That is, the client needs to execute PREPARE TRANSACTION and >>> >>> In this idea, I think that we don't need to do followings, >>> >>> * Providing the prepare id of 2PC. >>> Current patch adds new API prepare_id_provider() but we can use the >>> prepare id of 2PC that is used on parent server. >>> >>> * Keeping track of status of foreign servers. >>> Current patch keeps track of status of foreign servers involved with >>> transaction but this idea is just to push down transaction management >>> command to foreign server. >>> So I think that we no longer need to do that. >> >>> COMMIT/ROLLBACK PREPARED explicitly. >> >> The problem with this approach is same as one previously stated. 
If >> the connection between local and foreign server is lost between >> PREPARE and COMMIT the prepared transaction on the foreign server >> remains dangling, none other than the local server knows what to do >> with it and the local server has lost track of the prepared >> transaction on the foreign server. So, just pushing down those >> commands doesn't work. > > Yeah, my idea is one of the first step. > Mechanism that resolves the dangling foreign transaction and the > resolver worker process are necessary. > >>> >>> * Adding max_prepared_foreign_transactions parameter. >>> It means that the number of transaction involving foreign server is >>> the same as max_prepared_transactions. >>> >> >> That isn't true exactly. max_prepared_foreign_transactions indicates >> how many transactions can be prepared on the foreign server, which in >> the method you propose should have a cap of max_prepared_transactions >> * number of foreign servers. > > Oh, I understood, thanks. > > Consider sharding solution using postgres_fdw (that is, the parent > postgres server has multiple shard postgres servers), we need to > increase max_prepared_foreign_transactions whenever new shard server > is added to cluster, or to allocate enough size in advance. But the > estimation of enough max_prepared_foreign_transactions would not be > easy, for example can we estimate it by (max throughput of the system) > * (the number of foreign servers)? > > One new idea I came up with is that we set transaction id on parent > server to global transaction id (gid) that is prepared on shard > server. > And pg_fdw_resolver worker process periodically resolves the dangling > transaction on foreign server by comparing active lowest XID on parent > server with the XID in gid used by PREPARE TRANSACTION. > > For example, suppose that there are one parent server and one shard > server, and the client executes update transaction (XID = 100) > involving foreign servers. > In commit phase, parent server executes PREPARE TRANSACTION command > with gid containing 100, say 'px_<random > number>_100_<serverid>_<userid>', on foreign server. > If the shard server crashed before COMMIT PREPARED, the transaction > 100 become danging transaction. > > But resolver worker process on parent server can resolve it with > following steps. > 1. Get lowest active XID on parent server(XID=110). > 2. Connect to foreign server. (Get foreign server information from > pg_foreign_server system catalog.) > 3. Check if there is prepared transaction with XID less than 110. > 4. Rollback the dangling transaction found at #3 step. > gid 'px_<random number>_100_<serverid>_<userid>' is prepared on > foreign server by transaction 100, rollback it. Why always rollback any dangling transaction? There can be a case that a foreign server has a dangling transaction which needs to be committed because the portions of that transaction on the other shards are committed. The way gid is crafted, there is no way to check whether the given prepared transaction was created by the local server or not. Probably the local server needs to add a unique signature in GID to identify the transactions prepared by itself. That signature should be transferred to standby to cope up with the fail-over of local server. In this idea, one has to keep on polling the foreign server to find any dangling transactions. In usual scenario, we shouldn't have a large number of dangling transactions, and thus periodic polling might be a waste. 
> In this idea, we need gid provider API but parent server doesn't need
> to have persistent foreign transaction data.
> Also we could remove max_prepared_foreign_transactions, and fdw_xact.c
> would become more simple implementation.

I agree, but we need to cope with the above two problems.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Sep 28, 2016 at 3:30 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I agree, but we need to cope with above two problems. I have marked the patch as returned with feedback per the last output Ashutosh has provided. -- Michael
On Wed, Sep 28, 2016 at 3:30 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Sep 28, 2016 at 10:43 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >>> On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat >>>> <ashutosh.bapat@enterprisedb.com> wrote: >>>>> On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>>> On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat >>>>>> <ashutosh.bapat@enterprisedb.com> wrote: >>>>>>> My original patch added code to manage the files for 2 phase >>>>>>> transactions opened by the local server on the remote servers. This >>>>>>> code was mostly inspired from the code in twophase.c which manages the >>>>>>> file for prepared transactions. The logic to manage 2PC files has >>>>>>> changed since [1] and has been optimized. One of the things I wanted >>>>>>> to do is see, if those optimizations are applicable here as well. Have >>>>>>> you considered that? >>>>>>> >>>>>>> >>>>>> >>>>>> Yeah, we're considering it. >>>>>> After these changes are committed, we will post the patch incorporated >>>>>> these changes. >>>>>> >>>>>> But what we need to do first is the discussion in order to get consensus. >>>>>> Since current design of this patch is to transparently execute DCL of >>>>>> 2PC on foreign server, this code changes lot of code and is >>>>>> complicated. >>>>> >>>>> Can you please elaborate. I am not able to understand what DCL is >>>>> involved here. According to [1], examples of DCL are GRANT and REVOKE >>>>> command. >>>> >>>> I meant transaction management command such as PREPARE TRANSACTION and >>>> COMMIT/ABORT PREPARED command. >>>> The web page I refered might be wrong, sorry. >>>> >>>>>> Another approach I have is to push down DCL to only foreign servers >>>>>> that support 2PC protocol, which is similar to DML push down. >>>>>> This approach would be more simpler than current idea and is easy to >>>>>> use by distributed transaction manager. >>>>> >>>>> Again, can you please elaborate, how that would be different from the >>>>> current approach and how does it simplify the code. >>>>> >>>> >>>> The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK >>>> PREPARED to foreign servers that support 2PC. >>>> With this idea, the client need to do following operation when foreign >>>> server is involved with transaction. >>>> >>>> BEGIN; >>>> UPDATE parent_table SET ...; -- update including foreign server >>>> PREPARE TRANSACTION 'xact_id'; >>>> COMMIT PREPARED 'xact_id'; >>>> >>>> The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed >>>> down to foreign server. >>>> That is, the client needs to execute PREPARE TRANSACTION and >>>> >>>> In this idea, I think that we don't need to do followings, >>>> >>>> * Providing the prepare id of 2PC. >>>> Current patch adds new API prepare_id_provider() but we can use the >>>> prepare id of 2PC that is used on parent server. >>>> >>>> * Keeping track of status of foreign servers. >>>> Current patch keeps track of status of foreign servers involved with >>>> transaction but this idea is just to push down transaction management >>>> command to foreign server. >>>> So I think that we no longer need to do that. >>> >>>> COMMIT/ROLLBACK PREPARED explicitly. >>> >>> The problem with this approach is same as one previously stated. 
If >>> the connection between local and foreign server is lost between >>> PREPARE and COMMIT the prepared transaction on the foreign server >>> remains dangling, none other than the local server knows what to do >>> with it and the local server has lost track of the prepared >>> transaction on the foreign server. So, just pushing down those >>> commands doesn't work. >> >> Yeah, my idea is one of the first step. >> Mechanism that resolves the dangling foreign transaction and the >> resolver worker process are necessary. >> >>>> >>>> * Adding max_prepared_foreign_transactions parameter. >>>> It means that the number of transaction involving foreign server is >>>> the same as max_prepared_transactions. >>>> >>> >>> That isn't true exactly. max_prepared_foreign_transactions indicates >>> how many transactions can be prepared on the foreign server, which in >>> the method you propose should have a cap of max_prepared_transactions >>> * number of foreign servers. >> >> Oh, I understood, thanks. >> >> Consider sharding solution using postgres_fdw (that is, the parent >> postgres server has multiple shard postgres servers), we need to >> increase max_prepared_foreign_transactions whenever new shard server >> is added to cluster, or to allocate enough size in advance. But the >> estimation of enough max_prepared_foreign_transactions would not be >> easy, for example can we estimate it by (max throughput of the system) >> * (the number of foreign servers)? >> >> One new idea I came up with is that we set transaction id on parent >> server to global transaction id (gid) that is prepared on shard >> server. >> And pg_fdw_resolver worker process periodically resolves the dangling >> transaction on foreign server by comparing active lowest XID on parent >> server with the XID in gid used by PREPARE TRANSACTION. >> >> For example, suppose that there are one parent server and one shard >> server, and the client executes update transaction (XID = 100) >> involving foreign servers. >> In commit phase, parent server executes PREPARE TRANSACTION command >> with gid containing 100, say 'px_<random >> number>_100_<serverid>_<userid>', on foreign server. >> If the shard server crashed before COMMIT PREPARED, the transaction >> 100 become danging transaction. >> >> But resolver worker process on parent server can resolve it with >> following steps. >> 1. Get lowest active XID on parent server(XID=110). >> 2. Connect to foreign server. (Get foreign server information from >> pg_foreign_server system catalog.) >> 3. Check if there is prepared transaction with XID less than 110. >> 4. Rollback the dangling transaction found at #3 step. >> gid 'px_<random number>_100_<serverid>_<userid>' is prepared on >> foreign server by transaction 100, rollback it. > > Why always rollback any dangling transaction? There can be a case that > a foreign server has a dangling transaction which needs to be > committed because the portions of that transaction on the other shards > are committed. Right, we can heuristically make a decision whether we do COMMIT or ABORT on local server. For example, if COMMIT PREPARED succeeded on at least one foreign server, the local server return OK to client and the other dangling transactions should be committed later. We can find out that we should do either commit or abort the dangling transaction by checking CLOG. But we need to handle the case where the CLOG file containing XID necessary for resolving dangling transaction is truncated. 
If the user runs VACUUM FREEZE just after the remote server crashed, it could be truncated.

> The way gid is crafted, there is no way to check whether the given
> prepared transaction was created by the local server or not. Probably
> the local server needs to add a unique signature in GID to identify
> the transactions prepared by itself. That signature should be
> transferred to standby to cope up with the fail-over of local server.

Maybe we can use the database system identifier in the control file.

> In this idea, one has to keep on polling the foreign server to find
> any dangling transactions. In usual scenario, we shouldn't have a
> large number of dangling transactions, and thus periodic polling might
> be a waste.

We can optimize it by storing the XID that was resolved heuristically into the control file or a system catalog, for example.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
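The database system identifier mentioned above is readable from SQL as of PostgreSQL 9.6, so a sketch of deriving a per-cluster gid signature could start from:

    -- the cluster-wide identifier recorded in pg_control
    SELECT system_identifier FROM pg_control_system();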
>> >> Why always rollback any dangling transaction? There can be a case that >> a foreign server has a dangling transaction which needs to be >> committed because the portions of that transaction on the other shards >> are committed. > > Right, we can heuristically make a decision whether we do COMMIT or > ABORT on local server. > For example, if COMMIT PREPARED succeeded on at least one foreign > server, the local server return OK to client and the other dangling > transactions should be committed later. > We can find out that we should do either commit or abort the dangling > transaction by checking CLOG. Heuristics can not become the default behavior. A user should be given an option to choose a heuristic, and he should be aware of the pitfalls when using this heuristic. I guess, first, we need to get a solution which ensures that the transaction gets committed on all the servers or is rolled back on all the foreign servers involved. AFAIR, my patch did that. Once we have that kind of solution, we can think about heuristics. > > But we need to handle the case where the CLOG file containing XID > necessary for resolving dangling transaction is truncated. > If the user does VACUUM FREEZE just after remote server crashed, it > could be truncated. Hmm, this needs to be fixed. Even my patch relied on XID to determine whether the transaction committed or rolled back locally and thus to decide whether it should be committed or rolled back on all the foreign servers involved. I think I had taken care of the issue you have pointed out here. Can you please verify the same? > >> The way gid is crafted, there is no way to check whether the given >> prepared transaction was created by the local server or not. Probably >> the local server needs to add a unique signature in GID to identify >> the transactions prepared by itself. That signature should be >> transferred to standby to cope up with the fail-over of local server. > > Maybe we can use database system identifier in control file. may be. > >> In this idea, one has to keep on polling the foreign server to find >> any dangling transactions. In usual scenario, we shouldn't have a >> large number of dangling transactions, and thus periodic polling might >> be a waste. > > We can optimize it by storing the XID that is resolved heuristically > into the control file or system catalog, for example. > There will be many such XIDs. We don't want to dump so many things in control file, esp. when that's not control data. System catalog is out of question since a rollback of local transaction would make those rows in the system catalog invisible. That's the reason, why I chose to write the foreign prepared transactions to files rather than a system catalog. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Hi, On 2016/10/04 13:26, Ashutosh Bapat wrote: >>> >>> Why always rollback any dangling transaction? There can be a case that >>> a foreign server has a dangling transaction which needs to be >>> committed because the portions of that transaction on the other shards >>> are committed. >> >> Right, we can heuristically make a decision whether we do COMMIT or >> ABORT on local server. >> For example, if COMMIT PREPARED succeeded on at least one foreign >> server, the local server return OK to client and the other dangling >> transactions should be committed later. >> We can find out that we should do either commit or abort the dangling >> transaction by checking CLOG. > > Heuristics can not become the default behavior. A user should be given > an option to choose a heuristic, and he should be aware of the > pitfalls when using this heuristic. I guess, first, we need to get a > solution which ensures that the transaction gets committed on all the > servers or is rolled back on all the foreign servers involved. AFAIR, > my patch did that. Once we have that kind of solution, we can think > about heuristics. I wonder if Sawada-san is referring to some sort of quorum-based (atomic) commitment protocol [1, 2], although I agree that that would be an advanced technique for handling the limitations such as blocking nature of the basic two-phase commit protocol in case of communication failures, IOW, meant for better availability rather than correctness. Thanks, Amit [1] https://en.wikipedia.org/wiki/Quorum_(distributed_computing)#Quorum-based_voting_in_commit_protocols [2] http://hub.hku.hk/bitstream/10722/158032/1/Content.pdf
On Tue, Oct 4, 2016 at 1:26 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>>
>>> Why always rollback any dangling transaction? There can be a case that
>>> a foreign server has a dangling transaction which needs to be
>>> committed because the portions of that transaction on the other shards
>>> are committed.
>>
>> Right, we can heuristically make a decision whether we do COMMIT or
>> ABORT on local server.
>> For example, if COMMIT PREPARED succeeded on at least one foreign
>> server, the local server return OK to client and the other dangling
>> transactions should be committed later.
>> We can find out that we should do either commit or abort the dangling
>> transaction by checking CLOG.
>
> Heuristics can not become the default behavior. A user should be given
> an option to choose a heuristic, and he should be aware of the
> pitfalls when using this heuristic. I guess, first, we need to get a
> solution which ensures that the transaction gets committed on all the
> servers or is rolled back on all the foreign servers involved. AFAIR,
> my patch did that. Once we have that kind of solution, we can think
> about heuristics.

I meant that we could determine it heuristically only when a remote server crashed in the 2nd phase of 2PC. For example, what does the local server return to the client when no remote server returns OK to the local server in the 2nd phase of 2PC for more than statement_timeout seconds? OK or error?

>>
>> But we need to handle the case where the CLOG file containing XID
>> necessary for resolving dangling transaction is truncated.
>> If the user does VACUUM FREEZE just after remote server crashed, it
>> could be truncated.
>
> Hmm, this needs to be fixed. Even my patch relied on XID to determine
> whether the transaction committed or rolled back locally and thus to
> decide whether it should be committed or rolled back on all the
> foreign servers involved. I think I had taken care of the issue you
> have pointed out here. Can you please verify the same?
>
>>
>>> The way gid is crafted, there is no way to check whether the given
>>> prepared transaction was created by the local server or not. Probably
>>> the local server needs to add a unique signature in GID to identify
>>> the transactions prepared by itself. That signature should be
>>> transferred to standby to cope up with the fail-over of local server.
>>
>> Maybe we can use database system identifier in control file.
>
> may be.
>
>>
>>> In this idea, one has to keep on polling the foreign server to find
>>> any dangling transactions. In usual scenario, we shouldn't have a
>>> large number of dangling transactions, and thus periodic polling might
>>> be a waste.
>>
>> We can optimize it by storing the XID that is resolved heuristically
>> into the control file or system catalog, for example.
>>
>
> There will be many such XIDs. We don't want to dump so many things in
> control file, esp. when that's not control data. System catalog is out
> of question since a rollback of local transaction would make those
> rows in the system catalog invisible. That's the reason, why I chose
> to write the foreign prepared transactions to files rather than a
> system catalog.

We can store the lowest in-doubt transaction id (say, the in-doubt XID) that needs to be resolved later into the control file, and the CLOG containing XIDs greater than the in-doubt XID is never truncated. We need to try to resolve such transactions only when the in-doubt XID is not NULL.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
>> >> Heuristics can not become the default behavior. A user should be given >> an option to choose a heuristic, and he should be aware of the >> pitfalls when using this heuristic. I guess, first, we need to get a >> solution which ensures that the transaction gets committed on all the >> servers or is rolled back on all the foreign servers involved. AFAIR, >> my patch did that. Once we have that kind of solution, we can think >> about heuristics. > > I meant that we could determine it heuristically only when remote server > crashed in 2nd phase of 2PC. > For example, what does the local server returns to client when no one remote > server returns OK to local server in 2nd phase of 2PC for more than > statement_timeout seconds? Ok or error? > The local server doesn't wait for the completion of the second phase to finish the currently running statement. Once all the foreign servers have responded to PREPARE request in the first phase, the local server responds to the client. Am I missing something? >> >> There will be many such XIDs. We don't want to dump so many things in >> control file, esp. when that's not control data. System catalog is out >> of question since a rollback of local transaction would make those >> rows in the system catalog invisible. That's the reason, why I chose >> to write the foreign prepared transactions to files rather than a >> system catalog. >> > > We can store the lowest in-doubt transaction id (say in-doubt XID) that > needs to be resolved later into control file and the CLOG containing XID > greater than in-doubt XID is never truncated. > We need to try to solve such transaction only when in-doubt XID is not NULL. > IIRC, my patch takes care of this. If the oldest active transaction happens to be later in the time line than the oldest in-doubt transaction, it sets oldest active transaction id to that of the oldest in-doubt transaction. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2016/10/04 16:10, Ashutosh Bapat wrote: >>> Heuristics can not become the default behavior. A user should be given >>> an option to choose a heuristic, and he should be aware of the >>> pitfalls when using this heuristic. I guess, first, we need to get a >>> solution which ensures that the transaction gets committed on all the >>> servers or is rolled back on all the foreign servers involved. AFAIR, >>> my patch did that. Once we have that kind of solution, we can think >>> about heuristics. >> >> I meant that we could determine it heuristically only when remote server >> crashed in 2nd phase of 2PC. >> For example, what does the local server returns to client when no one remote >> server returns OK to local server in 2nd phase of 2PC for more than >> statement_timeout seconds? Ok or error? >> > > The local server doesn't wait for the completion of the second phase > to finish the currently running statement. Once all the foreign > servers have responded to PREPARE request in the first phase, the > local server responds to the client. Am I missing something? PREPARE sent to foreign servers involved in a given transaction is *transparent* to the user who started the transaction, no? That is, user just says COMMIT and if it is found that there are multiple servers involved in the transaction, it must be handled using two-phase commit protocol *behind the scenes*. So the aforementioned COMMIT should not return to the client until after the above two-phase commit processing has finished. Or are you and Sawada-san talking about the case where the user issued PREPARE and not COMMIT? Thanks, Amit
On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2016/10/04 16:10, Ashutosh Bapat wrote: >>>> Heuristics can not become the default behavior. A user should be given >>>> an option to choose a heuristic, and he should be aware of the >>>> pitfalls when using this heuristic. I guess, first, we need to get a >>>> solution which ensures that the transaction gets committed on all the >>>> servers or is rolled back on all the foreign servers involved. AFAIR, >>>> my patch did that. Once we have that kind of solution, we can think >>>> about heuristics. >>> >>> I meant that we could determine it heuristically only when remote server >>> crashed in 2nd phase of 2PC. >>> For example, what does the local server returns to client when no one remote >>> server returns OK to local server in 2nd phase of 2PC for more than >>> statement_timeout seconds? Ok or error? >>> >> >> The local server doesn't wait for the completion of the second phase >> to finish the currently running statement. Once all the foreign >> servers have responded to PREPARE request in the first phase, the >> local server responds to the client. Am I missing something? > > PREPARE sent to foreign servers involved in a given transaction is > *transparent* to the user who started the transaction, no? That is, user > just says COMMIT and if it is found that there are multiple servers > involved in the transaction, it must be handled using two-phase commit > protocol *behind the scenes*. So the aforementioned COMMIT should not > return to the client until after the above two-phase commit processing has > finished. No, the COMMIT returns after the first phase. It can not wait for all the foreign servers to complete their second phase, which can take quite long (or never) if one of the servers has crashed in between. > > Or are you and Sawada-san talking about the case where the user issued > PREPARE and not COMMIT? I guess, Sawada-san is still talking about the user issued PREPARE. But my comment is applicable otherwise as well. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Tue, Oct 4, 2016 at 8:29 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2016/10/04 16:10, Ashutosh Bapat wrote: >>>>> Heuristics can not become the default behavior. A user should be given >>>>> an option to choose a heuristic, and he should be aware of the >>>>> pitfalls when using this heuristic. I guess, first, we need to get a >>>>> solution which ensures that the transaction gets committed on all the >>>>> servers or is rolled back on all the foreign servers involved. AFAIR, >>>>> my patch did that. Once we have that kind of solution, we can think >>>>> about heuristics. >>>> >>>> I meant that we could determine it heuristically only when remote server >>>> crashed in 2nd phase of 2PC. >>>> For example, what does the local server returns to client when no one remote >>>> server returns OK to local server in 2nd phase of 2PC for more than >>>> statement_timeout seconds? Ok or error? >>>> >>> >>> The local server doesn't wait for the completion of the second phase >>> to finish the currently running statement. Once all the foreign >>> servers have responded to PREPARE request in the first phase, the >>> local server responds to the client. Am I missing something? >> >> PREPARE sent to foreign servers involved in a given transaction is >> *transparent* to the user who started the transaction, no? That is, user >> just says COMMIT and if it is found that there are multiple servers >> involved in the transaction, it must be handled using two-phase commit >> protocol *behind the scenes*. So the aforementioned COMMIT should not >> return to the client until after the above two-phase commit processing has >> finished. > > No, the COMMIT returns after the first phase. It can not wait for all > the foreign servers to complete their second phase Hm, it sounds like it's same as normal commit (not 2PC). What's the difference? My understanding is that basically the local server can not return COMMIT to the client until 2nd phase is completed. Otherwise the next transaction can see data that is not committed yet on remote server. > , which can take > quite long (or never) if one of the servers has crashed in between. > >> >> Or are you and Sawada-san talking about the case where the user issued >> PREPARE and not COMMIT? > > I guess, Sawada-san is still talking about the user issued PREPARE. > But my comment is applicable otherwise as well. > Yes, I'm considering the case where the local server tries to COMMIT but the remote server crashed after the local server completes 1st phase (PREPARE) on the all remote server. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
>> >> No, the COMMIT returns after the first phase. It can not wait for all >> the foreign servers to complete their second phase > > Hm, it sounds like it's same as normal commit (not 2PC). > What's the difference? > > My understanding is that basically the local server can not return > COMMIT to the client until 2nd phase is completed. If we do that, the local server may not return to the client at all, if the foreign server crashes and never comes up. Practically, it may take much longer to finish a COMMIT, depending upon how long it takes for the foreign server to reply to a COMMIT message. I don't think that's desirable. > Otherwise the next transaction can see data that is not committed yet > on remote server. 2PC doesn't guarantee transactional consistency all by itself. It only guarantees that all legs of a distributed transaction are either all rolled back or all committed. IOW, it guarantees that a distributed transaction is not rolled back on some nodes and committed on the other node. Providing a transactionally consistent view is a very hard problem. Trying to solve all those problems in a single patch would be very difficult and the amount of changes required may be really huge. Then there are many possible consistency definitions when it comes to consistency of distributed system. I have not seen a consensus on what kind of consistency model/s we want to support in PostgreSQL. That's another large debate. We have had previous attempts where people have tried to complete everything in one go and nothing has been completed yet. 2PC implementation OR guaranteeing that all the legs of a transaction commit or roll back, is an essential block of any kind of distributed transaction manager. So, we should at least support that one, before attacking further problems. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>>
>>> No, the COMMIT returns after the first phase. It can not wait for all
>>> the foreign servers to complete their second phase
>>
>> Hm, it sounds like it's same as normal commit (not 2PC).
>> What's the difference?
>>
>> My understanding is that basically the local server can not return
>> COMMIT to the client until 2nd phase is completed.
>
> If we do that, the local server may not return to the client at all,
> if the foreign server crashes and never comes up. Practically, it may
> take much longer to finish a COMMIT, depending upon how long it takes
> for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves so; please refer to [1]. To prevent the local server from being blocked forever by a communication failure, we could provide a timeout on the coordinator side or on the participant side.

>> Otherwise the next transaction can see data that is not committed yet
>> on remote server.
>
> 2PC doesn't guarantee transactional consistency all by itself. It only
> guarantees that all legs of a distributed transaction are either all
> rolled back or all committed. IOW, it guarantees that a distributed
> transaction is not rolled back on some nodes and committed on the
> other node.
> Providing a transactionally consistent view is a very hard problem.
> Trying to solve all those problems in a single patch would be very
> difficult and the amount of changes required may be really huge. Then
> there are many possible consistency definitions when it comes to
> consistency of distributed system. I have not seen a consensus on what
> kind of consistency model/s we want to support in PostgreSQL. That's
> another large debate. We have had previous attempts where people have
> tried to complete everything in one go and nothing has been completed
> yet.

Yes, providing atomic visibility is a hard problem, and it's a separate issue [2].

> 2PC implementation OR guaranteeing that all the legs of a transaction
> commit or roll back, is an essential block of any kind of distributed
> transaction manager. So, we should at least support that one, before
> attacking further problems.

I agree.

[1] https://en.wikipedia.org/wiki/Two-phase_commit_protocol
[2] http://www.bailis.org/papers/ramp-sigmod2014.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >>>> >>>> No, the COMMIT returns after the first phase. It can not wait for all >>>> the foreign servers to complete their second phase >>> >>> Hm, it sounds like it's same as normal commit (not 2PC). >>> What's the difference? >>> >>> My understanding is that basically the local server can not return >>> COMMIT to the client until 2nd phase is completed. >> >> >> If we do that, the local server may not return to the client at all, >> if the foreign server crashes and never comes up. Practically, it may >> take much longer to finish a COMMIT, depending upon how long it takes >> for the foreign server to reply to a COMMIT message. > > Yes, I think 2PC behaves so, please refer to [1]. > To prevent local server stops forever due to communication failure., > we could provide the timeout on coordinator side or on participant > side. > This too, looks like a heuristic and shouldn't be the default behaviour and hence not part of the first version of this feature. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2016/10/06 17:45, Ashutosh Bapat wrote:
> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
>>>> My understanding is that basically the local server can not return
>>>> COMMIT to the client until 2nd phase is completed.
>>>
>>> If we do that, the local server may not return to the client at all,
>>> if the foreign server crashes and never comes up. Practically, it may
>>> take much longer to finish a COMMIT, depending upon how long it takes
>>> for the foreign server to reply to a COMMIT message.
>>
>> Yes, I think 2PC behaves so, please refer to [1].
>> To prevent local server stops forever due to communication failure.,
>> we could provide the timeout on coordinator side or on participant
>> side.
>
> This too, looks like a heuristic and shouldn't be the default
> behaviour and hence not part of the first version of this feature.

At any rate, the coordinator should not return to the client until after the 2nd phase is completed, which was the original point. If COMMIT taking longer is an issue, then it could be handled with one of the approaches mentioned so far (even if not in the first version), but no version of this feature should return COMMIT to the client after finishing only the first phase. Am I missing something?

I am saying this because I am assuming that this feature means the client itself does not invoke 2PC, even knowing that there are multiple servers involved, but rather relies on the involved FDW drivers and related core code handling it transparently. I may have misunderstood the feature though; apologies if so.

Thanks,
Amit
On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2016/10/06 17:45, Ashutosh Bapat wrote: >> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >>>>> My understanding is that basically the local server can not return >>>>> COMMIT to the client until 2nd phase is completed. >>>> >>>> If we do that, the local server may not return to the client at all, >>>> if the foreign server crashes and never comes up. Practically, it may >>>> take much longer to finish a COMMIT, depending upon how long it takes >>>> for the foreign server to reply to a COMMIT message. >>> >>> Yes, I think 2PC behaves so, please refer to [1]. >>> To prevent local server stops forever due to communication failure., >>> we could provide the timeout on coordinator side or on participant >>> side. >> >> This too, looks like a heuristic and shouldn't be the default >> behaviour and hence not part of the first version of this feature. > > At any rate, the coordinator should not return to the client until after > the 2nd phase is completed, which was the original point. If COMMIT > taking longer is an issue, then it could be handled with one of the > approaches mentioned so far (even if not in the first version), but no > version of this feature should really return COMMIT to the client only > after finishing the first phase. Am I missing something? There is a small time window between the actual COMMIT and the commit message being returned. An actual commit happens when we insert a WAL record saying transaction X committed, and then we return to the client saying a COMMIT happened. Note that a transaction may be committed but we will never return to the client with a commit message, because the connection was lost or the server crashed. I hope we agree on this. COMMITTING the foreign prepared transactions happens after we COMMIT the local transaction. If we do it before COMMITTING the local transaction and the local server crashes, we will roll back the local transaction during subsequent recovery while the foreign segments have committed, resulting in an inconsistent state. If we are successful in COMMITTING the foreign transactions during the post-commit phase, the COMMIT message will be returned after we have committed all foreign transactions. But in case we cannot reach a foreign server and the request times out, we cannot revert our decision that we are going to commit the transaction. That's my answer to the timeout based heuristic. I don't see much point in holding up post-commit processing for a non-responsive foreign server, which may not respond for days. Can you please elaborate a use case? Which commercial transaction manager does that? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
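[Editorial note: to make the commit ordering described above easier to follow, here is a minimal sketch of a coordinator-side sequence. This is illustrative C only, not postgres_fdw code; remote_exec() and local_commit() are hypothetical stand-ins, and the GID is invented.]

/*
 * Illustrative sketch only: prepare all remote legs, hit the local
 * commit point, then commit the remote legs, and only then report
 * COMMIT to the client.  All helpers are hypothetical stubs.
 */
#include <stdbool.h>
#include <stdio.h>

#define NSERVERS 2

/* Hypothetical stand-in: send one SQL command to a foreign server. */
static bool
remote_exec(int server, const char *sql)
{
    printf("server %d <- %s\n", server, sql);
    return true;                /* pretend the command succeeded */
}

/* Hypothetical stand-in: flush the local commit WAL record. */
static void
local_commit(void)
{
    printf("local: WAL commit record flushed (the commit point)\n");
}

int
main(void)
{
    int         i;

    /* Phase 1: prepare every remote leg before deciding anything. */
    for (i = 0; i < NSERVERS; i++)
    {
        if (!remote_exec(i, "PREPARE TRANSACTION 'fdw_gid_42'"))
        {
            int         j;

            /* Still undecided, so the legs prepared so far can be aborted. */
            for (j = 0; j < i; j++)
                remote_exec(j, "ROLLBACK PREPARED 'fdw_gid_42'");
            return 1;
        }
    }

    /*
     * The commit point: once the local commit record is on disk the
     * decision is irrevocable; a failure below may only be retried,
     * never turned into a rollback.
     */
    local_commit();

    /* Phase 2: commit the remote legs during post-commit processing. */
    for (i = 0; i < NSERVERS; i++)
        remote_exec(i, "COMMIT PREPARED 'fdw_gid_42'");

    /* Only now is it safe to report COMMIT to the client. */
    printf("client <- COMMIT\n");
    return 0;
}

[The window mentioned above is exactly the gap between local_commit() and the final message to the client: a crash in that gap leaves a committed transaction whose outcome the client never hears, and whose remote legs must be finished by recovery rather than rolled back.]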
On Fri, Oct 7, 2016 at 4:25 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2016/10/06 17:45, Ashutosh Bapat wrote: >>> On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >>>>>> My understanding is that basically the local server can not return >>>>>> COMMIT to the client until 2nd phase is completed. >>>>> >>>>> If we do that, the local server may not return to the client at all, >>>>> if the foreign server crashes and never comes up. Practically, it may >>>>> take much longer to finish a COMMIT, depending upon how long it takes >>>>> for the foreign server to reply to a COMMIT message. >>>> >>>> Yes, I think 2PC behaves so, please refer to [1]. >>>> To prevent local server stops forever due to communication failure., >>>> we could provide the timeout on coordinator side or on participant >>>> side. >>> >>> This too, looks like a heuristic and shouldn't be the default >>> behaviour and hence not part of the first version of this feature. >> >> At any rate, the coordinator should not return to the client until after >> the 2nd phase is completed, which was the original point. If COMMIT >> taking longer is an issue, then it could be handled with one of the >> approaches mentioned so far (even if not in the first version), but no >> version of this feature should really return COMMIT to the client only >> after finishing the first phase. Am I missing something? > > There is small time window between actual COMMIT and a commit message > returned. An actual commit happens when we insert a WAL saying > transaction X committed and then we return to the client saying a > COMMIT happened. Note that a transaction may be committed but we will > never return to the client with a commit message, because connection > was lost or the server crashed. I hope we agree on this. Agreed. > COMMITTING the foreign prepared transactions happens after we COMMIT > the local transaction. If we do it before COMMITTING local transaction > and the local server crashes, we will roll back local transaction > during subsequence recovery while the foreign segments have committed > resulting in an inconsistent state. > > If we are successful in COMMITTING foreign transactions during > post-commit phase, COMMIT message will be returned after we have > committed all foreign transactions. But in case we can not reach a > foreign server, and request times out, we can not revert back our > decision that we are going to commit the transaction. That's my answer > to the timeout based heuristic. IIUC, 2PC is a protocol that assumes that all of the foreign servers are live. In case we cannot reach a foreign server during the post-commit phase, basically the transaction and the following transactions should stop until the crashed server has revived. This is the first step to implement 2PC for FDW, I think. The heuristic determination approach I mentioned is one optimization idea to avoid holding up a transaction in case a foreign server has crashed. > I don't see much point in holding up post-commit processing for a > non-responsive foreign server, which may not respond for days > together. Can you please elaborate a use case? Which commercial > transaction manager does that? For example, the client updates data on a foreign server and then commits. 
And the next transaction from the same client selects the new data which was updated in the previous transaction. In this case, because the first transaction is committed, the second transaction should be able to see the updated data, but it can see the old data in your idea. Since there is obviously an order between the first transaction and the second transaction, I think that this is not a problem of providing a consistent view. I guess the transaction manager of Postgres-XC behaves so, no? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
>> >> If we are successful in COMMITTING foreign transactions during >> post-commit phase, COMMIT message will be returned after we have >> committed all foreign transactions. But in case we can not reach a >> foreign server, and request times out, we can not revert back our >> decision that we are going to commit the transaction. That's my answer >> to the timeout based heuristic. > > IIUC 2PC is the protocol that assumes that all of the foreign server live. Do you have any references? Take a look at [1]. The first paragraph itself mentions that 2PC can achieve its goals despite temporary failures. > In case we can not reach a foreign server during post-commit phase, > basically the transaction and following transaction should stop until > the crashed server revived. I have repeatedly given reasons why this is not correct. You and Amit seem to repeat this statement again and again in turns without giving any concrete reasons about why this is so. > This is the first place to implement 2PC > for FDW, I think. The heuristically determination approach I mentioned > is one of the optimization idea to avoid holding up transaction in > case a foreign server crashed. > >> I don't see much point in holding up post-commit processing for a >> non-responsive foreign server, which may not respond for days >> together. Can you please elaborate a use case? Which commercial >> transaction manager does that? > > For example, the client updates a data on foreign server and then > commits. And the next transaction from the same client selects new > data which was updated on previous transaction. In this case, because > the first transaction is committed the second transaction should be > able to see updated data, but it can see old data in your idea. Since > these is obviously order between first transaction and second > transaction I think that It's not problem of providing consistent > view. 2PC doesn't guarantee this. For that you need other methods and protocols. We have discussed this before. [2] [1] https://en.wikipedia.org/wiki/Two-phase_commit_protocol [2] https://www.postgresql.org/message-id/CAD21AoCTe1CFfA9g1uqETvLaJZfFH6QoPSDf-L3KZQ-CDZ7q8g%40mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2016/10/13 19:37, Ashutosh Bapat wrote: >> In case we can not reach a foreign server during post-commit phase, >> basically the transaction and following transaction should stop until >> the crashed server revived. > > I have repeatedly given reasons why this is not correct. You and Amit > seem to repeat this statement again and again in turns without giving > any concrete reasons about why this is so. As mentioned in description of the "Commit" or "Completion" phase in the Wikipedia article [1]: * Success If the coordinator received an agreement message from all cohorts during the commit-request phase: 1. The coordinator sends a commit message to all the cohorts. 2. Each cohort completes the operation, and releases all the locks and resources held during the transaction. 3. Each cohort sends an acknowledgment to the coordinator. 4. The coordinator completes the transaction when all acknowledgments have been received. * Failure If any cohort votes No during the commit-request phase (or the coordinator's timeout expires): 1. The coordinator sends a rollback message to all the cohorts. 2. Each cohort undoes the transaction using the undo log, and releases the resources and locks held during the transaction. 3. Each cohort sends an acknowledgement to the coordinator. 4. The coordinator undoes the transaction when all acknowledgements have been received. In point 4 of both commit and abort cases above, it's been said, "when *all* acknowledgements have been received." However, when I briefly read the description in "Transaction Management in the R* Distributed Database Management System (C. Mohan et al)" [2], it seems that what Ashutosh is saying might be a correct way to proceed after all: """ 2. THE TWO-PHASE COMMIT PROTOCOL ... After the coordinator receives the votes from all its subordinates, it initiates the second phase of the protocol. If all the votes were YES VOTES, then the coordinator moves to the committing state by force-writing a commit record and sending COMMIT messages to all the subordinates. The completion of the force-write takes the transaction to its commit point. Once this point is passed the user can be told that the transaction has been committed. ... """ Sorry about the noise. Thanks, Amit [1] https://en.wikipedia.org/wiki/Two-phase_commit_protocol#Commit_phase [2] http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/p378-mohan.pdf
On Thu, Oct 13, 2016 at 7:37 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: >>> >>> If we are successful in COMMITTING foreign transactions during >>> post-commit phase, COMMIT message will be returned after we have >>> committed all foreign transactions. But in case we can not reach a >>> foreign server, and request times out, we can not revert back our >>> decision that we are going to commit the transaction. That's my answer >>> to the timeout based heuristic. >> >> IIUC 2PC is the protocol that assumes that all of the foreign server live. > > Do you have any references? Take a look at [1]. The first paragraph > itself mentions that 2PC can achieve its goals despite temporary > failures. I guess it doesn't mention that 2PC achieves that by ignoring temporary failures. Even by waiting for the crashed server to revive, 2PC can achieve its goals. >> In case we can not reach a foreign server during post-commit phase, >> basically the transaction and following transaction should stop until >> the crashed server revived. > > I have repeatedly given reasons why this is not correct. You and Amit > seem to repeat this statement again and again in turns without giving > any concrete reasons about why this is so. > >> This is the first place to implement 2PC >> for FDW, I think. The heuristically determination approach I mentioned >> is one of the optimization idea to avoid holding up transaction in >> case a foreign server crashed. >> >>> I don't see much point in holding up post-commit processing for a >>> non-responsive foreign server, which may not respond for days >>> together. Can you please elaborate a use case? Which commercial >>> transaction manager does that? >> >> For example, the client updates a data on foreign server and then >> commits. And the next transaction from the same client selects new >> data which was updated on previous transaction. In this case, because >> the first transaction is committed the second transaction should be >> able to see updated data, but it can see old data in your idea. Since >> these is obviously order between first transaction and second >> transaction I think that It's not problem of providing consistent >> view. > > 2PC doesn't guarantee this. For that you need other methods and > protocols. We have discussed this before. [2] > At any rate, I think that it would confuse the user if there is no guarantee that the latest data updated by a previous transaction can be seen by the following transaction. I don't think that it's worth sacrificing that in order to get better performance. Providing atomic visibility for concurrent transactions could be supported later. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > However, when I briefly read the description in "Transaction Management in > the R* Distributed Database Management System (C. Mohan et al)" [2], it > seems that what Ashutosh is saying might be a correct way to proceed after > all: I think Ashutosh is mostly right, but I think there's a lot of room to doubt whether the design of this patch is good enough that we should adopt it. Consider two possible designs. In design #1, the leader performs the commit locally and then tries to send COMMIT PREPARED to every standby server afterward, and only then acknowledges the commit to the client. In design #2, the leader performs the commit locally and then acknowledges the commit to the client at once, leaving the task of running COMMIT PREPARED to some background process. Design #2 involves a race condition, because it's possible that the background process might not complete COMMIT PREPARED on every node before the user submits the next query, and that query might then fail to see supposedly-committed changes. This can't happen in design #1. On the other hand, there's always the possibility that the leader's session is forcibly killed, even perhaps by pulling the plug. If the background process contemplated by design #2 is well-designed, it can recover and finish sending COMMIT PREPARED to each relevant server after the next restart. In design #1, that background process doesn't necessarily exist, so inevitably there is a possibility of orphaning prepared transactions on the remote servers, which is not good. Even if the DBA notices them, it won't be easy to figure out whether to commit them or roll them back. I think this thought experiment shows that, on the one hand, there is a point to waiting for commits on the foreign servers, because it can avoid the anomaly of not seeing the effects of your own commits. On the other hand, it's ridiculous to suppose that every case can be handled by waiting, because that just isn't true. You can't be sure that you'll be able to wait long enough for COMMIT PREPARED to complete, and even if that works out, you may not want to wait indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has no value whatsoever unless the system design is such that failing to wait for it results in the ROLLBACK PREPARED never getting performed -- which is a pretty poor excuse. Moreover, there are good reasons to think that doing this kind of cleanup work in the post-commit hooks is never going to be acceptable. Generally, the post-commit hooks need to be no-fail, because it's too late to throw an ERROR. But there's very little hope that a connection to a remote server can be no-fail; anything that involves a network connection is, by definition, prone to failure. We can try to guarantee that every single bit of code that runs in the path that sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an ERROR, but that's going to be quite difficult to do: even palloc() can throw an error. And what about interrupts? We don't want to be stuck inside this code for a long time without any hope of the user recovering control of the session by pressing ^C, but of course the way that works is it throws an ERROR, which we can't handle here. We fixed a similar issue for synchronous replication in 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a WARNING in that case and then return control to the user. 
But if we do that here, all of the code that every FDW emits has to be aware of that rule and follow it, and it just adds to the list of ways that the user backend can escape this code without having cleaned up all of the prepared transactions on the remote side. It seems to me that the only way to really make this feature robust is to have a background worker as part of the equation. The background worker launches at startup and looks around for local state that tells it whether there are any COMMIT PREPARED or ROLLBACK PREPARED operations pending that weren't completed during the last server lifetime, whether because of a local crash or remote unavailability. It attempts to complete those and retries periodically. When a new transaction needs this type of coordination, it adds the necessary crash-proof state and then signals the background worker. If appropriate, it can wait for the background worker to complete, just like a CHECKPOINT waits for the checkpointer to finish -- but if the CHECKPOINT command is interrupted, the actual checkpoint is unaffected. More broadly, the question has been raised as to whether it's right to try to handle atomic commit and atomic visibility as two separate problems. The XTM API proposed by Postgres Pro aims to address both with a single stroke. I don't think that API was well-designed, but maybe the idea is good even if the code is not. Generally, there are two ways in which you could imagine that a distributed version of PostgreSQL might work. One possibility is that one node makes everything work by going around and giving instructions to the other nodes, which are more or less unaware that they are part of a cluster. That is basically the design of Postgres-XC and certainly the design being proposed here. The other possibility is that the nodes are actually clustered in some way and agree on things like whether a transaction committed or what snapshot is current using some kind of consensus protocol. It is obviously possible to get a fairly long way using the first approach but it seems likely that the second one is fundamentally more powerful: among other things, because the first approach is so centralized, the leader is apt to become a bottleneck. And, quite apart from that, can a centralized architecture with the leader manipulating the other workers ever allow for atomic visibility? If atomic visibility can build on top of atomic commit, then it makes sense to do atomic commit first, but if we build this infrastructure and then find that we need an altogether different solution for atomic visibility, that will be unfortunate. I know I was one of the people initially advocating this approach, but I'm no longer convinced that it's going to work out well. I don't mean that we should abandon all work on this topic, or even less all discussion, but I think we should be careful not to get so sucked into the details of perfecting this particular patch that we ignore the bigger design questions here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
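[Editorial note: a rough sketch of the resolver worker shape described above, purely for illustration. Every type and helper here is invented and stubbed so the sketch is self-contained; none of it is a real PostgreSQL API.]

/*
 * Illustrative-only sketch of a resolver background worker: scan
 * crash-proof state for pending COMMIT PREPARED / ROLLBACK PREPARED
 * work, retry it, and sleep until signalled or timed out.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

typedef struct PendingFdwXact
{
    const char *server;         /* foreign server to contact */
    const char *gid;            /* GID given to PREPARE TRANSACTION */
    bool        commit;         /* true: COMMIT PREPARED, false: ROLLBACK PREPARED */
} PendingFdwXact;

/* Stub: read the durable list of unresolved remote transactions. */
static size_t
load_pending(PendingFdwXact *buf, size_t max)
{
    (void) buf; (void) max;
    return 0;                   /* nothing pending in this stub */
}

/* Stub: issue COMMIT PREPARED or ROLLBACK PREPARED on the remote side. */
static bool
resolve_remote(const PendingFdwXact *x)
{
    printf("%s: %s PREPARED '%s'\n", x->server,
           x->commit ? "COMMIT" : "ROLLBACK", x->gid);
    return true;                /* a failure here just means "retry later" */
}

/* Stub: durably forget an entry once the remote side has confirmed it. */
static void
forget_pending(const PendingFdwXact *x)
{
    (void) x;
}

/* Stub: sleep until a backend signals new work or a timeout elapses. */
static void
wait_for_work(unsigned seconds)
{
    sleep(seconds);
}

static void
resolver_main(void)
{
    PendingFdwXact pending[64];

    for (;;)
    {
        /*
         * Re-read the crash-proof state each cycle: it may hold work
         * left over from a previous server lifetime as well as entries
         * just written by committing backends.
         */
        size_t      n = load_pending(pending, 64);

        for (size_t i = 0; i < n; i++)
        {
            if (resolve_remote(&pending[i]))
                forget_pending(&pending[i]);
        }
        wait_for_work(10);
    }
}

int
main(void)
{
    resolver_main();            /* in reality, launched by the postmaster */
    return 0;
}

[A backend wanting synchronous behaviour would durably record its participants, signal the worker, and optionally wait for it, mirroring how a CHECKPOINT command waits on the checkpointer without the checkpoint itself depending on the command surviving.]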
On Wed, Oct 19, 2016 at 11:47:25AM -0400, Robert Haas wrote: > It seems to me that the only way to really make this feature robust is > to have a background worker as part of the equation. The background > worker launches at startup and looks around for local state that tells > it whether there are any COMMIT PREPARED or ROLLBACK PREPARED > operations pending that weren't completed during the last server > lifetime, whether because of a local crash or remote unavailability. Yes, you really need both commit on foreign servers before acknowledging commit to the client, and a background process to clean things up from an abandoned server. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> However, when I briefly read the description in "Transaction Management in >> the R* Distributed Database Management System (C. Mohan et al)" [2], it >> seems that what Ashutosh is saying might be a correct way to proceed after >> all: > > I think Ashutosh is mostly right, but I think there's a lot of room to > doubt whether the design of this patch is good enough that we should > adopt it. > > Consider two possible designs. In design #1, the leader performs the > commit locally and then tries to send COMMIT PREPARED to every standby > server afterward, and only then acknowledges the commit to the client. > In design #2, the leader performs the commit locally and then > acknowledges the commit to the client at once, leaving the task of > running COMMIT PREPARED to some background process. Design #2 > involves a race condition, because it's possible that the background > process might not complete COMMIT PREPARED on every node before the > user submits the next query, and that query might then fail to see > supposedly-committed changes. This can't happen in design #1. On the > other hand, there's always the possibility that the leader's session > is forcibly killed, even perhaps by pulling the plug. If the > background process contemplated by design #2 is well-designed, it can > recover and finish sending COMMIT PREPARED to each relevant server > after the next restart. In design #1, that background process doesn't > necessarily exist, so inevitably there is a possibility of orphaning > prepared transactions on the remote servers, which is not good. Even > if the DBA notices them, it won't be easy to figure out whether to > commit them or roll them back. > > I think this thought experiment shows that, on the one hand, there is > a point to waiting for commits on the foreign servers, because it can > avoid the anomaly of not seeing the effects of your own commits. On > the other hand, it's ridiculous to suppose that every case can be > handled by waiting, because that just isn't true. You can't be sure > that you'll be able to wait long enough for COMMIT PREPARED to > complete, and even if that works out, you may not want to wait > indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has > no value whatsoever unless the system design is such that failing to > wait for it results in the ROLLBACK PREPARED never getting performed > -- which is a pretty poor excuse. > > Moreover, there are good reasons to think that doing this kind of > cleanup work in the post-commit hooks is never going to be acceptable. > Generally, the post-commit hooks need to be no-fail, because it's too > late to throw an ERROR. But there's very little hope that a > connection to a remote server can be no-fail; anything that involves a > network connection is, by definition, prone to failure. We can try to > guarantee that every single bit of code that runs in the path that > sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an > ERROR, but that's going to be quite difficult to do: even palloc() can > throw an error. And what about interrupts? We don't want to be stuck > inside this code for a long time without any hope of the user > recovering control of the session by pressing ^C, but of course the > way that works is it throws an ERROR, which we can't handle here. 
We > fixed a similar issue for synchronous replication in > 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a > WARNING in that case and then return control to the user. But if we > do that here, all of the code that every FDW emits has to be aware of > that rule and follow it, and it just adds to the list of ways that the > user backend can escape this code without having cleaned up all of the > prepared transactions on the remote side. Hmm, IIRC, my patch and possibly patch by Masahiko-san and Vinayak, tries to resolve prepared transactions in post-commit code. I agree with you here, that it should be avoided and the backend should take over the job of resolving transactions. > > It seems to me that the only way to really make this feature robust is > to have a background worker as part of the equation. The background > worker launches at startup and looks around for local state that tells > it whether there are any COMMIT PREPARED or ROLLBACK PREPARED > operations pending that weren't completed during the last server > lifetime, whether because of a local crash or remote unavailability. > It attempts to complete those and retries periodically. When a new > transaction needs this type of coordination, it adds the necessary > crash-proof state and then signals the background worker. If > appropriate, it can wait for the background worker to complete, just > like a CHECKPOINT waits for the checkpointer to finish -- but if the > CHECKPOINT command is interrupted, the actual checkpoint is > unaffected. My patch and hence patch by Masahiko-san and Vinayak have the background worker in the equation. The background worker tries to resolve prepared transactions on the foreign server periodically. IIRC, sending it a signal when another backend creates foreign prepared transactions is not implemented. That may be a good addition. > > More broadly, the question has been raised as to whether it's right to > try to handle atomic commit and atomic visibility as two separate > problems. The XTM API proposed by Postgres Pro aims to address both > with a single stroke. I don't think that API was well-designed, but > maybe the idea is good even if the code is not. Generally, there are > two ways in which you could imagine that a distributed version of > PostgreSQL might work. One possibility is that one node makes > everything work by going around and giving instructions to the other > nodes, which are more or less unaware that they are part of a cluster. > That is basically the design of Postgres-XC and certainly the design > being proposed here. The other possibility is that the nodes are > actually clustered in some way and agree on things like whether a > transaction committed or what snapshot is current using some kind of > consensus protocol. It is obviously possible to get a fairly long way > using the first approach but it seems likely that the second one is > fundamentally more powerful: among other things, because the first > approach is so centralized, the leader is apt to become a bottleneck. > And, quite apart from that, can a centralized architecture with the > leader manipulating the other workers ever allow for atomic > visibility? If atomic visibility can build on top of atomic commit, > then it makes sense to do atomic commit first, but if we build this > infrastructure and then find that we need an altogether different > solution for atomic visibility, that will be unfortunate. > There are two problems to solve as far as visibility is concerned. 1. 
Consistency: deciding which transactions' changes are visible to a given transaction. 2. Making the changes by all the segments of a given distributed transaction on different foreign servers visible at the same time; IOW, no other transaction sees changes by only a few segments without seeing the changes by all of them. The first problem is hard to solve and there are many consistency semantics; it is a large topic of discussion. The second problem can be solved on top of this infrastructure by extending the PREPARE TRANSACTION API. I am writing down my ideas so that they don't get lost. It's not a completed design. Assume that we have syntax which tells the foreign server which originating server prepared the transaction: PREPARE TRANSACTION <GID> FOR SERVER <local server name> WITH ID <xid>, where xid is the transaction identifier on the local server. Or we may incorporate that information in the GID itself, and the foreign server knows how to decode it. Once we have that information, the foreign server can actively poll the local server to get the status of transaction xid and resolve the prepared transaction itself. It can go a step further and inform the local server that it has resolved the transaction, so that the local server can purge it from its own state. It can remember the fate of xid, which can be consulted by another foreign server if the local server is down. If another transaction on the foreign server stumbles on a transaction prepared (but not resolved) by the local server, the foreign server has two options: 1. consult the local server and resolve it; 2. if the first option fails to get the status of xid, or if that option is not workable, throw an error, e.g. an in-doubt transaction error. There is probably more network traffic happening here. Usually, the local server should be able to resolve the transaction before any other transaction stumbles upon it. The overhead is incurred only when necessary. -- Best Wishes Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
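[Editorial note: as a sketch of the second option above, folding the originating server and xid into the GID itself, something like the following could work. The "fdw:<server>:<xid>" format and both helper functions are invented for illustration only.]

/*
 * Illustrative sketch: encode and decode an originating server name
 * and local xid inside the GID used for PREPARE TRANSACTION.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build a GID like "fdw:nodeA:12345" for PREPARE TRANSACTION. */
static void
format_fdw_gid(char *gid, size_t len, const char *server, unsigned int xid)
{
    snprintf(gid, len, "fdw:%s:%u", server, xid);
}

/*
 * Recover the originating server and xid, so a foreign server could
 * poll the originator about an unresolved prepared transaction.
 * Returns 1 on success, 0 if the GID is not in the expected format.
 */
static int
parse_fdw_gid(const char *gid, char *server, size_t slen, unsigned int *xid)
{
    const char *p;
    const char *q;

    if (strncmp(gid, "fdw:", 4) != 0)
        return 0;
    p = gid + 4;
    q = strrchr(p, ':');        /* last colon separates server from xid */
    if (q == NULL || (size_t) (q - p) >= slen)
        return 0;
    memcpy(server, p, q - p);
    server[q - p] = '\0';
    *xid = (unsigned int) strtoul(q + 1, NULL, 10);
    return 1;
}

int
main(void)
{
    char        gid[64];
    char        server[32];
    unsigned int xid;

    format_fdw_gid(gid, sizeof(gid), "nodeA", 12345);
    if (parse_fdw_gid(gid, server, sizeof(server), &xid))
        printf("gid=%s -> server=%s xid=%u\n", gid, server, xid);
    return 0;
}

[Any foreign server that finds an unresolved prepared transaction could then parse the GID and poll the named originator for the fate of that xid, as described above.]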
On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> However, when I briefly read the description in "Transaction Management in >>> the R* Distributed Database Management System (C. Mohan et al)" [2], it >>> seems that what Ashutosh is saying might be a correct way to proceed after >>> all: >> >> I think Ashutosh is mostly right, but I think there's a lot of room to >> doubt whether the design of this patch is good enough that we should >> adopt it. >> >> Consider two possible designs. In design #1, the leader performs the >> commit locally and then tries to send COMMIT PREPARED to every standby >> server afterward, and only then acknowledges the commit to the client. >> In design #2, the leader performs the commit locally and then >> acknowledges the commit to the client at once, leaving the task of >> running COMMIT PREPARED to some background process. Design #2 >> involves a race condition, because it's possible that the background >> process might not complete COMMIT PREPARED on every node before the >> user submits the next query, and that query might then fail to see >> supposedly-committed changes. This can't happen in design #1. On the >> other hand, there's always the possibility that the leader's session >> is forcibly killed, even perhaps by pulling the plug. If the >> background process contemplated by design #2 is well-designed, it can >> recover and finish sending COMMIT PREPARED to each relevant server >> after the next restart. In design #1, that background process doesn't >> necessarily exist, so inevitably there is a possibility of orphaning >> prepared transactions on the remote servers, which is not good. Even >> if the DBA notices them, it won't be easy to figure out whether to >> commit them or roll them back. >> >> I think this thought experiment shows that, on the one hand, there is >> a point to waiting for commits on the foreign servers, because it can >> avoid the anomaly of not seeing the effects of your own commits. On >> the other hand, it's ridiculous to suppose that every case can be >> handled by waiting, because that just isn't true. You can't be sure >> that you'll be able to wait long enough for COMMIT PREPARED to >> complete, and even if that works out, you may not want to wait >> indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has >> no value whatsoever unless the system design is such that failing to >> wait for it results in the ROLLBACK PREPARED never getting performed >> -- which is a pretty poor excuse. >> >> Moreover, there are good reasons to think that doing this kind of >> cleanup work in the post-commit hooks is never going to be acceptable. >> Generally, the post-commit hooks need to be no-fail, because it's too >> late to throw an ERROR. But there's very little hope that a >> connection to a remote server can be no-fail; anything that involves a >> network connection is, by definition, prone to failure. We can try to >> guarantee that every single bit of code that runs in the path that >> sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an >> ERROR, but that's going to be quite difficult to do: even palloc() can >> throw an error. And what about interrupts? 
We don't want to be stuck >> inside this code for a long time without any hope of the user >> recovering control of the session by pressing ^C, but of course the >> way that works is it throws an ERROR, which we can't handle here. We >> fixed a similar issue for synchronous replication in >> 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a >> WARNING in that case and then return control to the user. But if we >> do that here, all of the code that every FDW emits has to be aware of >> that rule and follow it, and it just adds to the list of ways that the >> user backend can escape this code without having cleaned up all of the >> prepared transactions on the remote side. > > Hmm, IIRC, my patch and possibly patch by Masahiko-san and Vinayak, > tries to resolve prepared transactions in post-commit code. I agree > with you here, that it should be avoided and the backend should take > over the job of resolving transactions. > >> >> It seems to me that the only way to really make this feature robust is >> to have a background worker as part of the equation. The background >> worker launches at startup and looks around for local state that tells >> it whether there are any COMMIT PREPARED or ROLLBACK PREPARED >> operations pending that weren't completed during the last server >> lifetime, whether because of a local crash or remote unavailability. >> It attempts to complete those and retries periodically. When a new >> transaction needs this type of coordination, it adds the necessary >> crash-proof state and then signals the background worker. If >> appropriate, it can wait for the background worker to complete, just >> like a CHECKPOINT waits for the checkpointer to finish -- but if the >> CHECKPOINT command is interrupted, the actual checkpoint is >> unaffected. > > My patch and hence patch by Masahiko-san and Vinayak have the > background worker in the equation. The background worker tries to > resolve prepared transactions on the foreign server periodically. > IIRC, sending it a signal when another backend creates foreign > prepared transactions is not implemented. That may be a good addition. > >> >> More broadly, the question has been raised as to whether it's right to >> try to handle atomic commit and atomic visibility as two separate >> problems. The XTM API proposed by Postgres Pro aims to address both >> with a single stroke. I don't think that API was well-designed, but >> maybe the idea is good even if the code is not. Generally, there are >> two ways in which you could imagine that a distributed version of >> PostgreSQL might work. One possibility is that one node makes >> everything work by going around and giving instructions to the other >> nodes, which are more or less unaware that they are part of a cluster. >> That is basically the design of Postgres-XC and certainly the design >> being proposed here. The other possibility is that the nodes are >> actually clustered in some way and agree on things like whether a >> transaction committed or what snapshot is current using some kind of >> consensus protocol. It is obviously possible to get a fairly long way >> using the first approach but it seems likely that the second one is >> fundamentally more powerful: among other things, because the first >> approach is so centralized, the leader is apt to become a bottleneck. >> And, quite apart from that, can a centralized architecture with the >> leader manipulating the other workers ever allow for atomic >> visibility? 
If atomic visibility can build on top of atomic commit, >> then it makes sense to do atomic commit first, but if we build this >> infrastructure and then find that we need an altogether different >> solution for atomic visibility, that will be unfortunate. >> > > There are two problems to solve as far as visibility is concerned. 1. > Consistency: changes by which transactions are visible to a given > transaction 2. Making visible, the changes by all the segments of a > given distributed transaction on different foreign servers, at the > same time IOW no other transaction sees changes by only few segments > but does not see changes by all the transactions. > > First problem is hard to solve and there are many consistency > symantics. A large topic of discussion. > > The second problem can be solved on top of this infrastructure by > extending PREPARE transaction API. I am writing down my ideas so that > they don't get lost. It's not a completed design. > > Assume that we have syntax which tells the originating server which > prepared the transaction. PREPARE TRANSACTION <GID> FOR SERVER <local > server name> with ID <xid> ,where xid is the transaction identifier on > local server. OR we may incorporate that information in GID itself and > the foreign server knows how to decode it. > > Once we have that information, the foreign server can actively poll > the local server to get the status of transaction xid and resolves the > prepared transaction itself. It can go a step further and inform the > local server that it has resolved the transaction, so that the local > server can purge it from it's own state. It can remember the fate of > xid, which can be consulted by another foreign server if the local > server is down. If another transaction on the foreign server stumbles > on a transaction prepared (but not resolved) by the local server, > foreign server has two options - 1. consult the local server and > resolve 2. if the first options fails to get the status of xid or that > if that option is not workable, throw an error e.g. indoubt > transaction. There is probably more network traffic happening here. > Usually, the local server should be able to resolve the transaction > before any other transaction stumbles upon it. The overhead is > incurred only when necessary. > I think we can consider the atomic commit and the atomic visibility separately, and the atomic visibility can be built on top of the atomic commit. We can't provide atomic visibility across multiple nodes without consistent updates. So I'd like to focus on atomic commit in this thread. When it comes to providing atomic commit, the two-phase commit protocol is the perfect solution for providing it. Whatever type of solution for atomic visibility we have, atomic commit by 2PC is a necessary feature. We can consider having an atomic commit feature with the following functionalities: * The local node is responsible for the transaction management among relevant remote servers using 2PC. * The local node has information about the state of each distributed transaction. * There is a process resolving in-doubt transactions. As Ashutosh mentioned, the current patch supports almost all of these functionalities. But I'm trying to update it so that it keeps the information for multiple foreign servers in one FDWXact file, with one entry in shared memory. 
Otherwise, even though a new remote server can be added on the fly, we could need to restart the local server in order to allocate a larger shared memory area for FDW transactions whenever a remote server is added. Also, I'm incorporating other comments. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
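[Editorial note: for what Sawada-san describes, a variable-length state record might be laid out roughly as below. Every name, field, and size in this sketch is an invented assumption, not the actual format used by the patches.]

/*
 * Hypothetical layout for a single FDWXact state record that covers
 * every foreign server touched by one local transaction.
 */
#include <stdint.h>
#include <stdio.h>

#define FDWXACT_MAX_GID 200     /* assumed maximum GID length */

typedef struct FdwXactParticipant
{
    uint32_t    serverid;               /* foreign server identifier */
    uint32_t    userid;                 /* user mapping identifier */
    char        gid[FDWXACT_MAX_GID];   /* GID used at PREPARE time */
    uint8_t     resolved;               /* remote side already resolved? */
} FdwXactParticipant;

typedef struct FdwXactRecord
{
    uint32_t    local_xid;      /* local transaction this belongs to */
    uint8_t     decision;       /* 0 = in doubt, 1 = commit, 2 = abort */
    uint32_t    nparticipants;  /* number of entries that follow */
    /*
     * FdwXactParticipant entries follow on disk.  Because the record is
     * variable length, one record serves any number of foreign servers,
     * so adding a server on the fly needs no bigger fixed allocation.
     */
} FdwXactRecord;

int
main(void)
{
    printf("header: %zu bytes, per participant: %zu bytes\n",
           sizeof(FdwXactRecord), sizeof(FdwXactParticipant));
    return 0;
}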
On Fri, Oct 21, 2016 at 1:38 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > Once we have that information, the foreign server can actively poll > the local server to get the status of transaction xid and resolves the > prepared transaction itself. It can go a step further and inform the > local server that it has resolved the transaction, so that the local > server can purge it from it's own state. It can remember the fate of > xid, which can be consulted by another foreign server if the local > server is down. If another transaction on the foreign server stumbles > on a transaction prepared (but not resolved) by the local server, > foreign server has two options - 1. consult the local server and > resolve 2. if the first options fails to get the status of xid or that > if that option is not workable, throw an error e.g. indoubt > transaction. There is probably more network traffic happening here. > Usually, the local server should be able to resolve the transaction > before any other transaction stumbles upon it. The overhead is > incurred only when necessary. Yes, something like this could be done. It's pretty complicated, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think we can consider the atomic commit and the atomic visibility > separately, and the atomic visibility can build on the top of the > atomic commit. It is true that we can do that, but I'm not sure whether it's the best design. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I think we can consider the atomic commit and the atomic visibility >> separately, and the atomic visibility can build on the top of the >> atomic commit. > > It is true that we can do that, but I'm not sure whether it's the best design. I'm not sure about the best design either. We need to discuss it more. But this is not a feature particular to the sharding solution. The atomic commit using 2PC is useful for other servers that can use 2PC, not only postgres_fdw. Attached are the latest 3 patches, which incorporate the review comments so far. But the recovery speed improvement discussed on another thread is not incorporated yet. Please give me feedback. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Mon, Oct 31, 2016 at 6:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> I think we can consider the atomic commit and the atomic visibility >>> separately, and the atomic visibility can build on the top of the >>> atomic commit. >> >> It is true that we can do that, but I'm not sure whether it's the best design. > > I'm not sure best design, too. We need to discuss more. But this is > not a particular feature for the sharing solution. The atomic commit > using 2PC is useful for other servers that can use 2PC, not only > postgres_fdw. > I think we need to discuss the big picture, i.e. the architecture for a distributed transaction manager for PostgreSQL. Divide it into smaller problems and then solve each of them as a series of commits, possibly producing a useful feature with each commit. I think what Robert is pointing out is that if we spend time solving smaller problems, we might end up with something which cannot be used to solve the bigger problem. Instead, if we define the bigger problem and come up with clear subproblems that when solved would solve the bigger problem, we may not end up in such a situation. There are many distributed transaction models discussed in various papers like [1], [2], [3]. We need to assess which one(s) would suit the PostgreSQL FDW infrastructure, and maybe specifically postgres_fdw. There is some discussion at [4]. It lists a few approaches, but I could not find a discussion on the pros and cons of each of them, or a conclusion as to which of the approaches suits PostgreSQL. Maybe we want to start that discussion. I know that it's hard to come up with a single model that would suit FDWs or would serve all kinds of applications. We may not be able to support a full distributed transaction manager for every FDW out there. It's possible that, for lack of the big picture, we will not see anything happen in this area for another release. Given that, and since all of the models in those papers require 2PC as a basic building block, I was of the opinion that we could at least start with a 2PC implementation. But I think the request for a bigger picture is also valid, for the reasons stated above. > Attached latest 3 patches that incorporated review comments so far. > But recovery speed improvement that is discussed on another thread is > not incorporated yet. > Please give me feedback. > [1] http://link.springer.com/article/10.1007/s00778-014-0359-9 [2] https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/1c0a12a383dd2cd8c125613300585c64/7684dd8109a5b3d5c1256de40051686f/$FILE/tdd99.pdf [3] http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1713&context=cstech [4] https://wiki.postgresql.org/wiki/DTM -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Wed, Nov 2, 2016 at 9:22 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Mon, Oct 31, 2016 at 6:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> I think we can consider the atomic commit and the atomic visibility >>>> separately, and the atomic visibility can build on the top of the >>>> atomic commit. >>> >>> It is true that we can do that, but I'm not sure whether it's the best design. >> >> I'm not sure best design, too. We need to discuss more. But this is >> not a particular feature for the sharing solution. The atomic commit >> using 2PC is useful for other servers that can use 2PC, not only >> postgres_fdw. >> > > I think, we need to discuss the big picture i.e. architecture for > distributed transaction manager for PostgreSQL. Divide it in smaller > problems and then solve each of them as series of commits possibly > producing a useful feature with each commit. I think, what Robert is > pointing out is if we spend time solving smaller problems, we might > end up with something which can not be used to solve the bigger > problem. Instead, if we define the bigger problem and come up with > clear subproblems that when solved would solve the bigger problem, we > may not end up in such a situation. > > There are many distributed transaction models discussed in various > papers like [1], [2], [3]. We need to assess which one/s, would suit > PostgreSQL FDW infrastructure and may be specifically for > postgres_fdw. There is some discussion at [4]. It lists a few > approaches, but I could not find a discussion on pros and cons of each > of them, and a conclusion as to which of the approaches suits > PostgreSQL. May be we want to start that discussion. Agreed. Let's start the discussion. I think that it's important to choose what type of transaction coordination we employ: centralized or distributed. > I know that it's hard to come up with a single model that would suit > FDWs or would serve all kinds of applications. We may not be able to > support a full distributed transaction manager for every FDW out > there. It's possible that because of lack of the big picture, we will > not see anything happen in this area for another release. Given that > and since all of the models in those papers require 2PC as a basic > building block, I was of the opinion that we could at least start with > 2PC implementation. But I think request for bigger picture is also > valid for reasons stated above. 2PC is a basic building block to support atomic commit, and there are some optimization techniques to reduce the disadvantages of 2PC. As you mentioned, it's hard to support a single model that would suit several types of FDWs. But even apart from sharding, because many other databases that can be connected to PostgreSQL via FDW support 2PC, 2PC for FDW would be useful for more than just sharding. That's why I have been focusing on implementing 2PC for FDW so far. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 2PC is a basic building block to support the atomic commit and there
> are some optimizations way in order to reduce disadvantage of 2PC. As
> you mentioned, it's hard to support a single model that would suit
> several type of FDWs. But even if it's not a purpose for sharding,
> because many other database which could be connected to PostgreSQL via
> FDW supports 2PC, 2PC for FDW would be useful for not only sharding
> purpose. That's why I was focusing on implementing 2PC for FDW so far.
Moved to next CF with "needs review" status.
Regards,
Hari Babu
Fujitsu Australia
On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > > > On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >> >> >> 2PC is a basic building block to support the atomic commit and there >> are some optimizations way in order to reduce disadvantage of 2PC. As >> you mentioned, it's hard to support a single model that would suit >> several type of FDWs. But even if it's not a purpose for sharding, >> because many other database which could be connected to PostgreSQL via >> FDW supports 2PC, 2PC for FDW would be useful for not only sharding >> purpose. That's why I was focusing on implementing 2PC for FDW so far. > > > Moved to next CF with "needs review" status. I think this should be changed to "returned with feedback". The design and approach themselves need to be discussed. I think we should let the authors decide whether they want it to be added to the next commitfest or not. When I first started with this work, Tom had suggested that I try to make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers, or at least postgres_fdw servers, work. I think most of my work, which Vinayak and Sawada have rebased onto the latest master, will be required to get what Tom suggested done. We wouldn't need a lot of changes to that design. PREPARE involving foreign servers errors out right now. If we start supporting prepared transactions involving foreign servers, that will be a good improvement over the current status quo. Once we get that done, we can continue working on the larger problem of supporting ACID transactions involving foreign servers. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Dec 5, 2016 at 4:42 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:
> On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
> <kommi.haribabu@gmail.com> wrote:
>>
>>
>> On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>> wrote:
>>>
>>>
>>> 2PC is a basic building block to support the atomic commit and there
>>> are some optimizations way in order to reduce disadvantage of 2PC. As
>>> you mentioned, it's hard to support a single model that would suit
>>> several type of FDWs. But even if it's not a purpose for sharding,
>>> because many other database which could be connected to PostgreSQL via
>>> FDW supports 2PC, 2PC for FDW would be useful for not only sharding
>>> purpose. That's why I was focusing on implementing 2PC for FDW so far.
>>
>>
>> Moved to next CF with "needs review" status.
> I think this should be changed to "returned with feedback.". The
> design and approach itself needs to be discussed. I think, we should
> let authors decide whether they want it to be added to the next
> commitfest or not.
> When I first started with this work, Tom had suggested me to try to
> make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
> at least postgres_fdw servers work. I think, most of my work that
> Vinayak and Sawada have rebased to the latest master will be required
> for getting what Tom suggested done. We wouldn't need a lot of changes
> to that design. PREPARE involving foreign servers errors out right
> now. If we start supporting prepared transactions involving foreign
> servers that will be a good improvement over the current status-quo.
> Once we get that done, we can continue working on the larger problem
> of supporting ACID transactions involving foreign servers.
Thanks for the update.
I closed it in commitfest 2017-01 with "returned with feedback". The author can
update it once a new patch is submitted.
Regards,
Hari Babu
Fujitsu Australia
On 2016/12/05 14:42, Ashutosh Bapat wrote: > On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi > <kommi.haribabu@gmail.com> wrote: > > > On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com> > wrote: >>> >>> 2PC is a basic building block to support the atomic commit and there >>> are some optimizations way in order to reduce disadvantage of 2PC. As >>> you mentioned, it's hard to support a single model that would suit >>> several type of FDWs. But even if it's not a purpose for sharding, >>> because many other database which could be connected to PostgreSQL via >>> FDW supports 2PC, 2PC for FDW would be useful for not only sharding >>> purpose. That's why I was focusing on implementing 2PC for FDW so far. >> >> Moved to next CF with "needs review" status. > I think this should be changed to "returned with feedback.". The > design and approach itself needs to be discussed. I think, we should > let authors decide whether they want it to be added to the next > commitfest or not. > > When I first started with this work, Tom had suggested me to try to > make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or > at least postgres_fdw servers work. I think, most of my work that > Vinayak and Sawada have rebased to the latest master will be required > for getting what Tom suggested done. We wouldn't need a lot of changes > to that design. PREPARE involving foreign servers errors out right > now. If we start supporting prepared transactions involving foreign > servers that will be a good improvement over the current status-quo. > Once we get that done, we can continue working on the larger problem > of supporting ACID transactions involving foreign servers. In the PGConf.ASIA developers meeting, Bruce Momjian and other developers discussed FDW-based sharding [1]. The suggestion from other hackers was that we need to discuss the big picture and the use cases of sharding. Bruce has listed all the building blocks of built-in sharding on the wiki [2]. IIUC, a transaction manager involving foreign servers is one part of sharding. As per Bruce's wiki page, there are two use cases for transactions involving multiple foreign servers: 1. Cross-node read-only queries on read/write shards: This will require a global snapshot manager to make sure the shards return consistent data. 2. Cross-node read-write queries: This will require a global snapshot manager and a global transaction manager. I agree with you that if we start supporting PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers, that will be a good improvement. [1] https://wiki.postgresql.org/wiki/PgConf.Asia_2016_Developer_Meeting [2] https://wiki.postgresql.org/wiki/Built-in_Sharding Regards, Vinayak Pokale NTT Open Source Software Center
On Fri, Dec 9, 2016 at 3:02 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:
> [ snip ]
> IIUC, a transaction manager involving foreign servers is one part of
> sharding.

Yeah, 2PC on FDW is a basic building block for FDW-based sharding, and it
would be useful not only for FDW sharding but also for other purposes. As
far as I have surveyed the papers, many kinds of distributed transaction
management architectures use 2PC for atomic commit, with some
optimisations. And using 2PC to provide atomic commit for distributed
transactions fits well with the current PostgreSQL implementation.

> I agree with you that if we start supporting PREPARE and COMMIT/ROLLBACK
> PREPARED involving foreign servers, that will be a good improvement.

I also agree to work on implementing atomic commit across foreign servers
first and then continue working on the larger problem. I think this will
be a large step forward. I'm going to submit the updated version of the
patch to CF3.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Dec 9, 2016 at 4:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> I also agree to work on implementing atomic commit across foreign
> servers first and then continue working on the larger problem. I think
> this will be a large step forward. I'm going to submit the updated
> version of the patch to CF3.

Attached are the latest version patches. The overall design is the same as
the previous patches; I incorporated some optimisations and updated the
documentation, but the documentation and regression tests are still not
enough.

The 000 patch adds some new FDW APIs to achieve atomic commit involving
foreign servers using two-phase commit. If more than one foreign server is
involved in the transaction, or the transaction changes local data and
involves even one foreign server, the local node executes PREPARE and
COMMIT/ROLLBACK PREPARED on the foreign servers at commit. A large part of
this implementation is inspired by the two-phase commit code, so I
incorporated recent changes to that code, for example the recovery speed
improvement, into this patch.

The 001 patch makes postgres_fdw support atomic commit. If
two_phase_commit is set to 'on' for a foreign server, two-phase commit
will be used at commit. The 002 patch adds a new contrib module,
pg_fdw_resolver, a background worker process that resolves in-doubt
transactions on foreign servers if there are any.

My replies might be delayed until next week, but feedback and review
comments are very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On Fri, Dec 23, 2016 at 1:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> My replies might be delayed until next week, but feedback and review
> comments are very welcome.

A long time has passed since the original patch was proposed by Ashutosh,
so let me explain the current design and functionality of this feature
again. If you have any questions, please feel free to ask.

Parameters
==========
The patch introduces the max_foreign_prepared_transactions parameter and
the two_phase_commit parameter.

two_phase_commit is a new foreign-server option which indicates that the
specified foreign server is capable of the two-phase commit protocol:
modifying transactions can be committed on a foreign server with
two_phase_commit = on using two-phase commit. We can set this option with
the CREATE/ALTER SERVER command.

max_foreign_prepared_transactions is a new GUC parameter, which controls
the upper bound on the number of transactions the local server can prepare
on foreign servers. Note that it does not control the number of
transactions on the local server that involve foreign servers. Since one
transaction can prepare transactions on multiple foreign servers,
max_foreign_prepared_transactions should be set to at least
(max_connections) * (the number of foreign servers with two_phase_commit
= on). Changing this parameter requires a server restart.

Cluster-wide atomic commit
==========================
Since transaction commits on foreign servers are executed independently, a
transaction that modified data on multiple foreign servers has no
guarantee that either all of them commit or all of them roll back. The
patch adds functionality that guarantees a distributed transaction either
commits or rolls back on all foreign servers; IOW the goal of this patch
is achieving cluster-wide atomic commit across foreign servers that are
capable of the two-phase commit protocol. If a transaction modifies data
on multiple foreign servers and then COMMITs, the transaction is committed
or rolled back on the foreign servers implicitly, using the two-phase
commit protocol.
A transaction is committed or rolled back using the two-phase commit
protocol in the following cases:
* The transaction changes local data.
* The transaction changes data on more than one foreign server whose
  two_phase_commit is on.

In order to manage foreign transactions, the patch changes PostgreSQL core
so that it keeps track of them. These entries live in shared memory but
are written to fdw_xact files in the $PGDATA/fdw_xact directory at
checkpoint. We can check all foreign transaction entries via the
pg_fdw_xacts system view.

The commit of a distributed transaction using the two-phase commit
protocol is executed as follows. Every foreign server with
two_phase_commit = on needs to register its connection to MyFDWConnection
using RegisterXactForeignServer() when a new transaction starts on the
foreign connection. During the pre-commit phase the following steps are
executed:
1. Get the transaction identifiers used for PREPARE TRANSACTION on the
   foreign servers.
2. Execute COMMIT on the foreign servers with two_phase_commit = off.
3. Register the fdw_xact entries in shared memory and write
   XLOG_FDW_XACT_INSERT WAL records.
4. Execute PREPARE TRANSACTION on the foreign servers with
   two_phase_commit = on.
After that, the local changes are committed (RecordTransactionCommit() is
called). Between phase 1 and the local commit, the transaction could still
fail, for example due to a serialization failure or an error in pre-commit
processing (such as for NOTIFY). In such cases, all foreign transactions
are rolled back.

In the second phase, the foreign transactions on servers with
two_phase_commit = off have already finished in the first phase, so we
focus only on the foreign servers with two_phase_commit = on. During the
commit phase the following steps are executed:
1. Resolve the prepared foreign transactions.
2. Remove the foreign transaction entries and write XLOG_FDW_XACT_REMOVE
   WAL records.
In case the server crashes after step 1 and before step 2, an
already-resolved foreign transaction will be considered unresolved when
the local server recovers or a standby takes over the master. It will try
to resolve the prepared transaction again and should get an error from the
foreign server.
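To make the commit sequence concrete, here is a minimal C sketch of the
control flow just described, rendered per connection for brevity (the
patch may well order the steps across all connections). The FdwConn type
and the register_fdw_xact() helper are assumptions made for illustration,
not the patch's actual code; only the step ordering and the callback names
follow the description above.

#include <stdbool.h>

/*
 * Minimal sketch of the pre-commit sequence described above.  The types
 * and helper names are assumptions for illustration only.
 */
typedef struct FdwConn
{
	bool		two_phase_commit;	/* server's two_phase_commit option */
	void		(*EndForeignTransaction) (struct FdwConn *conn, bool commit);
	char	   *(*GetPreparedId) (struct FdwConn *conn);
	void		(*PrepareForeignTransaction) (struct FdwConn *conn,
											  const char *fdw_xact_id);
} FdwConn;

/* Hypothetical: insert the shared-memory entry and write XLOG_FDW_XACT_INSERT */
extern void register_fdw_xact(FdwConn *conn, const char *fdw_xact_id);

static void
pre_commit_fdw_xacts(FdwConn **conns, int nconns)
{
	for (int i = 0; i < nconns; i++)
	{
		FdwConn    *conn = conns[i];

		if (!conn->two_phase_commit)
		{
			/* Step 2: plain COMMIT on servers that don't use 2PC */
			conn->EndForeignTransaction(conn, true);
		}
		else
		{
			/* Step 1: get the identifier used for PREPARE TRANSACTION */
			char	   *fdw_xact_id = conn->GetPreparedId(conn);

			/* Step 3: persist the entry before touching the remote server */
			register_fdw_xact(conn, fdw_xact_id);

			/* Step 4: PREPARE TRANSACTION on the remote server */
			conn->PrepareForeignTransaction(conn, fdw_xact_id);
		}
	}

	/*
	 * The local commit (RecordTransactionCommit()) happens after this
	 * returns; any failure before it rolls back all foreign transactions.
	 */
}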
Crash recovery
==============
During crash recovery, fdw_xact entries are inserted into or removed from
the KnownFDWXact list when the corresponding WAL records are replayed.
After redo is done, the fdw_xact files are re-created and the pg_fdw_xact
directory is scanned for unresolved prepared foreign transactions. The
files in this directory are named with the triplet (xid, foreign server
oid, user oid) to create a unique name for each file. This scan also emits
the oldest transaction id with an unresolved prepared foreign transaction.
This affects the oldest active transaction id, since the status of that
transaction id is required to decide the fate of unresolved prepared
foreign transactions.

On a standby, the files are simply inserted or removed during WAL replay.
If the standby is required to finish recovery and take over the master,
pg_fdw_xact is scanned to read the unresolved prepared foreign
transactions into shared memory. Much of the fdw_xact.c code is inspired
by twophase.c, so the recovery mechanism and process are almost the same
as for two-phase commit; the patch incorporates recent optimizations of
twophase.c.

Handling in-doubt transactions
==============================
Any crash or connection failure in phase 2 leaves the prepared transaction
in an unresolved state (called an in-doubt transaction). We need to
resolve in-doubt transactions after the foreign server has recovered. We
can do that manually by calling the pg_fdw_xact_resolve function on the
local server, but the patch introduces a new contrib module,
pg_fdw_resolver, in order to handle them automatically. pg_fdw_resolver is
a background worker process which periodically checks whether there are
in-doubt transactions and tries to resolve them.

FDW APIs
========
The patch introduces four new FDW APIs: GetPreparedId,
EndForeignTransaction, PrepareForeignTransaction and
ResolvePreparedForeignTransaction.
* GetPreparedId is called in the pre-commit phase to get the transaction
  identifier.
* EndForeignTransaction is called in the commit phase and executes either
  COMMIT or ROLLBACK on the foreign server.
* PrepareForeignTransaction is called in the pre-commit phase and executes
  PREPARE TRANSACTION on the foreign server.
* ResolvePreparedForeignTransaction is called in the commit phase and
  executes either COMMIT PREPARED or ROLLBACK PREPARED with the given
  transaction identifier on the foreign server.
If the foreign server is not capable of two-phase commit, the last two
APIs are not required.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
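As a rough illustration of the four APIs summarized above, here is a
hedged C sketch of what the callback signatures might look like. The
parameter lists (server and user OIDs plus a prepared-transaction
identifier) are assumptions made for the sake of the sketch; the patch
defines the real signatures.

#include <stdbool.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid, to keep
								 * the sketch self-contained */

/* Pre-commit: return the identifier to use for PREPARE TRANSACTION */
typedef char *(*GetPreparedId_function) (Oid serverid, Oid userid);

/* Commit phase: plain COMMIT or ROLLBACK on the foreign server */
typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
												bool is_commit);

/* Pre-commit: PREPARE TRANSACTION on the foreign server */
typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
													const char *fdw_xact_id);

/* Commit phase: COMMIT PREPARED or ROLLBACK PREPARED of the prepared xact */
typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
															Oid userid,
															const char *fdw_xact_id,
															bool is_commit);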
> [ snip ]
> A long time has passed since the original patch was proposed by
> Ashutosh, so let me explain the current design and functionality of
> this feature again. If you have any questions, please feel free to ask.

Thanks for the summary.

> Parameters
> ==========

[ snip ]

> Cluster-wide atomic commit
> ==========================
> [ snip ] IOW the goal of this patch is achieving cluster-wide atomic
> commit across foreign servers that are capable of the two-phase commit
> protocol.

In [1], I proposed that we solve the problem of supporting PREPARED
transactions involving foreign servers, and in a subsequent mail Vinayak
agreed to that. But this goal has a wider scope than that proposal. I am
fine with widening the scope, but then it would again lead to the same
discussion we had about the big picture. Maybe you want to share a design
(or point out the parts of this design that will help) for solving the
smaller problem and tone down the patch accordingly.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Fri, Jan 13, 2017 at 3:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> [ snip ]
> I am fine with widening the scope, but then it would again lead to the
> same discussion we had about the big picture. Maybe you want to share a
> design (or point out the parts of this design that will help) for
> solving the smaller problem and tone down the patch accordingly.

Sorry for confusing you. I'm still focusing on solving only that problem.
What I was trying to say is that I think supporting PREPARED transactions
involving foreign servers is the means, not the end. So once we support
PREPARED transactions involving foreign servers, we can achieve
cluster-wide atomic commit in a sense.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Jan 13, 2017 at 3:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> So once we support PREPARED transactions involving foreign servers, we
> can achieve cluster-wide atomic commit in a sense.

Attached are updated patches. I fixed some bugs and added a 003 patch that
adds a TAP test for foreign transactions. The 003 patch depends on the 000
and 001 patches.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On 2017/01/16 17:35, Masahiko Sawada wrote:
> [ snip ]
> Attached are updated patches. I fixed some bugs and added a 003 patch
> that adds a TAP test for foreign transactions.
>
> Please give me feedback.

I have tested prepared transactions with foreign servers, but after
preparing the transaction the following error occurs repeatedly.

Test:
=====
=#BEGIN;
=#INSERT INTO ft1_lt VALUES (10);
=#INSERT INTO ft2_lt VALUES (20);
=#PREPARE TRANSACTION 'prep_xact_with_fdw';

2017-01-18 15:09:48.378 JST [4312] ERROR:  function pg_fdw_resolve() does not exist at character 8
2017-01-18 15:09:48.378 JST [4312] HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
2017-01-18 15:09:48.378 JST [4312] QUERY:  SELECT pg_fdw_resolve()
2017-01-18 15:09:48.378 JST [29224] LOG:  worker process: foreign transaction resolver (dbid 13119) (PID 4312) exited with exit code 1
.....

If we check the status from another session, it shows the status as
prepared.

=# select * from pg_fdw_xacts;
 dbid  | transaction | serverid | userid |  status  |       identifier
-------+-------------+----------+--------+----------+------------------------
 13119 |        1688 |    16388 |     10 | prepared | px_2102366504_16388_10
 13119 |        1688 |    16389 |     10 | prepared | px_749056984_16389_10
(2 rows)

I think this is a bug.

Regards,
Vinayak Pokale
NTT Open Source Software Center
On Thu, Jan 19, 2017 at 4:04 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:
> [ snip ]
> 2017-01-18 15:09:48.378 JST [4312] ERROR:  function pg_fdw_resolve()
> does not exist at character 8
> [ snip ]
> I think this is a bug.

Thank you for reviewing!

I think this is a bug in the pg_fdw_resolver contrib module. I had
forgotten to change the SQL executed by the pg_fdw_resolver process.
Attached is the latest version of the 002 patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
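For context, the resolver's job boils down to a background worker loop
that periodically calls the resolution function. The sketch below is an
assumption for illustration, not the contrib module's actual code; the bug
above amounts to the loop issuing SELECT pg_fdw_resolve() instead of the
pg_fdw_xact_resolve() function that actually exists.

#include "postgres.h"

#include "access/xact.h"
#include "executor/spi.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "utils/snapmgr.h"

void
fdw_resolver_main(Datum main_arg)
{
	BackgroundWorkerUnblockSignals();

	/* The database name is hard-coded only to keep the sketch short */
	BackgroundWorkerInitializeConnection("postgres", NULL);

	for (;;)
	{
		CHECK_FOR_INTERRUPTS();

		StartTransactionCommand();
		PushActiveSnapshot(GetTransactionSnapshot());
		SPI_connect();

		/* Resolve any in-doubt foreign transactions */
		SPI_execute("SELECT pg_fdw_xact_resolve()", false, 0);

		SPI_finish();
		PopActiveSnapshot();
		CommitTransactionCommand();

		pg_usleep(10 * 1000000L);	/* assumed 10-second poll interval */
	}
}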
Attachment
On Thu, Jan 19, 2017 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> I think this is a bug in the pg_fdw_resolver contrib module. I had
> forgotten to change the SQL executed by the pg_fdw_resolver process.
> Attached is the latest version of the 002 patch.

As the previous version of the patch conflicts with current HEAD, attached
are updated version patches. I also fixed some bugs in
pg_fdw_xact_resolver and added some documentation.

Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Hi Sawada-san,
On 2017/01/26 16:51, Masahiko Sawada wrote:
> I think this is a bug in the pg_fdw_resolver contrib module. I had
> forgotten to change the SQL executed by the pg_fdw_resolver process.
> Attached is the latest version of the 002 patch.
>
> As the previous version of the patch conflicts with current HEAD,
> attached are updated version patches. I also fixed some bugs in
> pg_fdw_xact_resolver and added some documentation. Please review it.

Thank you for updating the patches.
I have applied the patches on Postgres HEAD.
I then created the postgres_fdw extension in PostgreSQL and got a segmentation fault.
Details:
=# 2017-01-26 17:52:56.156 JST [3411] LOG: worker process: foreign transaction resolver launcher (PID 3418) was terminated by signal 11: Segmentation fault
2017-01-26 17:52:56.156 JST [3411] LOG: terminating any other active server processes
2017-01-26 17:52:56.156 JST [3425] WARNING: terminating connection because of crash of another server process
2017-01-26 17:52:56.156 JST [3425] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-01-26 17:52:56.156 JST [3425] HINT: In a moment you should be able to reconnect to the database and repeat your command.
Is this a bug?
Regards,
Vinayak Pokale
NTT Open Source Software Center
On Thu, Jan 26, 2017 at 6:04 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:
> [ snip ]
> =# 2017-01-26 17:52:56.156 JST [3411] LOG:  worker process: foreign
> transaction resolver launcher (PID 3418) was terminated by signal 11:
> Segmentation fault
> [ snip ]
> Is this a bug?

Thank you for testing!

Sorry, I attached the wrong version of the pg_fdw_xact_resolver patch.
Please use the attached patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On 1/26/17 4:49 AM, Masahiko Sawada wrote:
> Sorry, I attached the wrong version of the pg_fdw_xact_resolver patch.
> Please use the attached patch.

So in some other thread we are talking about renaming "xlog", because
nobody knows what the "x" means. In the spirit of that, let's find better
names for new functions as well.

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017/01/29 0:11, Peter Eisentraut wrote:
> So in some other thread we are talking about renaming "xlog", because
> nobody knows what the "x" means. In the spirit of that, let's find
> better names for new functions as well.

+1

Regards,
Vinayak Pokale
NTT Open Source Software Center
On Sat, Jan 28, 2017 at 8:41 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> So in some other thread we are talking about renaming "xlog", because
> nobody knows what the "x" means. In the spirit of that, let's find
> better names for new functions as well.

It's common in English (not just in database jargon) to abbreviate "trans"
as "x" [1]. xlog went a bit far by abbreviating the whole of "transaction"
as "x", but here "xact" means "transact", which is fine. Maybe we should
use 'X' instead of 'x', I don't know. That said, I am fine with any other
name which conveys what the function does.

[1] https://en.wikipedia.org/wiki/X

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jan 30, 2017 at 12:50 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> [ snip ]
> That said, I am fine with any other name which conveys what the
> function does.

"txn" can be used as an abbreviation of "transaction", so for example
pg_fdw_txn_resolver? I'm also fine with changing the module and function
names.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Jan 26, 2017 at 6:49 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Sorry, I attached the wrong version of the pg_fdw_xact_resolver patch.
> Please use the attached patch.

This patch has been moved to CF 2017-03.

--
Michael
On Mon, Jan 30, 2017 at 2:30 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> "txn" can be used as an abbreviation of "transaction", so for example
> pg_fdw_txn_resolver?
> I'm also fine with changing the module and function names.

If we're judging the relative clarity of various ways of abbreviating the
word "transaction", "txn" surely beats "x".

To repeat my usual refrain, is there any merit to abbreviating at all?
Could we call it, say, "fdw_transaction_resolver" or
"fdw_transaction_manager"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Feb 1, 2017 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ snip ]
> To repeat my usual refrain, is there any merit to abbreviating at all?
> Could we call it, say, "fdw_transaction_resolver" or
> "fdw_transaction_manager"?

Almost all modules in contrib are named with a "pg_" prefix, but I prefer
"fdw_transaction_resolver" if we don't need the "pg_" prefix.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 6, 2017 at 10:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> Almost all modules in contrib are named with a "pg_" prefix, but I
> prefer "fdw_transaction_resolver" if we don't need the "pg_" prefix.

Since the previous patches conflict with current HEAD, attached are the
latest version patches. Please review them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On Wed, Feb 15, 2017 at 3:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> Since the previous patches conflict with current HEAD, attached are the
> latest version patches.
> Please review them.

I've created a wiki page [1] describing the design and functionality of
this feature. It also has some examples of use cases, so the page should
be helpful even for testing. Please refer to it if you're interested in
testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2017/02/28 16:54, Masahiko Sawada wrote:
> I've created a wiki page [1] describing the design and functionality of
> this feature. It also has some examples of use cases, so the page
> should be helpful even for testing. Please refer to it if you're
> interested in testing this feature.
>
> [1] 2PC on FDW
> <https://wiki.postgresql.org/wiki/2PC_on_FDW>

Thank you for creating the wiki page.
In the "src/test/regress/pg_regress.c" file
- * xacts. (Note: to reduce the probability of unexpected shmmax
- * failures, don't set max_prepared_transactions any higher than
- * actually needed by the prepared_xacts regression test.)
+ * xacts. We also set max_fdw_transctions to enable testing of atomic
+ * foreign transactions. (Note: to reduce the probability of unexpected
+ * shmmax failures, don't set max_prepared_transactions or
+ * max_prepared_foreign_transactions any higher than actually needed by the
+ * corresponding regression tests.).
I think we are not setting "max_fdw_transctions" anywhere.
Is this correct?
In the "src/bin/pg_waldump/rmgrdesc.c" file following header file used two times.
+ #include "access/fdw_xact.h"
I think we need to remove one line.
Regards,
Vinayak Pokale
On Thu, Mar 2, 2017 at 11:56 AM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:
> [ snip ]
> I think we are not setting "max_fdw_transctions" anywhere.
> Is this correct?

Thank you for looking at this patch. This comment is out of date; will fix.

> In the "src/bin/pg_waldump/rmgrdesc.c" file, the following header file
> is included twice:
> + #include "access/fdw_xact.h"
> I think we need to remove one line.

Not necessary; will get rid of it.

Since these are not functional bugs, I will incorporate these fixes when
making updated versions of the patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 3, 2017 at 1:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> Since these are not functional bugs, I will incorporate these fixes
> when making updated versions of the patches.

Attached is an updated set of patches. The differences from the previous
patches are:
* Fixed a few bugs.
* Separated the previous 000 patch into two patches.
* Renamed the pg_fdw_xact_resolver contrib module to
  fdw_transaction_resolver.
* Incorporated review comments from Vinayak.

Please review these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On Tue, Mar 7, 2017 at 5:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> Attached is an updated set of patches.
> Please review these patches.

Since the previous v9 patches conflict with current HEAD, attached are the
latest patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

I have tested the latest patch and it looks good to me, so I marked it
"Ready for Committer". Anyway, it would be great if anyone could also have
a look at the patches and send comments.

The new status of this patch is: Ready for Committer
On Thu, Mar 16, 2017 at 2:37 PM, Vinayak Pokale
<pokale_vinayak_q3@lab.ntt.co.jp> wrote:
> I have tested the latest patch and it looks good to me,
> so I marked it "Ready for Committer".
> Anyway, it would be great if anyone could also have a look at the
> patches and send comments.

Thank you for updating the status, but I found a bug in the 001 patch.
Attached are the latest patches. The differences are:
* Fixed a bug.
* Ran pgindent.
* Separated out the patch supporting the GetPreparedId API.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
On Wed, Mar 22, 2017 at 2:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [ snip ]
> Thank you for updating the status, but I found a bug in the 001 patch.
> Attached are the latest patches.

Since the previous patches conflict with current HEAD, I attached the
latest set of patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Mar 29, 2017 at 11:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Mar 22, 2017 at 2:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Thu, Mar 16, 2017 at 2:37 PM, Vinayak Pokale
>> <pokale_vinayak_q3@lab.ntt.co.jp> wrote:
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world: tested, passed
>>> Implements feature: tested, passed
>>> Spec compliant: tested, passed
>>> Documentation: tested, passed
>>>
>>> I have tested the latest patch and it looks good to me,
>>> so I marked it "Ready for committer".
>>> Anyway, it would be great if anyone could also have a look at the patches and send comments.
>>>
>>> The new status of this patch is: Ready for Committer
>>>
>>
>> Thank you for updating the status, but I found a bug in the 001 patch.
>> Attached are the latest patches. The differences are
>> * Fixed a bug.
>> * Ran pgindent.
>> * Separated out the patch supporting the GetPrepareID API.
>>
>
> Since the previous patches conflict with current HEAD, I attached the latest
> set of patches.
>

Vinayak, why did you mark this patch as "Move to next CF"? AFAIU there
has been no discussion about that yet.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Apr 7, 2017 at 10:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Vinayak, why did you mark this patch as "Move to next CF"? AFAIU there
> has been no discussion about that yet.

I'd like to discuss this patch. Clearly, a lot of work has been done here, but I am not sure about the approach.

If we were to commit this patch set, then you could optionally enable two_phase_commit for a postgres_fdw foreign server. If you did, then, modulo bugs and administrator shenanigans, and given proper configuration, you would be guaranteed that a successful commit of a transaction which touched postgres_fdw foreign tables would eventually end up committed or rolled back on all of the nodes, rather than committed on some and rolled back on others. However, you would not be guaranteed that all of those commits or rollbacks happen at anything like the same time. So, you would have a sort of eventual consistency. Any given snapshot might not be consistent, but if you waited long enough and with all nodes online, eventually all distributed transactions would be resolved in a consistent manner. That's kinda cool, but I think what people really want is a stronger guarantee, namely, that they will get consistent snapshots. It's not clear to me that this patch gets us any closer to that goal. Does anyone have a plan for how we'd get from here to that stronger goal? If not, is the patch useful enough to justify committing it for what it can already do? It would be particularly good to hear some end-user views on this functionality and whether or not they would use it and find it valuable.

On a technical level, I am pretty sure that it is not OK to call AtEOXact_FDWXacts() from the sections of CommitTransaction, AbortTransaction, and PrepareTransaction that are described as "non-critical resource releasing". At that point, it's too late to throw an error, and it is very difficult to imagine something that involves a TCP connection to another machine not being subject to error. You might say "well, we can just make sure that any problems are reported as a WARNING rather than an ERROR", but that's pretty hard to guarantee; most backend code assumes it can ERROR, so anything you call is a potential hazard. There is a second problem, too: any code that runs from here is not interruptible. The user can hit ^C all day and nothing will happen. That's a bad situation when you're busy doing network I/O. I'm not exactly sure what the best thing to do about this problem would be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
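To make the hazard concrete, below is a rough sketch (not from the patch; pgfdw_resolve_prepared() is an invented name) of the only kind of coding that would be even arguably safe in that post-commit path: trap everything and demote it to a WARNING. Even this doesn't change the facts that the error is swallowed and that the code remains uninterruptible.

    /* Illustrative sketch only; assumes a hypothetical pgfdw_resolve_prepared(). */
    static void
    pgfdw_resolve_prepared_no_error(ConnCacheEntry *entry)
    {
        PG_TRY();
        {
            /* issues COMMIT PREPARED / ROLLBACK PREPARED, may ereport(ERROR) */
            pgfdw_resolve_prepared(entry);
        }
        PG_CATCH();
        {
            ErrorData  *edata;

            /* Too late to propagate an ERROR: the local commit already happened. */
            edata = CopyErrorData();
            FlushErrorState();
            ereport(WARNING,
                    (errmsg("could not resolve prepared transaction on foreign server: %s",
                            edata->message)));
            FreeErrorData(edata);
        }
        PG_END_TRY();
    }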
On Thu, Jul 27, 2017 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 7, 2017 at 10:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Vinayak, why did you mark this patch as "Move to next CF"? AFAIU there
>> has been no discussion about that yet.
>
> I'd like to discuss this patch. Clearly, a lot of work has been done
> here, but I am not sure about the approach.

Thank you for the comment. I'd like to reply about the goal of this feature first.

> If we were to commit this patch set, then you could optionally enable
> two_phase_commit for a postgres_fdw foreign server. If you did, then,
> modulo bugs and administrator shenanigans, and given proper configuration,
> you would be guaranteed that a successful commit of a transaction which
> touched postgres_fdw foreign tables would eventually end up committed or
> rolled back on all of the nodes, rather than committed on some and rolled
> back on others. However, you would not be guaranteed that all of those
> commits or rollbacks happen at anything like the same time. So, you would
> have a sort of eventual consistency. Any given snapshot might not be
> consistent, but if you waited long enough and with all nodes online,
> eventually all distributed transactions would be resolved in a consistent
> manner. That's kinda cool, but I think what people really want is a
> stronger guarantee, namely, that they will get consistent snapshots.
> It's not clear to me that this patch gets us any closer to that goal.
> Does anyone have a plan for how we'd get from here to that stronger goal?
> If not, is the patch useful enough to justify committing it for what it
> can already do? It would be particularly good to hear some end-user views
> on this functionality and whether or not they would use it and find it
> valuable.

Yeah, this patch only guarantees that if you get a commit, the transaction is either committed or rolled back on all relevant nodes. And subsequent transactions can see a consistent result (if the server fails, we have to recover the in-doubt transactions properly after the crash). But it doesn't guarantee that a concurrent transaction sees a consistent result. To provide a cluster-wide consistent view, I think we need a transaction manager for distributed queries which is responsible for providing consistent snapshots. There were some discussions of the type of transaction manager, but at least we need a new transaction manager for distributed queries.

I think that providing a consistent view to concurrent transactions and committing or rolling back a transaction atomically should be separate features, and should be discussed separately. It's not useful, and users would complain, if we provided a consistent snapshot but a distributed transaction could commit on only some of the nodes. So this patch is also an important building block for providing consistent results.

> On a technical level, I am pretty sure that it is not OK to call
> AtEOXact_FDWXacts() from the sections of CommitTransaction,
> AbortTransaction, and PrepareTransaction that are described as
> "non-critical resource releasing". At that point, it's too late to
> throw an error, and it is very difficult to imagine something that
> involves a TCP connection to another machine not being subject to
> error. You might say "well, we can just make sure that any problems
> are reported as a WARNING rather than an ERROR", but that's pretty
> hard to guarantee; most backend code assumes it can ERROR, so anything
> you call is a potential hazard.
> There is a second problem, too: any
> code that runs from here is not interruptible. The user can hit ^C
> all day and nothing will happen. That's a bad situation when you're
> busy doing network I/O. I'm not exactly sure what the best thing to
> do about this problem would be.
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> On 27 Jul 2017, at 04:28, Robert Haas <robertmhaas@gmail.com> wrote:
>
> However, you would not
> be guaranteed that all of those commits or rollbacks happen at
> anything like the same time. So, you would have a sort of eventual
> consistency.

As far as I understand, any solution that provides proper isolation for distributed transactions in postgres will require distributed 2PC somewhere under the hood. That is just a consequence of the parallelism that the database allows — transactions can abort due to concurrent operations. So the dichotomy is simple: either we need 2PC, or we restrict write transactions to be physically serial.

In particular, both Postgres-XL/XC and the postgrespro multimaster are using 2PC to commit distributed transactions.

Some years ago we created patches to implement a transaction manager API, and that is just a way to inject consistent snapshots on different nodes; atomic commit itself is out of scope of the TM API (hmm, maybe it is better to call it a snapshot manager API?). That allows us to use it in quite different environments like fdw and logical replication, and both are using 2PC.

I want to submit the TM API again during this release cycle, along with an implementation for fdw. And I planned to base it on top of this patch. So I have already rebased Masahiko's patch to current -master and started writing a long list of nitpicks, but haven't finished it yet.

Also I see quite a big value in this patch even without tm/snapshots/whatever. Right now fdw guarantees neither isolation nor atomicity. And if one isn't doing cross-node analytical transactions it will be safe to live without isolation. But living without atomicity means that some parts of the data can be lost without a simple way to detect and fix that.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
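For readers unfamiliar with that work, the shape of such a pluggable TM hook set is roughly the following; the struct and member names here are invented for illustration (loosely modeled on the wiki description), not the actual patch:

    /* Hypothetical sketch of a pluggable transaction/snapshot manager API. */
    typedef struct TransactionManager
    {
        /* record transaction status; the stock implementation writes clog */
        void        (*SetTransactionStatus) (TransactionId xid,
                                             XidStatus status);

        /* build a snapshot; the stock implementation is GetSnapshotData() */
        Snapshot    (*GetSnapshot) (Snapshot snapshot);

        /* visibility test used by the tuple-visibility routines */
        bool        (*XidInSnapshot) (TransactionId xid, Snapshot snapshot);
    } TransactionManager;

    /* An extension injects its own implementation by swapping this pointer. */
    extern TransactionManager *TM;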
On Thu, Jul 27, 2017 at 5:08 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> As far as I understand, any solution that provides proper isolation for
> distributed transactions in postgres will require distributed 2PC somewhere
> under the hood. That is just a consequence of the parallelism that the
> database allows — transactions can abort due to concurrent operations. So
> the dichotomy is simple: either we need 2PC, or we restrict write
> transactions to be physically serial.
>
> In particular, both Postgres-XL/XC and the postgrespro multimaster are
> using 2PC to commit distributed transactions.

Ah, OK. I was imagining that a transaction manager might be responsible for managing both snapshots and distributed commit. But if the transaction manager only handles the snapshots (how?) and the commit has to be done using 2PC, then we need this.

> Also I see quite a big value in this patch even without tm/snapshots/whatever.
> Right now fdw guarantees neither isolation nor atomicity. And if one isn't
> doing cross-node analytical transactions it will be safe to live without
> isolation. But living without atomicity means that some parts of the data can
> be lost without a simple way to detect and fix that.

OK, thanks for weighing in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 27, 2017 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On a technical level, I am pretty sure that it is not OK to call
> AtEOXact_FDWXacts() from the sections of CommitTransaction,
> AbortTransaction, and PrepareTransaction that are described as
> "non-critical resource releasing". At that point, it's too late to
> throw an error, and it is very difficult to imagine something that
> involves a TCP connection to another machine not being subject to
> error. You might say "well, we can just make sure that any problems
> are reported as a WARNING rather than an ERROR", but that's pretty
> hard to guarantee; most backend code assumes it can ERROR, so anything
> you call is a potential hazard. There is a second problem, too: any
> code that runs from here is not interruptible. The user can hit ^C
> all day and nothing will happen. That's a bad situation when you're
> busy doing network I/O. I'm not exactly sure what the best thing to
> do about this problem would be.
>

The remote transaction can be committed/aborted only after the fate of the local transaction is decided. If we commit the remote transaction and abort the local transaction, that's not good. The AtEOXact* functions are called immediately after that decision, in the post-commit/abort phase. So, if we want to commit/abort the remote transaction immediately, it has to be done in post-commit/abort processing. Instead, if we delegate that to the remote transaction resolver backend (introduced by the patches), the delay between the local commit and the remote commits depends upon when the resolver gets a chance to run and process those transactions. One could argue that that delay would anyway exist when post-commit/abort processing fails to resolve a remote transaction. But given the high availability of servers these days, in most of the cases the remote transaction will be resolved in the post-commit/abort phase. I think we should optimize for the most common case. Your concern is still valid, that we shouldn't raise an error or do anything critical in the post-commit/abort phase. So we should devise a way to send COMMIT/ABORT PREPARED messages to the remote server in an asynchronous fashion, carefully avoiding errors. Recent changes to 2PC have improved performance in that area to a great extent. Relying on a resolver backend to resolve remote transactions would erode that performance gain.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
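For the asynchronous-send idea, something along these lines should be possible with only documented libpq calls; this is a minimal sketch (the gid handling and retry policy are hand-waved, and real code would need to escape the identifier):

    #include <stdio.h>
    #include "libpq-fe.h"

    /*
     * Fire off COMMIT PREPARED without waiting for the result; returns
     * false if the message could not even be queued, in which case the
     * entry is left for a resolver/retry mechanism.
     */
    static bool
    send_commit_prepared_async(PGconn *conn, const char *gid)
    {
        char        sql[256];

        snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
        if (!PQsendQuery(conn, sql))
            return false;

        /*
         * The result is collected later with PQconsumeInput() and
         * PQgetResult(), so the post-commit path never blocks on the
         * network and never throws.
         */
        return true;
    }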
On Thu, Jul 27, 2017 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 27, 2017 at 5:08 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> As far as I understand, any solution that provides proper isolation for
>> distributed transactions in postgres will require distributed 2PC somewhere
>> under the hood. That is just a consequence of the parallelism that the
>> database allows — transactions can abort due to concurrent operations. So
>> the dichotomy is simple: either we need 2PC, or we restrict write
>> transactions to be physically serial.
>>
>> In particular, both Postgres-XL/XC and the postgrespro multimaster are
>> using 2PC to commit distributed transactions.
>
> Ah, OK. I was imagining that a transaction manager might be
> responsible for managing both snapshots and distributed commit. But
> if the transaction manager only handles the snapshots (how?) and the
> commit has to be done using 2PC, then we need this.

One way to provide snapshots to participant nodes is to give the snapshot data to them using the libpq protocol along with the query when the coordinator node starts a transaction on a remote node (or can we now use the snapshot export infrastructure?). IIUC Postgres-XL/XC uses this approach. That also requires sharing the same XID space with all remote nodes. Perhaps the CSN based snapshot can make this simpler.

>> Also I see quite a big value in this patch even without tm/snapshots/whatever.
>> Right now fdw guarantees neither isolation nor atomicity. And if one isn't
>> doing cross-node analytical transactions it will be safe to live without
>> isolation. But living without atomicity means that some parts of the data can
>> be lost without a simple way to detect and fix that.
>
> OK, thanks for weighing in.
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Jul 28, 2017 at 7:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> That also requires sharing the same XID space with all remote nodes.

You are putting your finger on the main bottleneck with global consistency that XC and XL have because of that. And the source feeding the XIDs is a SPOF.

> Perhaps the CSN based snapshot can make this simpler.

Hm. This needs a closer look.

--
Michael
On Thu, Jul 27, 2017 at 8:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
> The remote transaction can be committed/aborted only after the fate of
> the local transaction is decided. If we commit the remote transaction and
> abort the local transaction, that's not good. The AtEOXact* functions are
> called immediately after that decision, in the post-commit/abort phase. So,
> if we want to commit/abort the remote transaction immediately, it has
> to be done in post-commit/abort processing. Instead, if we delegate
> that to the remote transaction resolver backend (introduced by the
> patches), the delay between the local commit and the remote commits depends
> upon when the resolver gets a chance to run and process those
> transactions. One could argue that that delay would anyway exist when
> post-commit/abort processing fails to resolve a remote transaction. But
> given the high availability of servers these days, in most of the cases the
> remote transaction will be resolved in the post-commit/abort phase. I
> think we should optimize for the most common case. Your concern is still
> valid, that we shouldn't raise an error or do anything critical in the
> post-commit/abort phase. So we should devise a way to send
> COMMIT/ABORT PREPARED messages to the remote server in an asynchronous
> fashion, carefully avoiding errors. Recent changes to 2PC have improved
> performance in that area to a great extent. Relying on a resolver
> backend to resolve remote transactions would erode that performance
> gain.

I think there are two separate but interconnected issues here. One is that if we give the user a new command prompt without resolving the remote transaction, then they might run a new query that sees their own work as committed, which would be bad. Or, they might commit, wait for the acknowledgement, and then tell some other session to go look at the data, and find it not there. That would also be bad. I think the solution is likely to do something like what we did for synchronous replication in commit 9a56dc3389b9470031e9ef8e45c95a680982e01a -- wait for the remote transaction to be resolved (by the background process) but allow an interrupt to escape the wait-loop.

The second issue is that having the resolver resolve transactions might be slower than doing it in the foreground. I don't necessarily see a reason why that should be a big problem. I mean, the resolver might need to establish a separate connection, but if it keeps that connection open for a while (say, 5 minutes) in case further transactions arrive then it won't be an issue except on really low-volume systems, which isn't really a case I think we need to worry about very much. Also, the hand-off to the resolver might take some time, but that's equally true for sync rep and we're living with it there. Anything else is presumably just the resolver itself being inefficient, which seems like something that can simply be fixed.

FWIW, I don't think the present resolver implementation is likely to be what we want. IIRC, it's just calling an SQL function which doesn't seem like a good approach. Ideally we should stick an entry into a shared memory queue and then ping the resolver via SetLatch, and it can directly invoke an FDW method on the data from the shared memory queue. It should be possible to set things up so that a user who wishes to do so can run multiple copies of the resolver process at the same time, which would be a good way to keep latency down if the system is very busy with distributed transactions.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
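A sketch of what that commit-side wait could look like, modeled on SyncRepWaitForLSN(); FdwXactResolved() and the fields it reads are hypothetical names, and a real patch would define its own wait event:

    /*
     * Park on our latch until a resolver marks our entry resolved, but
     * stay responsive to cancel/die interrupts, as sync rep does.
     */
    for (;;)
    {
        ResetLatch(MyLatch);

        if (FdwXactResolved(MyProc))    /* hypothetical */
            break;

        if (ProcDiePending || QueryCancelPending)
        {
            ereport(WARNING,
                    (errmsg("canceling wait for foreign transaction resolution"),
                     errdetail("The transaction remains prepared on the foreign servers and will be resolved in the background.")));
            break;
        }

        WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
                  WAIT_EVENT_SYNC_REP);     /* placeholder wait event */
    }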
On Fri, Jul 28, 2017 at 10:14 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Fri, Jul 28, 2017 at 7:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> That also requires sharing the same XID space with all remote nodes.
>
> You are putting your finger on the main bottleneck with global
> consistency that XC and XL have because of that. And the source feeding
> the XIDs is a SPOF.
>
>> Perhaps the CSN based snapshot can make this simpler.
>
> Hm. This needs a closer look.

With or without CSNs, sharing the same XID space across all nodes is undesirable in a loosely-coupled network. If only a small fraction of transactions are distributed, incurring the overhead of synchronizing XID assignment for every transaction is not good. Suppose node A processes many transactions and node B only a few transactions; then, XID advancement caused by node A forces node B to perform vacuum for wraparound. Not fun. Or, if you have an OLTP workload running on A and an OLTP workload running on B that are independent of each other, and occasional reporting queries that scan both, you'll be incurring the overhead of keeping A and B consistent for a lot of transactions that don't need it. Of course, when A and B are tightly coupled and basically all transactions are scanning both, locking the XID space together *may* be the best approach, but even then there are notable disadvantages -- e.g. they can't both continue processing write transactions if the connection between the two is severed.

An alternative approach is to have some kind of other identifier, let's call it a distributed transaction ID (DXID), which is mapped by each node onto a local XID.

Regardless of whether we share XIDs or DXIDs, we need a more complex concept of transaction state than we have now. Right now, transactions are basically in-progress, committed, or aborted, but there's also the state where the status of the transaction is known by someone but not known locally. You can imagine something like: during the prepare phase, all nodes set the status in clog to "prepared". Then, if that succeeds, the leader changes the status to "committed" or "aborted" and tells all nodes to do the same. Thereafter, any time someone inquires about the status of that transaction, we go ask all of the other nodes in the cluster; if they all think it's prepared, then it's prepared -- but if any of them think it's committed or aborted, then we change our local status to match and return that status. So once one node changes the status to committed or aborted it can propagate through the cluster even if connectivity is lost for a while.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
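To make the DXID idea concrete, the per-node bookkeeping could be as simple as a shared hash mapping the cluster-wide ID to the local XID plus the extended status; everything below is invented for illustration:

    /* Hypothetical per-node map from distributed to local transaction IDs. */
    typedef uint64 DistributedTransactionId;

    typedef enum DxidStatus
    {
        DXID_IN_PROGRESS,
        DXID_PREPARED,          /* fate decided elsewhere, not known locally */
        DXID_COMMITTED,
        DXID_ABORTED
    } DxidStatus;

    typedef struct DxidMapEntry
    {
        DistributedTransactionId dxid;      /* hash key */
        TransactionId   local_xid;          /* XID this node assigned */
        DxidStatus      status;
    } DxidMapEntry;

    /* created at startup with ShmemInitHash(), keyed by dxid */
    static HTAB *DxidMap;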
Robert Haas wrote:

> An alternative approach is to have some kind of other identifier,
> let's call it a distributed transaction ID (DXID), which is mapped by
> each node onto a local XID.

Postgres-XL seems to manage this problem by using a transaction manager node, which is in charge of assigning snapshots. I don't know how that works, but perhaps adding that concept here could be useful too. One critical point of that design is that the app connects not directly to the underlying Postgres server but instead to some other node which is, or connects to, the node that manages the snapshots.

Maybe Michael can explain in better detail how it works, and/or how (and if) it could be applied here.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 31, 2017 at 1:27 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Postgres-XL seems to manage this problem by using a transaction manager
> node, which is in charge of assigning snapshots. I don't know how that
> works, but perhaps adding that concept here could be useful too. One
> critical point of that design is that the app connects not directly to
> the underlying Postgres server but instead to some other node which is,
> or connects to, the node that manages the snapshots.
>
> Maybe Michael can explain in better detail how it works, and/or how (and
> if) it could be applied here.

I suspect that if you've got a central coordinator server that is the jumping-off point for all distributed transactions, the Postgres-XL approach is hard to beat (at least in concept, not sure about the implementation). That server is brokering all of the connections to the data nodes anyway, so it might as well tell them all what snapshots to use while it's there. When you scale to multiple coordinators, though, it's less clear that it's the best approach. Now one coordinator has to be the GTM master, and that server is likely to become a bottleneck -- plus talking to it involves extra network hops for all the other coordinators. When you then move the needle a bit further and imagine a system where the idea of a coordinator doesn't even exist, and you've just got a loosely coupled distributed system where distributed transactions might arrive on any node, all of which are also servicing local transactions, then it seems pretty likely that the Postgres-XL approach is not the best fit.

We might want to support multiple models. Which one to support first is a harder question. The thing I like least about the Postgres-XC approach is it seems inevitable that, as Michael says, the central server handing out XIDs and snapshots is bound to become a bottleneck. That type of system implicitly constructs a total order of all distributed transactions, but we don't really need a total order. If two transactions don't touch the same data and there's no overlapping transaction that can notice the commit order, then we could make those commit decisions independently on different nodes without caring which one "happens first". The problem is that it might take so much bookkeeping to figure out whether that is in fact the case in a particular instance that it's even more expensive than having a central server that functions as a global bottleneck.

It might be worth some study not only of Postgres-XL but also of other databases that claim to provide distributed transactional consistency across nodes. I've found literature on this topic from time to time over the years, but I'm not sure what the best practices in this area actually are. https://en.wikipedia.org/wiki/Global_serializability claims that a technique called Commitment Ordering (CO) is teh awesome, but I've got my doubts about whether that's really an objective description of the state of the art. One clue is that the global serializability article says three separate times that the technique has been widely misunderstood. I'm not sure exactly which Wikipedia guideline that violates, but I think Wikipedia is supposed to summarize the views that exist on a topic in accordance with their prevalence, not take a position on which view is correct.
https://en.wikipedia.org/wiki/Commitment_ordering contains citations from the papers of only one guy, Yoav Raz, which is another hint that this may not be as widely-regarded a technique as the person who wrote these articles thinks it should be. Anyway, it would be good to understand what other well-regarded systems do before we choose what we want to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Aug 1, 2017 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jul 31, 2017 at 1:27 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Postgres-XL seems to manage this problem by using a transaction manager
>> node, which is in charge of assigning snapshots. I don't know how that
>> works, but perhaps adding that concept here could be useful too. One
>> critical point of that design is that the app connects not directly to
>> the underlying Postgres server but instead to some other node which is,
>> or connects to, the node that manages the snapshots.
>>
>> Maybe Michael can explain in better detail how it works, and/or how (and
>> if) it could be applied here.
>
> I suspect that if you've got a central coordinator server that is the
> jumping-off point for all distributed transactions, the Postgres-XL
> approach is hard to beat (at least in concept, not sure about the
> implementation). That server is brokering all of the connections to
> the data nodes anyway, so it might as well tell them all what snapshots
> to use while it's there. When you scale to multiple coordinators,
> though, it's less clear that it's the best approach. Now one
> coordinator has to be the GTM master, and that server is likely to
> become a bottleneck -- plus talking to it involves extra network hops
> for all the other coordinators. When you then move the needle a bit
> further and imagine a system where the idea of a coordinator doesn't
> even exist, and you've just got a loosely coupled distributed system
> where distributed transactions might arrive on any node, all of which
> are also servicing local transactions, then it seems pretty likely that
> the Postgres-XL approach is not the best fit.
>
> We might want to support multiple models. Which one to support first
> is a harder question. The thing I like least about the Postgres-XC
> approach is it seems inevitable that, as Michael says, the central
> server handing out XIDs and snapshots is bound to become a bottleneck.
> That type of system implicitly constructs a total order of all
> distributed transactions, but we don't really need a total order. If
> two transactions don't touch the same data and there's no overlapping
> transaction that can notice the commit order, then we could make those
> commit decisions independently on different nodes without caring which
> one "happens first". The problem is that it might take so much
> bookkeeping to figure out whether that is in fact the case in a
> particular instance that it's even more expensive than having a central
> server that functions as a global bottleneck.
>
> It might be worth some study not only of Postgres-XL but also of other
> databases that claim to provide distributed transactional consistency
> across nodes. I've found literature on this topic from time to time
> over the years, but I'm not sure what the best practices in this area
> actually are.

Yeah, it's worth studying other databases and considering an approach that goes well with the PostgreSQL architecture. I've read some papers related to distributed transaction management, but I'm also not sure what the best practices in this area are. However, one trend I've seen is that some cloud-native databases such as Google Spanner[1] and CockroachDB employ techniques that use timestamps to determine visibility without centralized coordination. Google Spanner uses GPS clocks and atomic clocks, but since these are not common hardware, CockroachDB uses local timestamps with NTP instead.
Also, other transaction techniques using local timestamps have been discussed. For example, Clock-SI[2] derives snapshots and commit timestamps from loosely synchronized physical clocks, though it doesn't support the serializable isolation level. IIUC the postgrespro multi-master cluster employs a technique based on that. I've not read it deeply yet, but last week I found a new paper[3] which introduces a new SI mechanism that allows transactions to determine their timestamps autonomously, without relying on centralized coordination. PostgreSQL uses XIDs to determine visibility now, but mapping an XID to its timestamp using the commit timestamp feature might allow PostgreSQL to use timestamps for that purpose.

> https://en.wikipedia.org/wiki/Global_serializability
> claims that a technique called Commitment Ordering (CO) is teh awesome,
> but I've got my doubts about whether that's really an objective
> description of the state of the art. One clue is that the global
> serializability article says three separate times that the technique
> has been widely misunderstood. I'm not sure exactly which Wikipedia
> guideline that violates, but I think Wikipedia is supposed to summarize
> the views that exist on a topic in accordance with their prevalence,
> not take a position on which view is correct.
> https://en.wikipedia.org/wiki/Commitment_ordering contains citations
> from the papers of only one guy, Yoav Raz, which is another hint that
> this may not be as widely-regarded a technique as the person who wrote
> these articles thinks it should be. Anyway, it would be good to
> understand what other well-regarded systems do before we choose what
> we want to do.

[1] https://research.google.com/archive/spanner.html
[2] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[3] https://arxiv.org/pdf/1704.01355.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
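If the existing commit-timestamp machinery were pressed into service for that, the visibility primitive might look something like this sketch (track_commit_timestamp must be on; the snapshot-timestamp notion itself is the hypothetical part):

    #include "postgres.h"
    #include "access/commit_ts.h"

    /*
     * Sketch: treat an XID as visible to a timestamp-based snapshot if
     * its recorded commit timestamp is no later than the snapshot's.
     */
    static bool
    committed_before_snapshot(TransactionId xid, TimestampTz snapshot_ts)
    {
        TimestampTz commit_ts;

        /* false if xid never committed, or its commit ts was not recorded */
        if (!TransactionIdGetCommitTsData(xid, &commit_ts, NULL))
            return false;

        return commit_ts <= snapshot_ts;
    }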
On Tue, Aug 1, 2017 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 27, 2017 at 8:25 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> The remote transaction can be committed/aborted only after the fate of
>> the local transaction is decided. If we commit the remote transaction and
>> abort the local transaction, that's not good. The AtEOXact* functions are
>> called immediately after that decision, in the post-commit/abort phase.
>> So, if we want to commit/abort the remote transaction immediately, it has
>> to be done in post-commit/abort processing. Instead, if we delegate that
>> to the remote transaction resolver backend (introduced by the patches),
>> the delay between the local commit and the remote commits depends upon
>> when the resolver gets a chance to run and process those transactions.
>> One could argue that that delay would anyway exist when post-commit/abort
>> processing fails to resolve a remote transaction. But given the high
>> availability of servers these days, in most of the cases the remote
>> transaction will be resolved in the post-commit/abort phase. I think we
>> should optimize for the most common case. Your concern is still valid,
>> that we shouldn't raise an error or do anything critical in the
>> post-commit/abort phase. So we should devise a way to send COMMIT/ABORT
>> PREPARED messages to the remote server in an asynchronous fashion,
>> carefully avoiding errors. Recent changes to 2PC have improved
>> performance in that area to a great extent. Relying on a resolver
>> backend to resolve remote transactions would erode that performance gain.
>
> I think there are two separate but interconnected issues here. One is
> that if we give the user a new command prompt without resolving the
> remote transaction, then they might run a new query that sees their
> own work as committed, which would be bad. Or, they might commit,
> wait for the acknowledgement, and then tell some other session to go
> look at the data, and find it not there. That would also be bad. I
> think the solution is likely to do something like what we did for
> synchronous replication in commit
> 9a56dc3389b9470031e9ef8e45c95a680982e01a -- wait for the remote
> transaction to be resolved (by the background process) but allow an
> interrupt to escape the wait-loop.
>
> The second issue is that having the resolver resolve transactions
> might be slower than doing it in the foreground. I don't necessarily
> see a reason why that should be a big problem. I mean, the resolver
> might need to establish a separate connection, but if it keeps that
> connection open for a while (say, 5 minutes) in case further
> transactions arrive then it won't be an issue except on really
> low-volume systems, which isn't really a case I think we need to worry
> about very much. Also, the hand-off to the resolver might take some
> time, but that's equally true for sync rep and we're living with it
> there. Anything else is presumably just the resolver itself being
> inefficient, which seems like something that can simply be fixed.

I think using a solution similar to sync rep to wait for the transaction to be resolved is a good way. One concern I have is that if we have one resolver process per backend process, switching connections between participant nodes would be an overhead. In the current implementation the backend process uses cached connections to the remote servers. On the other hand, if we have one resolver process per database on a remote server, the backend process has to communicate with multiple resolver processes.
> FWIW, I don't think the present resolver implementation is likely to
> be what we want. IIRC, it's just calling an SQL function which
> doesn't seem like a good approach. Ideally we should stick an entry
> into a shared memory queue and then ping the resolver via SetLatch,
> and it can directly invoke an FDW method on the data from the shared
> memory queue. It should be possible to set things up so that a user
> who wishes to do so can run multiple copies of the resolver process at
> the same time, which would be a good way to keep latency down if the
> system is very busy with distributed transactions.
>

In the current implementation, the resolver process exists to resolve in-doubt transactions. That process periodically checks whether there are unresolved transactions in shared memory and tries to resolve them according to the commit log. If we change it so that the backend process can communicate with the resolver process via SetLatch, the resolver process would be better implemented in core rather than as a contrib module.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> On 31 Jul 2017, at 20:03, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Regardless of whether we share XIDs or DXIDs, we need a more complex
> concept of transaction state than we have now.

Seems that the discussion has shifted from 2PC itself to general issues with distributed transactions. So it is probably appropriate to share here a summary of things we have done in the area of distributed visibility. During the last two years we tried three quite different approaches and finally settled on Clock-SI.

At first, to test different approaches, we did a small patch that wraps calls to visibility-related functions (SetTransactionStatus, GetSnapshot, etc.; described in detail at the wiki[1]) in order to allow overloading them from an extension. Such an approach allows implementing almost anything related to distributed visibility, since you have full control over how local visibility is done. That API isn't a hard prerequisite, and if one wants to create some concrete implementation it can be done just in place. However, I think it is good to have such an API in some form.

So, the three approaches that we tried:

1) Postgres-XL-like:

That is the most straightforward way. Basically we need a separate network service (GTM/DTM) that is responsible for xid generation and managing the running-list of transactions. So acquiring an xid and a snapshot is done by network calls. Because of the shared xid space it is possible to compare xids in the ordinary way and get the right order. The gap between non-simultaneous commits by 2PC is covered by the fact that we are getting our snapshots from the GTM, and it will remove an xid from the running list only when the transaction has committed on both nodes. Such an approach is okay for OLAP-style transactions where tps isn't high. But for OLTP with a high transaction rate the GTM will immediately become a bottleneck, since even write transactions need to get a snapshot from the GTM. Even if they access only one node.

2) Incremental SI [2]:

An approach with a central coordinator that can allow local reads without network communication by slightly altering the visibility rules. Despite the fact that it is kind of patented, we also failed to achieve proper visibility by implementing the algorithms from that paper. It always showed some inconsistencies. Maybe because of bugs in our implementation, maybe because of some typos/mistakes in the algorithm description itself. The reasoning in the paper wasn't very clear to us, and there were patent issues as well, so we just left that.

3) Clock-SI [3]:

This is an MS research paper that describes an algorithm similar to the ones used in Spanner and CockroachDB, without a central GTM and with reads that do not require a network roundtrip. There are two ideas behind it:

* Assuming snapshot isolation and visibility on a node are based on CSNs, use the local time as the CSN; then, when you are doing 2PC, collect the prepare time from all participating nodes and commit the transaction everywhere with the maximum of those times. If a node during a read faces tuples committed by a tx with a CSN greater than its snapshot CSN (which can happen due to time desynchronisation between nodes), then it just waits until that time comes. So time desynchronisation can affect performance, but can't affect correctness.

* During a distributed commit the transaction is neither running (if it commits then the tuple should already be visible) nor committed/aborted (it still can be aborted, so it is illegal to read). So here an IN-DOUBT transaction state appears, during which readers should wait for writers.

We managed to implement that using the mentioned XTM api. The XID<->CSN mapping is maintained by the extension itself. Speed/scalability are also good.
I want to resubmit the implementation of that algorithm for FDW later in August, along with some isolation tests based on the set of queries in [4].

[1] https://wiki.postgresql.org/wiki/DTM#eXtensible_Transaction_Manager_API
[2] http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf
[3] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[4] https://github.com/ept/hermitage

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
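The two Clock-SI rules above fit in a few lines of code; this is a toy sketch under those rules, with invented helper names, not the extension's actual implementation:

    /* Commit rule: the distributed CSN is the max of all prepare timestamps. */
    static TimestampTz
    clocksi_commit_csn(const TimestampTz *prepare_ts, int nnodes)
    {
        TimestampTz csn = prepare_ts[0];
        int         i;

        for (i = 1; i < nnodes; i++)
            if (prepare_ts[i] > csn)
                csn = prepare_ts[i];
        return csn;             /* COMMIT PREPARED everywhere with this CSN */
    }

    /* Read rule: a CSN from the "future" means our clock lags; wait it out. */
    static void
    clocksi_wait_for_csn(TimestampTz tuple_csn)
    {
        while (GetCurrentTimestamp() < tuple_csn)
            pg_usleep(1000L);   /* clock skew costs latency, not correctness */
    }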
On Mon, Jul 31, 2017 at 7:27 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>
>> An alternative approach is to have some kind of other identifier,
>> let's call it a distributed transaction ID (DXID), which is mapped by
>> each node onto a local XID.
>
> Postgres-XL seems to manage this problem by using a transaction manager
> node, which is in charge of assigning snapshots. I don't know how that
> works, but perhaps adding that concept here could be useful too. One
> critical point of that design is that the app connects not directly to
> the underlying Postgres server but instead to some other node which is,
> or connects to, the node that manages the snapshots.
>
> Maybe Michael can explain in better detail how it works, and/or how (and
> if) it could be applied here.

XL (and XC) use a transaction ID that plugs in directly with the internal XID assigned by Postgres, actually bypassing what Postgres assigns to each backend if a transaction needs one. So if transactions are not evenly spread among the nodes, performance gets impacted. Now, when we worked on this project we noticed that we gained in performance by reducing the number of requests and grouping them together, so a proxy layer was added between the global transaction manager and Postgres to group those requests. This does not change the fact that read-committed transactions still need snapshots for each query, which is costly. So this approach hurts less with analytic queries, and more with OLTP.

2PC transaction status was tracked as well in the GTM. This allows fancy things like being able to prepare a transaction on node 1 and commit it on node 2, for example.

I am honestly not sure that you need to add anything at the clog level, for example, but I think that storing the metadata of a transaction at the FDW level is a rather correct approach to the matter. That's what Greenplum actually does if I recall correctly (Heikki save me!): it has one coordinator with such metadata handling, and a bunch of underlying nodes that store the data. Citus does that as well if I recall correctly. So instead of decentralizing this information, it gets stored in a Postgres coordinator instance.

--
Michael
On Tue, Aug 1, 2017 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jul 27, 2017 at 8:25 AM, Ashutosh Bapat
> <ashutosh.bapat@enterprisedb.com> wrote:
>> The remote transaction can be committed/aborted only after the fate of
>> the local transaction is decided. If we commit the remote transaction and
>> abort the local transaction, that's not good. The AtEOXact* functions are
>> called immediately after that decision, in the post-commit/abort phase.
>> So, if we want to commit/abort the remote transaction immediately, it has
>> to be done in post-commit/abort processing. Instead, if we delegate that
>> to the remote transaction resolver backend (introduced by the patches),
>> the delay between the local commit and the remote commits depends upon
>> when the resolver gets a chance to run and process those transactions.
>> One could argue that that delay would anyway exist when post-commit/abort
>> processing fails to resolve a remote transaction. But given the high
>> availability of servers these days, in most of the cases the remote
>> transaction will be resolved in the post-commit/abort phase. I think we
>> should optimize for the most common case. Your concern is still valid,
>> that we shouldn't raise an error or do anything critical in the
>> post-commit/abort phase. So we should devise a way to send COMMIT/ABORT
>> PREPARED messages to the remote server in an asynchronous fashion,
>> carefully avoiding errors. Recent changes to 2PC have improved
>> performance in that area to a great extent. Relying on a resolver
>> backend to resolve remote transactions would erode that performance gain.
>
> I think there are two separate but interconnected issues here. One is
> that if we give the user a new command prompt without resolving the
> remote transaction, then they might run a new query that sees their
> own work as committed, which would be bad. Or, they might commit,
> wait for the acknowledgement, and then tell some other session to go
> look at the data, and find it not there. That would also be bad. I
> think the solution is likely to do something like what we did for
> synchronous replication in commit
> 9a56dc3389b9470031e9ef8e45c95a680982e01a -- wait for the remote
> transaction to be resolved (by the background process) but allow an
> interrupt to escape the wait-loop.
>
> The second issue is that having the resolver resolve transactions
> might be slower than doing it in the foreground. I don't necessarily
> see a reason why that should be a big problem. I mean, the resolver
> might need to establish a separate connection, but if it keeps that
> connection open for a while (say, 5 minutes) in case further
> transactions arrive then it won't be an issue except on really
> low-volume systems, which isn't really a case I think we need to worry
> about very much. Also, the hand-off to the resolver might take some
> time, but that's equally true for sync rep and we're living with it
> there. Anything else is presumably just the resolver itself being
> inefficient, which seems like something that can simply be fixed.
>
> FWIW, I don't think the present resolver implementation is likely to
> be what we want. IIRC, it's just calling an SQL function which
> doesn't seem like a good approach. Ideally we should stick an entry
> into a shared memory queue and then ping the resolver via SetLatch,
> and it can directly invoke an FDW method on the data from the shared
> memory queue.
> It should be possible to set things up so that a user
> who wishes to do so can run multiple copies of the resolver process at
> the same time, which would be a good way to keep latency down if the
> system is very busy with distributed transactions.
>

Based on the review comments from Robert, I'm planning to make a big change to the architecture of this patch, so that a backend process works together with a dedicated background worker that is responsible for resolving foreign transactions. For the users of this feature, it will be almost the same as what this patch has been doing, except for adding a new GUC parameter that controls the number of resolver processes to launch. That is, we can have multiple resolver processes to keep latency down.

From a technical point of view, the processing of a transaction involving multiple foreign servers will be changed as follows.

* Backend processes
1. In the PreCommit phase, prepare the transaction on the foreign servers and save fdw_xact entries into the array on shmem. Also create an fdw_xact_state entry in the shmem hash that has the index of each fdw_xact entry.
2. Local commit/abort.
3. Change its process state to FDWXACT_WAITING and enqueue MyProc to the shmem queue.
4. Ping the resolver process via SetLatch.
5. Wait to be woken up.

* Resolver processes
1. Fetch a PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from the shmem hash by XID-a.
3. Iterate over the fdw_xact entries using the index, and resolve the foreign transactions.
3-a. If even one foreign transaction fails to resolve, raise an error.
4. Change the waiting backend's state to FDWXACT_COMPLETED and release it.

Also, the resolver process periodically scans over the array of fdw_xact entries and tries to resolve in-doubt transactions.

This patch still has concerns about the design and I'm planning to update the patch for the next commit fest. So I'll mark this as "Waiting on Author". Feedback and suggestions are very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
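Backend steps 3-5 above could reduce to something like the following sketch; all of the shared-state names here (FdwXactCtl, the PGPROC fields, the state constants) are hypothetical:

    /* Step 3: mark ourselves waiting and join the shmem queue. */
    SpinLockAcquire(&FdwXactCtl->mutex);
    MyProc->fdwXactState = FDWXACT_WAITING;
    SHMQueueInsertBefore(&FdwXactCtl->waiters, &MyProc->fdwXactLinks);
    SpinLockRelease(&FdwXactCtl->mutex);

    /* Step 4: ping the resolver. */
    SetLatch(&FdwXactCtl->resolverLatch);

    /* Step 5: sleep until a resolver flips our state. */
    while (MyProc->fdwXactState != FDWXACT_COMPLETED)
    {
        WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1, 0);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }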
On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Based on the review comments from Robert, I'm planning to make a big
> change to the architecture of this patch, so that a backend process
> works together with a dedicated background worker that is responsible
> for resolving foreign transactions. For the users of this feature, it
> will be almost the same as what this patch has been doing, except for
> adding a new GUC parameter that controls the number of resolver
> processes to launch. That is, we can have multiple resolver processes
> to keep latency down.

Multiple resolver processes are useful but get a bit complicated. For example, if process 1 has a connection open to foreign server A and process 2 does not, and a request arrives that needs to be handled on foreign server A, what happens? If process 1 is already busy doing something else, probably we want process 2 to try to open a new connection to foreign server A and handle the request. But if processes 1 and 2 are both idle, ideally we'd like 1 to get that request rather than 2. That seems a bit difficult to get working, though. Maybe we should just ignore such considerations in the first version.

> * Resolver processes
> 1. Fetch a PGPROC entry from the shmem queue and get its XID (say, XID-a).
> 2. Get the fdw_xact_state entry from the shmem hash by XID-a.
> 3. Iterate over the fdw_xact entries using the index, and resolve the foreign
> transactions.
> 3-a. If even one foreign transaction fails to resolve, raise an error.
> 4. Change the waiting backend's state to FDWXACT_COMPLETED and release it.

Comments:

- Note that any error we raise here won't reach the user; this is a background process. We don't want to get into a loop where we just error out repeatedly forever -- at least not if there's any other reasonable choice.

- I suggest that we ought to track the status for each XID separately on each server rather than just track the XID status overall. That way, if transaction resolution fails on one server, we don't keep trying to reconnect to the others.

- If we go to resolve a remote transaction and find that no such remote transaction exists, what should we do? I'm inclined to think that we should regard that as if we had succeeded in resolving the transaction. Certainly, if we've retried the server repeatedly, it might be that we previously succeeded in resolving the transaction but then the network connection was broken before we got the success message back from the remote server. But even if that's not the scenario, I think we should assume that the DBA or some other system resolved it and therefore we don't need to do anything further. If we assume anything else, then we just go into an infinite error loop, which isn't useful behavior. We could log a message, though (for example, LOG: unable to resolve foreign transaction ... because it does not exist).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
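That last point is mechanical to implement on the postgres_fdw side, something like the fragment below; note that "prepared transaction ... does not exist" should surface as SQLSTATE 42704 (undefined_object), though that is worth verifying before relying on it, and the gid and mark_fdw_xact_resolved() are made up:

    res = PQexec(conn, "COMMIT PREPARED 'fx_1234_16395'");
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

        if (sqlstate && strcmp(sqlstate, "42704") == 0)
        {
            /* already resolved by someone else: treat as success */
            ereport(LOG,
                    (errmsg("foreign transaction \"fx_1234_16395\" does not exist on the foreign server, assuming it is already resolved")));
            mark_fdw_xact_resolved(entry);      /* hypothetical */
        }
        else
        {
            /* transient failure: keep the entry and retry later */
        }
    }
    PQclear(res);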
On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Based on the review comments from Robert, I'm planning to make a big
>> change to the architecture of this patch, so that a backend process
>> works together with a dedicated background worker that is responsible
>> for resolving foreign transactions. For the users of this feature, it
>> will be almost the same as what this patch has been doing, except for
>> adding a new GUC parameter that controls the number of resolver
>> processes to launch. That is, we can have multiple resolver processes
>> to keep latency down.
>
> Multiple resolver processes are useful but get a bit complicated. For
> example, if process 1 has a connection open to foreign server A and
> process 2 does not, and a request arrives that needs to be handled on
> foreign server A, what happens? If process 1 is already busy doing
> something else, probably we want process 2 to try to open a new
> connection to foreign server A and handle the request. But if processes
> 1 and 2 are both idle, ideally we'd like 1 to get that request rather
> than 2. That seems a bit difficult to get working, though. Maybe we
> should just ignore such considerations in the first version.

I understood. I'll keep it simple in the first version.

>> * Resolver processes
>> 1. Fetch a PGPROC entry from the shmem queue and get its XID (say, XID-a).
>> 2. Get the fdw_xact_state entry from the shmem hash by XID-a.
>> 3. Iterate over the fdw_xact entries using the index, and resolve the foreign
>> transactions.
>> 3-a. If even one foreign transaction fails to resolve, raise an error.
>> 4. Change the waiting backend's state to FDWXACT_COMPLETED and release it.
>
> Comments:
>
> - Note that any error we raise here won't reach the user; this is a
> background process. We don't want to get into a loop where we just
> error out repeatedly forever -- at least not if there's any other
> reasonable choice.

Thank you for the comments.

Agreed.

> - I suggest that we ought to track the status for each XID separately
> on each server rather than just track the XID status overall. That
> way, if transaction resolution fails on one server, we don't keep
> trying to reconnect to the others.

Agreed. In the current patch we manage fdw_xact entries that track the status for each XID separately on each server. I'm going to use the same mechanism. The resolver process gets a target XID from the shmem queue and gets all the fdw_xact entries associated with the XID from the fdw_xact array in shmem. But since scanning the whole fdw_xact array could be slow, because the number of entries could be large (e.g., max_connections * # of foreign servers), I'm considering having a linked list of all the fdw_xact entries associated with the same XID, and a shmem hash pointing to the first fdw_xact entry of the linked list for each XID. That way, we can find the target fdw_xact entries in the array in O(1).

> - If we go to resolve a remote transaction and find that no such
> remote transaction exists, what should we do? I'm inclined to think
> that we should regard that as if we had succeeded in resolving the
> transaction. Certainly, if we've retried the server repeatedly, it
> might be that we previously succeeded in resolving the transaction but
> then the network connection was broken before we got the success
> message back from the remote server.
> But even if that's not the
> scenario, I think we should assume that the DBA or some other system
> resolved it and therefore we don't need to do anything further. If we
> assume anything else, then we just go into an infinite error loop,
> which isn't useful behavior. We could log a message, though (for
> example, LOG: unable to resolve foreign transaction ... because it
> does not exist).

Agreed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
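The per-XID index described above might look like this sketch (invented names), threading the list through array indexes rather than pointers so it can live in shared memory; GIDSIZE is the existing constant from access/xact.h:

    /* One entry per (local XID, foreign server) pair, in a shmem array. */
    typedef struct FdwXact
    {
        TransactionId   local_xid;
        Oid             serverid;
        Oid             userid;
        int             status;     /* PREPARING / PREPARED / RESOLVED ... */
        int             next;       /* array index of the next entry for the
                                     * same XID, or -1 at the end of the list */
        char            gid[GIDSIZE];
    } FdwXact;

    /* Shmem hash entry giving O(1) access to an XID's list of entries. */
    typedef struct FdwXactHashEntry
    {
        TransactionId   local_xid;  /* hash key */
        int             head;       /* index of the first FdwXact entry */
    } FdwXactHashEntry;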
On Wed, Sep 27, 2017 at 12:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Based on the review comments from Robert, I'm planning to make a big
>>> change to the architecture of this patch, so that a backend process
>>> works together with a dedicated background worker that is responsible
>>> for resolving foreign transactions. For the users of this feature, it
>>> will be almost the same as what this patch has been doing, except for
>>> adding a new GUC parameter that controls the number of resolver
>>> processes to launch. That is, we can have multiple resolver processes
>>> to keep latency down.
>>
>> Multiple resolver processes are useful but get a bit complicated. For
>> example, if process 1 has a connection open to foreign server A and
>> process 2 does not, and a request arrives that needs to be handled on
>> foreign server A, what happens? If process 1 is already busy doing
>> something else, probably we want process 2 to try to open a new
>> connection to foreign server A and handle the request. But if processes
>> 1 and 2 are both idle, ideally we'd like 1 to get that request rather
>> than 2. That seems a bit difficult to get working, though. Maybe we
>> should just ignore such considerations in the first version.
>
> I understood. I'll keep it simple in the first version.

While a resolver process is useful for resolving transactions later, it seems more efficient to try to resolve the prepared foreign transactions, in the post-commit phase, in the same backend which prepared them, for two reasons: 1. the backend already has a connection to that foreign server; 2. it has just run some commands to completion on that foreign server, so it's highly likely that a COMMIT PREPARED would succeed too. If we let a resolver process do that, we will spend time in 1. signalling the resolver process, 2. setting up a connection to the foreign server, and 3. by the time the resolver process tries to resolve the prepared transaction the foreign server may have become unavailable, thus delaying the resolution.

That said, I agree that the post-commit phase doesn't have a transaction of its own, and thus any catalog lookup or error reporting is not possible. We will need some different approach here, which may not be straightforward. So we may need to delay this optimization to v2. I think we have discussed this before, but I can't find the mail offhand.

>
>>> * Resolver processes
>>> 1. Fetch a PGPROC entry from the shmem queue and get its XID (say, XID-a).
>>> 2. Get the fdw_xact_state entry from the shmem hash by XID-a.
>>> 3. Iterate over the fdw_xact entries using the index, and resolve the foreign
>>> transactions.
>>> 3-a. If even one foreign transaction fails to resolve, raise an error.
>>> 4. Change the waiting backend's state to FDWXACT_COMPLETED and release it.
>>
>> Comments:
>>
>> - Note that any error we raise here won't reach the user; this is a
>> background process. We don't want to get into a loop where we just
>> error out repeatedly forever -- at least not if there's any other
>> reasonable choice.
>
> Thank you for the comments.
>
> Agreed.

We should probably log an error message in the server log, so that DBAs are aware of such a failure. Is that something you are considering doing?

>
>> - I suggest that we ought to track the status for each XID separately
>> on each server rather than just track the XID status overall.
That >> way, if transaction resolution fails on one server, we don't keep >> trying to reconnect to the others. > > Agreed. In the current patch we manage fdw_xact entries that track the > status for each XID separately on each server. I'm going to use the > same mechanism. The resolver process get an target XID from shmem > queue and get the all fdw_xact entries associated with the XID from > the fdw_xact array in shmem. But since the scanning the whole fdw_xact > entries could be slow because the number of entry of fdw_xact array > could be a large number (e.g, max_connections * # of foreign servers), > I'm considering to have a linked list of the all fdw_xact entries > associated with same XID, and to have a shmem hash pointing to the > first fdw_xact entry of the linked lists for each XID. That way, we > can find the target fdw_xact entries from the array in O(1). > If we want to do something like this, would it be useful to use a data structure similar to what is used for maintaining subtrasactions? Just a thought. >> - If we go to resolve a remote transaction and find that no such >> remote transaction exists, what should we do? I'm inclined to think >> that we should regard that as if we had succeeded in resolving the >> transaction. Certainly, if we've retried the server repeatedly, it >> might be that we previously succeeded in resolving the transaction but >> then the network connection was broken before we got the success >> message back from the remote server. But even if that's not the >> scenario, I think we should assume that the DBA or some other system >> resolved it and therefore we don't need to do anything further. If we >> assume anything else, then we just go into an infinite error loop, >> which isn't useful behavior. We could log a message, though (for >> example, LOG: unable to resolve foreign transaction ... because it >> does not exist). > > Agreed. > Yes. I think the current patch takes care of this, except probably the error message. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
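To visualize the O(1) lookup structure being discussed, here is a rough sketch of the shared-memory layout; every name is invented, only the idea matters, while the PostgreSQL types and hash_search() are real.

    /* Every name here is invented; a sketch of the layout under discussion. */
    #include "postgres.h"
    #include "utils/hsearch.h"

    typedef struct FdwXact
    {
        TransactionId local_xid;    /* local transaction this entry belongs to */
        Oid           serverid;     /* foreign server */
        Oid           userid;       /* user mapping used for the connection */
        char          gid[200];     /* identifier used in PREPARE TRANSACTION */
        int           status;       /* preparing / prepared / resolved ... */
        int           next;         /* next entry with the same XID, or -1 */
    } FdwXact;

    /* Shmem hash entry, keyed by XID: head of that XID's linked list. */
    typedef struct FdwXactListEnt
    {
        TransactionId local_xid;    /* hash key */
        int           first;        /* index into the fdw_xact array */
    } FdwXactListEnt;

    /*
     * Resolve everything belonging to one XID without scanning the whole
     * array, which can hold up to max_connections * #servers entries.
     */
    static void
    resolve_xid(HTAB *fdwxact_hash, FdwXact *fdwxact_array, TransactionId xid)
    {
        FdwXactListEnt *ent = hash_search(fdwxact_hash, &xid, HASH_FIND, NULL);
        int             i;

        for (i = ent ? ent->first : -1; i != -1; i = fdwxact_array[i].next)
        {
            /* COMMIT PREPARED / ROLLBACK PREPARED fdwxact_array[i].gid */
        }
    }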
> On 26 Sep 2017, at 12:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Based on the review comment from Robert, I'm planning to do the big > change to the architecture of this patch so that a backend process > work together with a dedicated background worker that is responsible > for resolving the foreign transactions. For what it's worth, I rebased the latest patch onto current master. As far as I understand, the resolver architecture is planned to change, so is it okay to review only the code that is intended for non-faulty scenarios? Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Sep 27, 2017 at 4:05 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Sep 27, 2017 at 12:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> Based on the review comment from Robert, I'm planning to do the big >>>> change to the architecture of this patch so that a backend process >>>> work together with a dedicated background worker that is responsible >>>> for resolving the foreign transactions. For the usage of this feature, >>>> it will be almost the same as what this patch has been doing except >>>> for adding a new GUC paramter that controls the number of resovler >>>> process launch. That is, we can have multiple resolver process to keep >>>> latency down. >>> >>> Multiple resolver processes is useful but gets a bit complicated. For >>> example, if process 1 has a connection open to foreign server A and >>> process 2 does not, and a request arrives that needs to be handled on >>> foreign server A, what happens? If process 1 is already busy doing >>> something else, probably we want process 2 to try to open a new >>> connection to foreign server A and handle the request. But if process >>> 1 and 2 are both idle, ideally we'd like 1 to get that request rather >>> than 2. That seems a bit difficult to get working though. Maybe we >>> should just ignore such considerations in the first version. >> >> I understood. I keep it simple in the first version. > > While a resolver process is useful for resolving transaction later, it > seems performance effective to try to resolve the prepared foreign > transaction, in post-commit phase, in the same backend which prepared > those for two reasons 1. the backend already has a connection to that > foreign server 2. it has just run some commands to completion on that > foreign server, so it's highly likely that a COMMIT PREPARED would > succeed too. If we let a resolver process do that, we will spend time > in 1. signalling resolver process 2. setting up a connection to the > foreign server and 3. by the time resolver process tries to resolve > the prepared transaction the foreign server may become unavailable, > thus delaying the resolution. I think that making a resolver process have connection caches to each foreign server for a while can reduce the overhead of connection to foreign servers. These connections will be invalidated by DDLs. Also, most of the time we spend to commit a distributed transaction is the interaction between the coordinator and foreign servers using two-phase commit protocal. So I guess the time in signalling to a resolver process would not be a big overhead. > Said that, I agree that post-commit phase doesn't have a transaction > of itself, and thus any catalog lookup, error reporting is not > possible. We will need some different approach here, which may not be > straight forward. So, we may need to delay this optimization for v2. I > think we have discussed this before, but I don't find a mail off-hand. > >> >>>> * Resovler processes >>>> 1. Fetch PGPROC entry from the shmem queue and get its XID (say, XID-a). >>>> 2. Get the fdw_xact_state entry from shmem hash by XID-a. >>>> 3. Iterate fdw_xact entries using the index, and resolve the foreign >>>> transactions. >>>> 3-a. If even one foreign transaction failed to resolve, raise an error. >>>> 4. Change the waiting backend state to FDWXACT_COMPLETED and release it. 
>>> >>> Comments: >>> >>> - Note that any error we raise here won't reach the user; this is a >>> background process. We don't want to get into a loop where we just >>> error out repeatedly forever -- at least not if there's any other >>> reasonable choice. >> >> Thank you for the comments. >> >> Agreed. > > We should probably log an error message in the server log, so that > DBAs are aware of such a failure. Is that something you are > considering to do? Yes, a resolver process logs an error message in that case. > >> >>> - I suggest that we ought to track the status for each XID separately >>> on each server rather than just track the XID status overall. That >>> way, if transaction resolution fails on one server, we don't keep >>> trying to reconnect to the others. >> >> Agreed. In the current patch we manage fdw_xact entries that track the >> status for each XID separately on each server. I'm going to use the >> same mechanism. The resolver process get an target XID from shmem >> queue and get the all fdw_xact entries associated with the XID from >> the fdw_xact array in shmem. But since the scanning the whole fdw_xact >> entries could be slow because the number of entry of fdw_xact array >> could be a large number (e.g, max_connections * # of foreign servers), >> I'm considering to have a linked list of the all fdw_xact entries >> associated with same XID, and to have a shmem hash pointing to the >> first fdw_xact entry of the linked lists for each XID. That way, we >> can find the target fdw_xact entries from the array in O(1). >> > > If we want to do something like this, would it be useful to use a data > structure similar to what is used for maintaining subtrasactions? Just > a thought. Thank you for the advise, I'll consider that. But what I want to do is just grouping the fdw_xact entries by XID and fetching the group of fdw_xact in O(1) so we might not need to have the group as using a stack like that is used for maintaining subtransactions. > >>> - If we go to resolve a remote transaction and find that no such >>> remote transaction exists, what should we do? I'm inclined to think >>> that we should regard that as if we had succeeded in resolving the >>> transaction. Certainly, if we've retried the server repeatedly, it >>> might be that we previously succeeded in resolving the transaction but >>> then the network connection was broken before we got the success >>> message back from the remote server. But even if that's not the >>> scenario, I think we should assume that the DBA or some other system >>> resolved it and therefore we don't need to do anything further. If we >>> assume anything else, then we just go into an infinite error loop, >>> which isn't useful behavior. We could log a message, though (for >>> example, LOG: unable to resolve foreign transaction ... because it >>> does not exist). >> >> Agreed. >> > > Yes. I think the current patch takes care of this, except probably the > error message. > Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
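On the connection-cache idea above, the resolver could plausibly reuse the same invalidation machinery that postgres_fdw's per-backend cache in connection.c relies on, namely syscache callbacks. A sketch, with the entry layout and function names invented:

    /*
     * Sketch: resolver-side connection cache with DDL invalidation,
     * modeled on what postgres_fdw already does for backends.
     */
    #include "postgres.h"
    #include "libpq-fe.h"
    #include "utils/hsearch.h"
    #include "utils/inval.h"
    #include "utils/syscache.h"

    typedef struct ConnCacheEnt
    {
        Oid      serverid;      /* hash key: foreign server OID */
        PGconn  *conn;          /* open connection, or NULL */
        bool     invalidated;   /* reconnect before next use? */
    } ConnCacheEnt;

    static HTAB *ConnCache;     /* created with hash_create(), elided */

    static void
    conncache_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
    {
        HASH_SEQ_STATUS scan;
        ConnCacheEnt   *ent;

        /* Mark everything stale; reconnect lazily on next use. */
        hash_seq_init(&scan, ConnCache);
        while ((ent = hash_seq_search(&scan)) != NULL)
            ent->invalidated = true;
    }

    static void
    resolver_register_inval_callbacks(void)
    {
        CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
                                      conncache_inval_callback, (Datum) 0);
        CacheRegisterSyscacheCallback(USERMAPPINGOID,
                                      conncache_inval_callback, (Datum) 0);
    }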
On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I think that making a resolver process have connection caches to each > foreign server for a while can reduce the overhead of connection to > foreign servers. These connections will be invalidated by DDLs. Also, > most of the time we spend to commit a distributed transaction is the > interaction between the coordinator and foreign servers using > two-phase commit protocal. So I guess the time in signalling to a > resolver process would not be a big overhead. I agree. Also, in the future, we might try to allow connections to be shared across backends. I did some research on this a number of years ago and found that every operating system I investigated had some way of passing a file descriptor from one process to another -- so a shared connection cache might be possible. Also, we might port the whole backend to use threads, and then this problem goes away. But I don't have time to write that patch this week. :-) It's possible that we might find that neither of the above approaches are practical and that the performance benefits of resolving the transaction from the original connection are large enough that we want to try to make it work anyhow. However, I think we can postpone that work to a future time. Any general solution to this problem at least needs to be ABLE to resolve transactions at a later time from a different session, so let's get that working first, and then see what else we want to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
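For reference, the descriptor-passing facility Robert mentions is the SCM_RIGHTS ancillary-data mechanism on Unix-domain sockets (Windows offers WSADuplicateSocket instead). A minimal sketch of the sending side:

    /*
     * Minimal sketch of handing an open descriptor to another process
     * via a Unix-domain socket (POSIX SCM_RIGHTS).
     */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int
    send_fd(int channel, int fd_to_send)
    {
        struct msghdr   msg = {0};
        struct iovec    iov;
        char            payload = 'F';      /* must carry at least one byte */
        char            cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;

        iov.iov_base = &payload;
        iov.iov_len = 1;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;       /* "pass these descriptors" */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(channel, &msg, 0);   /* receiver side uses recvmsg() */
    }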
On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I think that making a resolver process have connection caches to each >> foreign server for a while can reduce the overhead of connection to >> foreign servers. These connections will be invalidated by DDLs. Also, >> most of the time we spend to commit a distributed transaction is the >> interaction between the coordinator and foreign servers using >> two-phase commit protocal. So I guess the time in signalling to a >> resolver process would not be a big overhead. > > I agree. Also, in the future, we might try to allow connections to be > shared across backends. I did some research on this a number of years > ago and found that every operating system I investigated had some way > of passing a file descriptor from one process to another -- so a > shared connection cache might be possible. That sounds like a good idea. > Also, we might port the whole backend to use threads, and then this > problem goes way. But I don't have time to write that patch this > week. :-) > > It's possible that we might find that neither of the above approaches > are practical and that the performance benefits of resolving the > transaction from the original connection are large enough that we want > to try to make it work anyhow. However, I think we can postpone that > work to a future time. Any general solution to this problem at least > needs to be ABLE to resolve transactions at a later time from a > different session, so let's get that working first, and then see what > else we want to do. > I understood and agreed. I'll post the first version of the new-design patch to the next CF. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> On 02 Oct 2017, at 08:31, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> I think that making a resolver process have connection caches to each >>> foreign server for a while can reduce the overhead of connection to >>> foreign servers. These connections will be invalidated by DDLs. Also, >>> most of the time we spend to commit a distributed transaction is the >>> interaction between the coordinator and foreign servers using >>> two-phase commit protocal. So I guess the time in signalling to a >>> resolver process would not be a big overhead. >> >> I agree. Also, in the future, we might try to allow connections to be >> shared across backends. I did some research on this a number of years >> ago and found that every operating system I investigated had some way >> of passing a file descriptor from one process to another -- so a >> shared connection cache might be possible. > > It sounds good idea. > >> Also, we might port the whole backend to use threads, and then this >> problem goes way. But I don't have time to write that patch this >> week. :-) >> >> It's possible that we might find that neither of the above approaches >> are practical and that the performance benefits of resolving the >> transaction from the original connection are large enough that we want >> to try to make it work anyhow. However, I think we can postpone that >> work to a future time. Any general solution to this problem at least >> needs to be ABLE to resolve transactions at a later time from a >> different session, so let's get that working first, and then see what >> else we want to do. > > I understood and agreed. I'll post the first version patch of new > design to next CF. Closing this patch with Returned with feedback in this commitfest, looking forward to a new version in an upcoming commitfest. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Sep 29, 2017 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > It's possible that we might find that neither of the above approaches > are practical and that the performance benefits of resolving the > transaction from the original connection are large enough that we want > to try to make it work anyhow. However, I think we can postpone that > work to a future time. Any general solution to this problem at least > needs to be ABLE to resolve transactions at a later time from a > different session, so let's get that working first, and then see what > else we want to do. > +1. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Oct 2, 2017 at 3:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> I think that making a resolver process have connection caches to each >>> foreign server for a while can reduce the overhead of connection to >>> foreign servers. These connections will be invalidated by DDLs. Also, >>> most of the time we spend to commit a distributed transaction is the >>> interaction between the coordinator and foreign servers using >>> two-phase commit protocal. So I guess the time in signalling to a >>> resolver process would not be a big overhead. >> >> I agree. Also, in the future, we might try to allow connections to be >> shared across backends. I did some research on this a number of years >> ago and found that every operating system I investigated had some way >> of passing a file descriptor from one process to another -- so a >> shared connection cache might be possible. > > It sounds good idea. > >> Also, we might port the whole backend to use threads, and then this >> problem goes way. But I don't have time to write that patch this >> week. :-) >> >> It's possible that we might find that neither of the above approaches >> are practical and that the performance benefits of resolving the >> transaction from the original connection are large enough that we want >> to try to make it work anyhow. However, I think we can postpone that >> work to a future time. Any general solution to this problem at least >> needs to be ABLE to resolve transactions at a later time from a >> different session, so let's get that working first, and then see what >> else we want to do. >> > > I understood and agreed. I'll post the first version patch of new > design to next CF. > Attached latest version patch. I've heavily changed the patch since previous one. The most part I modified is the resolving foreign transaction and handling of dangling transactions. The part of management of fdwxact entries is almost same as the previous patch. Foreign Transaction Resolver ====================== I introduced a new background worker called "foreign transaction resolver" which is responsible for resolving the transaction prepared on foreign servers. The foreign transaction resolver process is launched by backend processes when commit/rollback transaction. And it periodically resolves the queued transactions on a database as long as the queue is not empty. If the queue has been empty for the certain time specified by foreign_transaction_resolver_time GUC parameter, it exits. It means that the backend doesn't launch a new resolver process if the resolver process is already working. In this case, the backend process just adds the entry to the queue on shared memory and wake it up. The maximum number of resolver process we can launch is controlled by max_foreign_transaction_resolvers. So we recommends to set larger max_foreign_transaction_resolvers value than the number of databases. The resolver process also tries to resolve dangling transaction as well in a cycle. Processing Sequence ================= I've changed the processing sequence of resolving foreign transaction so that the second phase of two-phase commit protocol (COMMIT/ROLLBACK prepared) is executed by a resolver process, not by backend process. The basic processing sequence is following; * Backend process 1. 
In the pre-commit phase, the backend process saves fdwxact entries, and then prepares the transaction on all foreign servers that can execute the two-phase commit protocol. 2. Local commit. 3. Enqueue itself to the shmem queue and change its status to WAITING 4. Launch or wake up a resolver process and wait * Resolver process 1. Dequeue the waiting process from the shmem queue 2. Collect the fdwxact entries that are associated with the waiting process. 3. Resolve the foreign transactions 4. Release the waiting process 5. Wake up and restart This is still in the design phase and I'm sure there is room for improvement and for more careful handling of corner cases, but I'd like to share the current status of the patch. The patch includes regression tests but does not yet include full documentation. Feedback and comments are very welcome. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
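In code form, the backend-side sequence above might look roughly like the following. Every FdwXact* function is a made-up placeholder standing in for whatever the patch actually names these steps; foreach(), GetTopTransactionId() and MyProc are the usual PostgreSQL primitives.

    /*
     * Sketch of the backend-side commit sequence; all FdwXact* calls are
     * invented placeholders, the rest is the normal PostgreSQL API.
     */
    #include "postgres.h"
    #include "access/xact.h"
    #include "nodes/pg_list.h"
    #include "storage/proc.h"

    static void
    CommitDistributedTransaction(List *modified_servers)
    {
        TransactionId xid = GetTopTransactionId();
        ListCell   *lc;

        /* 1. Persist an fdwxact entry, then PREPARE, per 2PC-capable server. */
        foreach(lc, modified_servers)
        {
            Oid serverid = lfirst_oid(lc);

            FdwXactInsertEntry(xid, serverid);          /* durable first */
            FdwXactPrepareForeignTransaction(xid, serverid);
        }

        /* 2. The local transaction commits via the usual machinery here. */

        /* 3. Queue ourselves for resolution, status = WAITING. */
        FdwXactQueueInsert(MyProc);

        /* 4. Poke (or launch) a resolver, then sleep on our latch until it
         *    has issued COMMIT PREPARED everywhere and releases us. */
        FdwXactLaunchOrWakeupResolver();
        FdwXactWaitForResolution();
    }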
On Wed, Oct 25, 2017 at 3:15 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Foreign Transaction Resolver > ====================== > I introduced a new background worker called "foreign transaction > resolver" which is responsible for resolving the transaction prepared > on foreign servers. The foreign transaction resolver process is > launched by backend processes when commit/rollback transaction. And it > periodically resolves the queued transactions on a database as long as > the queue is not empty. If the queue has been empty for the certain > time specified by foreign_transaction_resolver_time GUC parameter, it > exits. It means that the backend doesn't launch a new resolver process > if the resolver process is already working. In this case, the backend > process just adds the entry to the queue on shared memory and wake it > up. The maximum number of resolver process we can launch is controlled > by max_foreign_transaction_resolvers. So we recommends to set larger > max_foreign_transaction_resolvers value than the number of databases. > The resolver process also tries to resolve dangling transaction as > well in a cycle. > > Processing Sequence > ================= > I've changed the processing sequence of resolving foreign transaction > so that the second phase of two-phase commit protocol (COMMIT/ROLLBACK > prepared) is executed by a resolver process, not by backend process. > The basic processing sequence is following; > > * Backend process > 1. In pre-commit phase, the backend process saves fdwxact entries, and > then prepares transaction on all foreign servers that can execute > two-phase commit protocol. > 2. Local commit. > 3. Enqueue itself to the shmem queue and change its status to WAITING > 4. launch or wakeup a resolver process and wait > > * Resolver process > 1. Dequeue the waiting process from shmem qeue > 2. Collect the fdwxact entries that are associated with the waiting process. > 3. Resolve foreign transactoins > 4. Release the waiting process Why do we want the the backend to linger behind, once it has added its foreign transaction entries in the shared memory and informed resolver about it? The foreign connections may take their own time and even after that there is no guarantee that the foreign transactions will be resolved in case the foreign server is not available. So, why to make the backend wait? > > 5. Wake up and restart > > This is still under the design phase and I'm sure that there is room > for improvement and consider more sensitive behaviour but I'd like to > share the current status of the patch. The patch includes regression > tests but not includes fully documentation. Any background worker, backend should be child of the postmaster, so we should not let a backend start a resolver process. It should be the job of the postmaster. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Oct 26, 2017 at 2:36 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Wed, Oct 25, 2017 at 3:15 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> Foreign Transaction Resolver >> ====================== >> I introduced a new background worker called "foreign transaction >> resolver" which is responsible for resolving the transaction prepared >> on foreign servers. The foreign transaction resolver process is >> launched by backend processes when commit/rollback transaction. And it >> periodically resolves the queued transactions on a database as long as >> the queue is not empty. If the queue has been empty for the certain >> time specified by foreign_transaction_resolver_time GUC parameter, it >> exits. It means that the backend doesn't launch a new resolver process >> if the resolver process is already working. In this case, the backend >> process just adds the entry to the queue on shared memory and wake it >> up. The maximum number of resolver process we can launch is controlled >> by max_foreign_transaction_resolvers. So we recommends to set larger >> max_foreign_transaction_resolvers value than the number of databases. >> The resolver process also tries to resolve dangling transaction as >> well in a cycle. >> >> Processing Sequence >> ================= >> I've changed the processing sequence of resolving foreign transaction >> so that the second phase of two-phase commit protocol (COMMIT/ROLLBACK >> prepared) is executed by a resolver process, not by backend process. >> The basic processing sequence is following; >> >> * Backend process >> 1. In pre-commit phase, the backend process saves fdwxact entries, and >> then prepares transaction on all foreign servers that can execute >> two-phase commit protocol. >> 2. Local commit. >> 3. Enqueue itself to the shmem queue and change its status to WAITING >> 4. launch or wakeup a resolver process and wait >> >> * Resolver process >> 1. Dequeue the waiting process from shmem qeue >> 2. Collect the fdwxact entries that are associated with the waiting process. >> 3. Resolve foreign transactoins >> 4. Release the waiting process > > Why do we want the the backend to linger behind, once it has added its > foreign transaction entries in the shared memory and informed resolver > about it? The foreign connections may take their own time and even > after that there is no guarantee that the foreign transactions will be > resolved in case the foreign server is not available. So, why to make > the backend wait? Because I don't want to break the current user semantics. that is, currently it's guaranteed that the subsequent reads can see the committed result of previous writes even if the previous transactions were distributed transactions. And it's ensured by writer side. If we can make the reader side ensure it, the backend process don't need to wait for the resolver process. The waiting backend process are released by resolver process after the resolver process tried to resolve foreign transactions. Even if resolver process failed to either connect to foreign server or to resolve foreign transaction the backend process will be released and the foreign transactions are leaved as dangling transaction in that case, which are processed later. Also if resolver process takes a long time to resolve foreign transactions for whatever reason the user can cancel it by Ctl-c anytime. >> >> 5. 
Wake up and restart >> >> This is still under the design phase and I'm sure that there is room >> for improvement and consider more sensitive behaviour but I'd like to >> share the current status of the patch. The patch includes regression >> tests but not includes fully documentation. > > Any background worker, backend should be child of the postmaster, so > we should not let a backend start a resolver process. It should be the > job of the postmaster. > Of course I won't. I used the phrase "the backend process launches the resolver process" to make the explanation easier; sorry for the confusion. The backend process calls the RegisterDynamicBackgroundWorker() function to request a resolver process, so it is actually launched by the postmaster. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
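For reference, the registration that makes the postmaster do the launching looks roughly like this; the worker entry-point name is assumed, the rest is the standard bgworker API:

    /*
     * Sketch: a backend requests a resolver; the postmaster performs the
     * actual fork, so the worker is the postmaster's child.  The entry
     * point name FdwXactResolverMain is assumed, not taken from the patch.
     */
    #include "postgres.h"
    #include "miscadmin.h"
    #include "postmaster/bgworker.h"

    static bool
    launch_fdwxact_resolver(Oid dbid)
    {
        BackgroundWorker        worker;
        BackgroundWorkerHandle *handle;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                           BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver");
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "FdwXactResolverMain");
        worker.bgw_main_arg = ObjectIdGetDatum(dbid);   /* database to serve */
        worker.bgw_notify_pid = MyProcPid;              /* notify us on start */

        return RegisterDynamicBackgroundWorker(&worker, &handle);
    }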
On Thu, Oct 26, 2017 at 4:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Why do we want the the backend to linger behind, once it has added its >> foreign transaction entries in the shared memory and informed resolver >> about it? The foreign connections may take their own time and even >> after that there is no guarantee that the foreign transactions will be >> resolved in case the foreign server is not available. So, why to make >> the backend wait? > > Because I don't want to break the current user semantics. that is, > currently it's guaranteed that the subsequent reads can see the > committed result of previous writes even if the previous transactions > were distributed transactions. Right, this is very important, and having the backend wait for the resolver(s) is, I think, the right way to implement it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Because I don't want to break the current user semantics. that is, > currently it's guaranteed that the subsequent reads can see the > committed result of previous writes even if the previous transactions > were distributed transactions. And it's ensured by writer side. If we > can make the reader side ensure it, the backend process don't need to > wait for the resolver process. > > The waiting backend process are released by resolver process after the > resolver process tried to resolve foreign transactions. Even if > resolver process failed to either connect to foreign server or to > resolve foreign transaction the backend process will be released and > the foreign transactions are leaved as dangling transaction in that > case, which are processed later. Also if resolver process takes a long > time to resolve foreign transactions for whatever reason the user can > cancel it by Ctl-c anytime. > So, there's no guarantee that the next command issued from the connection *will* see the committed data, since the foreign transaction might not have committed because of a network glitch (say). If we go this route of making backends wait for the resolver to resolve the foreign transaction, we will have to add complexity to make sure that the waiting backends are woken up in problematic events like a crash of the resolver process OR if the resolver process hangs in a connection to a foreign server etc. I am not sure that the complexity is worth the half-guarantee. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> Because I don't want to break the current user semantics. that is, >> currently it's guaranteed that the subsequent reads can see the >> committed result of previous writes even if the previous transactions >> were distributed transactions. And it's ensured by writer side. If we >> can make the reader side ensure it, the backend process don't need to >> wait for the resolver process. >> >> The waiting backend process are released by resolver process after the >> resolver process tried to resolve foreign transactions. Even if >> resolver process failed to either connect to foreign server or to >> resolve foreign transaction the backend process will be released and >> the foreign transactions are leaved as dangling transaction in that >> case, which are processed later. Also if resolver process takes a long >> time to resolve foreign transactions for whatever reason the user can >> cancel it by Ctl-c anytime. >> > > So, there's no guarantee that the next command issued from the > connection *will* see the committed data, since the foreign > transaction might not have committed because of a network glitch > (say). If we go this route of making backends wait for resolver to > resolve the foreign transaction, we will have add complexity to make > sure that the waiting backends are woken up in problematic events like > crash of the resolver process OR if the resolver process hangs in a > connection to a foreign server etc. I am not sure that the complexity > is worth the half-guarantee. > Hmm, maybe I was wrong. I now think that the waiting backends can be woken up only in the following cases: - The resolver process succeeded in resolving all foreign transactions. - The user cancelled (e.g., Ctrl-C). - The resolver process failed to resolve a foreign transaction because there is no such prepared transaction on the foreign server. In other cases the resolver process should not release the waiters. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: > > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> > >> Because I don't want to break the current user semantics. that is, > >> currently it's guaranteed that the subsequent reads can see the > >> committed result of previous writes even if the previous transactions > >> were distributed transactions. And it's ensured by writer side. If we > >> can make the reader side ensure it, the backend process don't need to > >> wait for the resolver process. > >> > >> The waiting backend process are released by resolver process after the > >> resolver process tried to resolve foreign transactions. Even if > >> resolver process failed to either connect to foreign server or to > >> resolve foreign transaction the backend process will be released and > >> the foreign transactions are leaved as dangling transaction in that > >> case, which are processed later. Also if resolver process takes a long > >> time to resolve foreign transactions for whatever reason the user can > >> cancel it by Ctl-c anytime. > >> > > > > So, there's no guarantee that the next command issued from the > > connection *will* see the committed data, since the foreign > > transaction might not have committed because of a network glitch > > (say). If we go this route of making backends wait for resolver to > > resolve the foreign transaction, we will have add complexity to make > > sure that the waiting backends are woken up in problematic events like > > crash of the resolver process OR if the resolver process hangs in a > > connection to a foreign server etc. I am not sure that the complexity > > is worth the half-guarantee. > > > > Hmm, maybe I was wrong. I now think that the waiting backends can be > woken up only in following cases; > - The resolver process succeeded to resolve all foreign transactions. > - The user did the cancel (e.g. ctl-c). > - The resolver process failed to resolve foreign transaction for a > reason of there is no such prepared transaction on foreign server. > > In other cases the resolver process should not release the waiters. I'm not sure I see consensus here. What Ashutosh says seems to be: "Special effort is needed to ensure that backend does not keep waiting if the resolver can't finish it's work in forseable future. But this effort is not worth because by waking the backend up you might prevent the next transaction from seeing the changes the previous one tried to make." On the other hand, your last comments indicate that you try to be even more stringent in letting the backend wait. However even this stringent approach does not guarantee that the next transaction will see the data changes made by the previous one. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote: > Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat >> <ashutosh.bapat@enterprisedb.com> wrote: >> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> >> >> Because I don't want to break the current user semantics. that is, >> >> currently it's guaranteed that the subsequent reads can see the >> >> committed result of previous writes even if the previous transactions >> >> were distributed transactions. And it's ensured by writer side. If we >> >> can make the reader side ensure it, the backend process don't need to >> >> wait for the resolver process. >> >> >> >> The waiting backend process are released by resolver process after the >> >> resolver process tried to resolve foreign transactions. Even if >> >> resolver process failed to either connect to foreign server or to >> >> resolve foreign transaction the backend process will be released and >> >> the foreign transactions are leaved as dangling transaction in that >> >> case, which are processed later. Also if resolver process takes a long >> >> time to resolve foreign transactions for whatever reason the user can >> >> cancel it by Ctl-c anytime. >> >> >> > >> > So, there's no guarantee that the next command issued from the >> > connection *will* see the committed data, since the foreign >> > transaction might not have committed because of a network glitch >> > (say). If we go this route of making backends wait for resolver to >> > resolve the foreign transaction, we will have add complexity to make >> > sure that the waiting backends are woken up in problematic events like >> > crash of the resolver process OR if the resolver process hangs in a >> > connection to a foreign server etc. I am not sure that the complexity >> > is worth the half-guarantee. >> > >> >> Hmm, maybe I was wrong. I now think that the waiting backends can be >> woken up only in following cases; >> - The resolver process succeeded to resolve all foreign transactions. >> - The user did the cancel (e.g. ctl-c). >> - The resolver process failed to resolve foreign transaction for a >> reason of there is no such prepared transaction on foreign server. >> >> In other cases the resolver process should not release the waiters. > > I'm not sure I see consensus here. What Ashutosh says seems to be: "Special > effort is needed to ensure that backend does not keep waiting if the resolver > can't finish it's work in forseable future. But this effort is not worth > because by waking the backend up you might prevent the next transaction from > seeing the changes the previous one tried to make." > > On the other hand, your last comments indicate that you try to be even more > stringent in letting the backend wait. However even this stringent approach > does not guarantee that the next transaction will see the data changes made by > the previous one. > What I'd like to guarantee is that the subsequent read can see the committed result of previous writes if the transaction involving multiple foreign servers is committed without cancellation by user. In other words, the backend should not be waken up and the resolver should continue to resolve at certain intervals even if the resolver fails to connect to the foreign server or fails to resolve it. This is similar to what synchronous replication guaranteed today. Keeping this semantics is very important for users. 
Note that the reading a consistent result by concurrent reads is a separated problem. The read result including foreign servers can be inconsistent if the such transaction is cancelled or the coordinator server crashes during two-phase commit processing. That is, if there is in-doubt transaction the read result can be inconsistent, even for subsequent reads. But I think this behaviour can be accepted by users. For the resolution of in-doubt transactions, the resolver process will try to resolve such transactions after the coordinator server recovered. On the other hand, for the reading a consistent result on such situation by subsequent reads, for example, we can disallow backends to inquiry SQL to the foreign server if a foreign transaction of the foreign server is remained. For the concurrent reads, the reading an inconsistent result can be happen even without in-doubt transaction because we can read data on a foreign server between PREPARE and COMMIT PREPARED while other foreign servers have committed. I think we should deal with this problem by other feature on reader side, for example, atomic visibility. If we have atomic visibility feature, we also can solve the above problem. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Mon, Nov 27, 2017 at 4:35 PM Masahiko Sawada wrote:
> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user. In
> other words, the backend should not be waken up and the resolver
> should continue to resolve at certain intervals even if the resolver
> fails to connect to the foreign server or fails to resolve it. This is
> similar to what synchronous replication guaranteed today. Keeping this
> semantics is very important for users. Note that the reading a
> consistent result by concurrent reads is a separated problem.
>
> The read result including foreign servers can be inconsistent if the
> such transaction is cancelled or the coordinator server crashes during
> two-phase commit processing. That is, if there is in-doubt transaction
> the read result can be inconsistent, even for subsequent reads. But I
> think this behaviour can be accepted by users. For the resolution of
> in-doubt transactions, the resolver process will try to resolve such
> transactions after the coordinator server recovered. On the other
> hand, for the reading a consistent result on such situation by
> subsequent reads, for example, we can disallow backends to inquiry SQL
> to the foreign server if a foreign transaction of the foreign server
> is remained.
>
> For the concurrent reads, the reading an inconsistent result can be
> happen even without in-doubt transaction because we can read data on a
> foreign server between PREPARE and COMMIT PREPARED while other foreign
> servers have committed. I think we should deal with this problem by
> other feature on reader side, for example, atomic visibility. If we
> have atomic visibility feature, we also can solve the above problem.
+1 to all of that.
...Robert
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote: >> Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat >>> <ashutosh.bapat@enterprisedb.com> wrote: >>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> >> >>> >> Because I don't want to break the current user semantics. that is, >>> >> currently it's guaranteed that the subsequent reads can see the >>> >> committed result of previous writes even if the previous transactions >>> >> were distributed transactions. And it's ensured by writer side. If we >>> >> can make the reader side ensure it, the backend process don't need to >>> >> wait for the resolver process. >>> >> >>> >> The waiting backend process are released by resolver process after the >>> >> resolver process tried to resolve foreign transactions. Even if >>> >> resolver process failed to either connect to foreign server or to >>> >> resolve foreign transaction the backend process will be released and >>> >> the foreign transactions are leaved as dangling transaction in that >>> >> case, which are processed later. Also if resolver process takes a long >>> >> time to resolve foreign transactions for whatever reason the user can >>> >> cancel it by Ctl-c anytime. >>> >> >>> > >>> > So, there's no guarantee that the next command issued from the >>> > connection *will* see the committed data, since the foreign >>> > transaction might not have committed because of a network glitch >>> > (say). If we go this route of making backends wait for resolver to >>> > resolve the foreign transaction, we will have add complexity to make >>> > sure that the waiting backends are woken up in problematic events like >>> > crash of the resolver process OR if the resolver process hangs in a >>> > connection to a foreign server etc. I am not sure that the complexity >>> > is worth the half-guarantee. >>> > >>> >>> Hmm, maybe I was wrong. I now think that the waiting backends can be >>> woken up only in following cases; >>> - The resolver process succeeded to resolve all foreign transactions. >>> - The user did the cancel (e.g. ctl-c). >>> - The resolver process failed to resolve foreign transaction for a >>> reason of there is no such prepared transaction on foreign server. >>> >>> In other cases the resolver process should not release the waiters. >> >> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special >> effort is needed to ensure that backend does not keep waiting if the resolver >> can't finish it's work in forseable future. But this effort is not worth >> because by waking the backend up you might prevent the next transaction from >> seeing the changes the previous one tried to make." >> >> On the other hand, your last comments indicate that you try to be even more >> stringent in letting the backend wait. However even this stringent approach >> does not guarantee that the next transaction will see the data changes made by >> the previous one. >> > > What I'd like to guarantee is that the subsequent read can see the > committed result of previous writes if the transaction involving > multiple foreign servers is committed without cancellation by user. In > other words, the backend should not be waken up and the resolver > should continue to resolve at certain intervals even if the resolver > fails to connect to the foreign server or fails to resolve it. 
This is > similar to what synchronous replication guaranteed today. Keeping this > semantics is very important for users. Note that the reading a > consistent result by concurrent reads is a separated problem. The question I have is how we would deal with a foreign server that is not available for a long duration due to a crash, an extended network outage, etc. An example is the foreign server crashing/getting disconnected after PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain blocked for a much longer duration without the user having any idea of what's going on. Maybe we should add some timeout. > > The read result including foreign servers can be inconsistent if the > such transaction is cancelled or the coordinator server crashes during > two-phase commit processing. That is, if there is in-doubt transaction > the read result can be inconsistent, even for subsequent reads. But I > think this behaviour can be accepted by users. For the resolution of > in-doubt transactions, the resolver process will try to resolve such > transactions after the coordinator server recovered. On the other > hand, for the reading a consistent result on such situation by > subsequent reads, for example, we can disallow backends to inquiry SQL > to the foreign server if a foreign transaction of the foreign server > is remained. +1 for the last sentence. If we do that, we don't need the backend to be blocked by the resolver, since a subsequent read accessing that foreign server would get an error rather than inconsistent data. > > For the concurrent reads, the reading an inconsistent result can be > happen even without in-doubt transaction because we can read data on a > foreign server between PREPARE and COMMIT PREPARED while other foreign > servers have committed. I think we should deal with this problem by > other feature on reader side, for example, atomic visibility. If we > have atomic visibility feature, we also can solve the above problem. > +1. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
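A sketch of the "disallow SQL against a server with unresolved transactions" idea mentioned above, hooked in wherever foreign access starts; the lookup helper is invented, the ereport() machinery is real:

    /*
     * Sketch: refuse new foreign access while an unresolved prepared
     * transaction exists for the server.  FdwXactServerHasDangling() is
     * an invented lookup into the shared fdwxact array.
     */
    #include "postgres.h"

    static void
    check_foreign_server_accessible(Oid serverid)
    {
        if (FdwXactServerHasDangling(serverid))
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("foreign server with OID %u has unresolved prepared transactions",
                            serverid),
                     errhint("Wait for the foreign transaction resolver, or resolve them manually.")));
    }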
Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote: > > Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > >> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat > >> <ashutosh.bapat@enterprisedb.com> wrote: > >> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> >> > >> >> Because I don't want to break the current user semantics. that is, > >> >> currently it's guaranteed that the subsequent reads can see the > >> >> committed result of previous writes even if the previous transactions > >> >> were distributed transactions. And it's ensured by writer side. If we > >> >> can make the reader side ensure it, the backend process don't need to > >> >> wait for the resolver process. > >> >> > >> >> The waiting backend process are released by resolver process after the > >> >> resolver process tried to resolve foreign transactions. Even if > >> >> resolver process failed to either connect to foreign server or to > >> >> resolve foreign transaction the backend process will be released and > >> >> the foreign transactions are leaved as dangling transaction in that > >> >> case, which are processed later. Also if resolver process takes a long > >> >> time to resolve foreign transactions for whatever reason the user can > >> >> cancel it by Ctl-c anytime. > >> >> > >> > > >> > So, there's no guarantee that the next command issued from the > >> > connection *will* see the committed data, since the foreign > >> > transaction might not have committed because of a network glitch > >> > (say). If we go this route of making backends wait for resolver to > >> > resolve the foreign transaction, we will have add complexity to make > >> > sure that the waiting backends are woken up in problematic events like > >> > crash of the resolver process OR if the resolver process hangs in a > >> > connection to a foreign server etc. I am not sure that the complexity > >> > is worth the half-guarantee. > >> > > >> > >> Hmm, maybe I was wrong. I now think that the waiting backends can be > >> woken up only in following cases; > >> - The resolver process succeeded to resolve all foreign transactions. > >> - The user did the cancel (e.g. ctl-c). > >> - The resolver process failed to resolve foreign transaction for a > >> reason of there is no such prepared transaction on foreign server. > >> > >> In other cases the resolver process should not release the waiters. > > > > I'm not sure I see consensus here. What Ashutosh says seems to be: "Special > > effort is needed to ensure that backend does not keep waiting if the resolver > > can't finish it's work in forseable future. But this effort is not worth > > because by waking the backend up you might prevent the next transaction from > > seeing the changes the previous one tried to make." > > > > On the other hand, your last comments indicate that you try to be even more > > stringent in letting the backend wait. However even this stringent approach > > does not guarantee that the next transaction will see the data changes made by > > the previous one. > > > > What I'd like to guarantee is that the subsequent read can see the > committed result of previous writes if the transaction involving > multiple foreign servers is committed without cancellation by user. I missed the point that user should not expect atomicity of the commit to be guaranteed if he has cancelled his request. 
The other things are clear to me, including the fact that atomic commit and atomic visibility will be implemented separately. Thanks. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
On Tue, Nov 28, 2017 at 12:31 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote: >>> Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> >>>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat >>>> <ashutosh.bapat@enterprisedb.com> wrote: >>>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> >> >>>> >> Because I don't want to break the current user semantics. that is, >>>> >> currently it's guaranteed that the subsequent reads can see the >>>> >> committed result of previous writes even if the previous transactions >>>> >> were distributed transactions. And it's ensured by writer side. If we >>>> >> can make the reader side ensure it, the backend process don't need to >>>> >> wait for the resolver process. >>>> >> >>>> >> The waiting backend process are released by resolver process after the >>>> >> resolver process tried to resolve foreign transactions. Even if >>>> >> resolver process failed to either connect to foreign server or to >>>> >> resolve foreign transaction the backend process will be released and >>>> >> the foreign transactions are leaved as dangling transaction in that >>>> >> case, which are processed later. Also if resolver process takes a long >>>> >> time to resolve foreign transactions for whatever reason the user can >>>> >> cancel it by Ctl-c anytime. >>>> >> >>>> > >>>> > So, there's no guarantee that the next command issued from the >>>> > connection *will* see the committed data, since the foreign >>>> > transaction might not have committed because of a network glitch >>>> > (say). If we go this route of making backends wait for resolver to >>>> > resolve the foreign transaction, we will have add complexity to make >>>> > sure that the waiting backends are woken up in problematic events like >>>> > crash of the resolver process OR if the resolver process hangs in a >>>> > connection to a foreign server etc. I am not sure that the complexity >>>> > is worth the half-guarantee. >>>> > >>>> >>>> Hmm, maybe I was wrong. I now think that the waiting backends can be >>>> woken up only in following cases; >>>> - The resolver process succeeded to resolve all foreign transactions. >>>> - The user did the cancel (e.g. ctl-c). >>>> - The resolver process failed to resolve foreign transaction for a >>>> reason of there is no such prepared transaction on foreign server. >>>> >>>> In other cases the resolver process should not release the waiters. >>> >>> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special >>> effort is needed to ensure that backend does not keep waiting if the resolver >>> can't finish it's work in forseable future. But this effort is not worth >>> because by waking the backend up you might prevent the next transaction from >>> seeing the changes the previous one tried to make." >>> >>> On the other hand, your last comments indicate that you try to be even more >>> stringent in letting the backend wait. However even this stringent approach >>> does not guarantee that the next transaction will see the data changes made by >>> the previous one. >>> >> >> What I'd like to guarantee is that the subsequent read can see the >> committed result of previous writes if the transaction involving >> multiple foreign servers is committed without cancellation by user. 
In >> other words, the backend should not be waken up and the resolver >> should continue to resolve at certain intervals even if the resolver >> fails to connect to the foreign server or fails to resolve it. This is >> similar to what synchronous replication guaranteed today. Keeping this >> semantics is very important for users. Note that the reading a >> consistent result by concurrent reads is a separated problem. > > The question I have is how would we deal with a foreign server that is > not available for longer duration due to crash, longer network outage > etc. Example is the foreign server crashed/got disconnected after > PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain > blocked for much longer duration without user having an idea of what's > going on. May be we should add some timeout. After more thought, I agree with adding some timeout. I can imagine there are users who want a timeout, for example, users who cannot accept even a few seconds of latency. If the timeout occurs, the backend unlocks the foreign transactions and breaks out of the loop. The resolver process will continue to resolve foreign transactions at regular intervals. >> >> The read result including foreign servers can be inconsistent if the >> such transaction is cancelled or the coordinator server crashes during >> two-phase commit processing. That is, if there is in-doubt transaction >> the read result can be inconsistent, even for subsequent reads. But I >> think this behaviour can be accepted by users. For the resolution of >> in-doubt transactions, the resolver process will try to resolve such >> transactions after the coordinator server recovered. On the other >> hand, for the reading a consistent result on such situation by >> subsequent reads, for example, we can disallow backends to inquiry SQL >> to the foreign server if a foreign transaction of the foreign server >> is remained. > > +1 for the last sentence. If we do that, we don't need the backend to > be blocked by resolver since a subsequent read accessing that foreign > server would get an error and not inconsistent data. Yeah, however the disadvantage of this is that we manage foreign transactions per foreign server. If a transaction that modified even one table remains as an in-doubt transaction, we cannot issue any SQL that touches that foreign server. Could we raise an error at ExecInitForeignScan()? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
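The wait-with-timeout being discussed would presumably sit on top of WaitLatch(); a rough sketch, where MyFdwXactState and the release protocol are invented while the latch machinery is PostgreSQL's real API:

    /*
     * Sketch of the backend's wait, with the optional timeout under
     * debate.  MyFdwXactState / FDWXACT_COMPLETED are invented names.
     */
    #include "postgres.h"
    #include "miscadmin.h"
    #include "pgstat.h"
    #include "storage/ipc.h"
    #include "storage/latch.h"

    static bool
    FdwXactWaitForResolution(long timeout_ms)
    {
        for (;;)
        {
            int rc;

            if (MyFdwXactState == FDWXACT_COMPLETED)
                return true;                    /* resolver released us */

            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           timeout_ms,
                           PG_WAIT_EXTENSION);  /* a dedicated event would be nicer */
            ResetLatch(MyLatch);

            if (rc & WL_POSTMASTER_DEATH)
                proc_exit(1);
            if (rc & WL_TIMEOUT)
                return false;   /* give up waiting; the resolver keeps retrying */

            CHECK_FOR_INTERRUPTS();             /* honors query cancel (Ctrl-C) */
        }
    }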
On Mon, Dec 11, 2017 at 5:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> The question I have is how we would deal with a foreign server that is >> not available for a longer duration due to a crash, a longer network outage, >> etc. An example is the foreign server crashed/got disconnected after >> PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain >> blocked for a much longer duration without the user having any idea of what's >> going on. Maybe we should add some timeout. > > After more thought, I agree with adding some timeout. I can imagine > there are users who want a timeout, for example those who cannot accept > even a few seconds of latency. If the timeout occurs, the backend unlocks the > foreign transactions and breaks the loop. The resolver process will > continue to resolve foreign transactions at certain intervals. I don't think a timeout is a very good idea. There is no timeout for synchronous replication and the issues here are similar. I will not try to block a patch adding a timeout, but I think it had better be disabled by default and have very clear documentation explaining why it's really dangerous. And this is why: with no timeout, you can count on being able to see the effects of your own previous transactions, unless at some point you sent a query cancel or got disconnected. With a timeout, you may or may not see the effects of your own previous transactions depending on whether or not you hit the timeout, which you have no sure way of knowing. >>> transactions after the coordinator server has recovered. On the other >>> hand, for reading a consistent result in such a situation by >>> subsequent reads, for example, we can disallow backends to issue SQL >>> to the foreign server if a foreign transaction of the foreign server >>> remains. >> >> +1 for the last sentence. If we do that, we don't need the backend to >> be blocked by the resolver, since a subsequent read accessing that foreign >> server would get an error and not inconsistent data. > > Yeah, however the disadvantage of this is that we manage foreign > transactions per foreign server. If a transaction that modified even > one table remains as an in-doubt transaction, we cannot issue any > SQL that touches that foreign server. Can we raise an error at > ExecInitForeignScan()? I really feel strongly we shouldn't complicate the initial patch with this kind of thing. Let's make it enough for this patch to guarantee that either all parts of the transaction commit eventually or they all abort eventually. Ensuring consistent visibility is a different and hard project, and if we try to do that now, this patch is not going to be done any time soon. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
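The analogy to synchronous replication also suggests what the cancel path inside such a wait loop could look like; compare the handling in SyncRepWaitForLSN() in syncrep.c. A sketch of the loop body (again with invented surroundings), instead of a hard timeout:

/* Inside the wait loop: the user sent a cancel while we wait for the resolver. */
if (QueryCancelPending)
{
    QueryCancelPending = false;
    ereport(WARNING,
            (errmsg("canceling wait for resolution of foreign transactions"),
             errdetail("The transaction might not be committed on all foreign servers.")));
    break;
}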
On Wed, Dec 13, 2017 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Dec 11, 2017 at 5:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> The question I have is how we would deal with a foreign server that is >>> not available for a longer duration due to a crash, a longer network outage, >>> etc. An example is the foreign server crashed/got disconnected after >>> PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain >>> blocked for a much longer duration without the user having any idea of what's >>> going on. Maybe we should add some timeout. >> >> After more thought, I agree with adding some timeout. I can imagine >> there are users who want a timeout, for example those who cannot accept >> even a few seconds of latency. If the timeout occurs, the backend unlocks the >> foreign transactions and breaks the loop. The resolver process will >> continue to resolve foreign transactions at certain intervals. > > I don't think a timeout is a very good idea. There is no timeout for > synchronous replication and the issues here are similar. I will not > try to block a patch adding a timeout, but I think it had better be > disabled by default and have very clear documentation explaining why > it's really dangerous. And this is why: with no timeout, you can > count on being able to see the effects of your own previous > transactions, unless at some point you sent a query cancel or got > disconnected. With a timeout, you may or may not see the effects of > your own previous transactions depending on whether or not you hit the > timeout, which you have no sure way of knowing. > >>>> transactions after the coordinator server has recovered. On the other >>>> hand, for reading a consistent result in such a situation by >>>> subsequent reads, for example, we can disallow backends to issue SQL >>>> to the foreign server if a foreign transaction of the foreign server >>>> remains. >>> >>> +1 for the last sentence. If we do that, we don't need the backend to >>> be blocked by the resolver, since a subsequent read accessing that foreign >>> server would get an error and not inconsistent data. >> >> Yeah, however the disadvantage of this is that we manage foreign >> transactions per foreign server. If a transaction that modified even >> one table remains as an in-doubt transaction, we cannot issue any >> SQL that touches that foreign server. Can we raise an error at >> ExecInitForeignScan()? > > I really feel strongly we shouldn't complicate the initial patch with > this kind of thing. Let's make it enough for this patch to guarantee > that either all parts of the transaction commit eventually or they all > abort eventually. Ensuring consistent visibility is a different and > hard project, and if we try to do that now, this patch is not going to > be done any time soon. > Thank you for the suggestion. I was really wondering if we should add a timeout to this feature. It's a common concern that we want to put a timeout on a critical section. But currently we have no such timeout for either synchronous replication or WAL writes. I can imagine there will be users who want a timeout for such cases, but obviously it makes this feature more complex. Anyway, even if we add a timeout to this feature, we can make it a separate patch and feature. So I'd like to keep it simple as a first step. This patch guarantees that the transaction commits or rolls back on all foreign servers, or on none, unless the user cancels. 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Dec 13, 2017 at 10:47 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Wed, Dec 13, 2017 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Dec 11, 2017 at 5:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> The question I have is how we would deal with a foreign server that is >>>> not available for a longer duration due to a crash, a longer network outage, >>>> etc. An example is the foreign server crashed/got disconnected after >>>> PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain >>>> blocked for a much longer duration without the user having any idea of what's >>>> going on. Maybe we should add some timeout. >>> >>> After more thought, I agree with adding some timeout. I can imagine >>> there are users who want a timeout, for example those who cannot accept >>> even a few seconds of latency. If the timeout occurs, the backend unlocks the >>> foreign transactions and breaks the loop. The resolver process will >>> continue to resolve foreign transactions at certain intervals. >> >> I don't think a timeout is a very good idea. There is no timeout for >> synchronous replication and the issues here are similar. I will not >> try to block a patch adding a timeout, but I think it had better be >> disabled by default and have very clear documentation explaining why >> it's really dangerous. And this is why: with no timeout, you can >> count on being able to see the effects of your own previous >> transactions, unless at some point you sent a query cancel or got >> disconnected. With a timeout, you may or may not see the effects of >> your own previous transactions depending on whether or not you hit the >> timeout, which you have no sure way of knowing. >> >>>>> transactions after the coordinator server has recovered. On the other >>>>> hand, for reading a consistent result in such a situation by >>>>> subsequent reads, for example, we can disallow backends to issue SQL >>>>> to the foreign server if a foreign transaction of the foreign server >>>>> remains. >>>> >>>> +1 for the last sentence. If we do that, we don't need the backend to >>>> be blocked by the resolver, since a subsequent read accessing that foreign >>>> server would get an error and not inconsistent data. >>> >>> Yeah, however the disadvantage of this is that we manage foreign >>> transactions per foreign server. If a transaction that modified even >>> one table remains as an in-doubt transaction, we cannot issue any >>> SQL that touches that foreign server. Can we raise an error at >>> ExecInitForeignScan()? >> >> I really feel strongly we shouldn't complicate the initial patch with >> this kind of thing. Let's make it enough for this patch to guarantee >> that either all parts of the transaction commit eventually or they all >> abort eventually. Ensuring consistent visibility is a different and >> hard project, and if we try to do that now, this patch is not going to >> be done any time soon. >> > > Thank you for the suggestion. > > I was really wondering if we should add a timeout to this feature. > It's a common concern that we want to put a timeout on a critical > section. But currently we have no such timeout for either > synchronous replication or WAL writes. I can imagine there will be > users who want a timeout for such cases, but obviously it makes this > feature more complex. Anyway, even if we add a timeout to this feature, > we can make it a separate patch and feature. So I'd like to keep > it simple as a first step. 
> This patch guarantees that the transaction commits or rolls back on all > foreign servers, or on none, unless the user cancels. > > Regards, > I've updated the documentation in the patches and fixed some bugs. I did some failure tests of this feature using a fault simulation tool[1] for PostgreSQL that I created. The 0001 patch adds a mechanism to track writes on the local server. This is required to determine whether we should use 2PC at commit. The 0002 patch is the main part: it adds a distributed transaction manager (currently only for atomic commit), APIs for 2PC, and the foreign transaction resolver process. The 0003 patch makes postgres_fdw support atomic commit using 2PC. Please review the patches. [1] https://github.com/MasahikoSawada/pg_simula Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
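From the patch excerpts quoted later in this thread, the new FDW transaction callbacks appear to take roughly the following shape; the typedef names here are invented for illustration, and the posted patches may differ in detail:

/* Prepare the foreign transaction on one server (first phase of 2PC). */
typedef bool (*PrepareForeignXact_function) (Oid serverid, Oid userid,
                                             Oid umid, const char *fdwxact_id);

/* Commit or abort a foreign transaction that was never prepared. */
typedef bool (*EndForeignXact_function) (Oid serverid, Oid userid,
                                         Oid umid, bool is_commit);

/* COMMIT PREPARED / ROLLBACK PREPARED a previously prepared transaction. */
typedef bool (*ResolveForeignXact_function) (Oid serverid, Oid userid,
                                             Oid umid, bool is_commit,
                                             const char *fdwxact_id);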
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com] > I've updated the documentation in the patches and fixed some bugs. I did some > failure tests of this feature using a fault simulation tool[1] for > PostgreSQL that I created. > > The 0001 patch adds a mechanism to track writes on the local server. This is > required to determine whether we should use 2PC at commit. The 0002 patch is > the main part: it adds a distributed transaction manager (currently only > for atomic commit), APIs for 2PC, and the foreign transaction resolver > process. The 0003 patch makes postgres_fdw support atomic commit using 2PC. > > Please review the patches. I'd like to join the review and testing of this functionality. First, some comments after taking a quick look at 0001: (1) Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. (2) If TransactionDidWrite is necessary, I don't think you need to provide setter functions, because other transaction state variables are accessed globally without getters/setters. And you didn't create a getter function for TransactionDidWrite. (3) heap_multi_insert() doesn't modify TransactionDidWrite. Is it sufficient to just remember heap modifications? Are other modifications on the coordinator node covered, such as TRUNCATE and REINDEX? Questions before looking at 0002 and 0003: Q1: Does this functionality work when combined with XA 2PC transactions? Q2: Does the atomic commit cascade across multiple remote databases? For example: * The local transaction modifies data on remote database 1 via a foreign table. * A trigger fires on remote database 1, which modifies data on remote database 2 via a foreign table. * The local transaction commits. Regards Takayuki Tsunakawa
On Thu, Dec 28, 2017 at 11:40 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: Masahiko Sawada [mailto:sawada.mshk@gmail.com] >> I've updated the documentation in the patches and fixed some bugs. I did some >> failure tests of this feature using a fault simulation tool[1] for >> PostgreSQL that I created. >> >> The 0001 patch adds a mechanism to track writes on the local server. This is >> required to determine whether we should use 2PC at commit. The 0002 patch is >> the main part: it adds a distributed transaction manager (currently only >> for atomic commit), APIs for 2PC, and the foreign transaction resolver >> process. The 0003 patch makes postgres_fdw support atomic commit using 2PC. >> >> Please review the patches. > > I'd like to join the review and testing of this functionality. First, some comments after taking a quick look at 0001: Thank you so much! > (1) > Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if we did any writes on the local node which require atomic commit. Will fix. > > (2) > If TransactionDidWrite is necessary, I don't think you need to provide setter functions, because other transaction state variables are accessed globally without getters/setters. And you didn't create a getter function for TransactionDidWrite. > > (3) > heap_multi_insert() doesn't modify TransactionDidWrite. Is it sufficient to just remember heap modifications? Are other modifications on the coordinator node covered, such as TRUNCATE and REINDEX? I think using (XactLastRecEnd != 0 && markXidCommitted) to check whether we did any writes on the local node would be better. After that change I will be able to deal with all of the above concerns. > > > Questions before looking at 0002 and 0003: > > Q1: Does this functionality work when combined with XA 2PC transactions? All transactions, including the local transaction and the foreign transactions, are prepared at PREPARE. And all transactions are committed/rolled back by the foreign transaction resolver process at COMMIT/ROLLBACK PREPARED. > Q2: Does the atomic commit cascade across multiple remote databases? For example: > * The local transaction modifies data on remote database 1 via a foreign table. > * A trigger fires on remote database 1, which modifies data on remote database 2 via a foreign table. > * The local transaction commits. I've not yet tested more complex failure situations, but as far as I have tested in my environment, cascading atomic commit works. I'll test these cases more deeply. Regards, Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
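For reference, the proposed test could look like the following outside of xact.c. Inside RecordTransactionCommit(), markXidCommitted is simply TransactionIdIsValid(xid) for the top-level XID, so this sketch (an assumption about the eventual patch, not code from it) uses the equivalent public check:

#include "postgres.h"
#include "access/transam.h"
#include "access/xact.h"
#include "access/xlog.h"

static bool
TransactionWroteLocally(void)
{
    /*
     * True if we generated WAL and our top-level XID is valid, so that a
     * commit record will be written for this transaction.
     */
    return XactLastRecEnd != 0 &&
           TransactionIdIsValid(GetTopTransactionIdIfAny());
}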
On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >> (1) >> Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. > > Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if > we did any writes on the local node which require atomic commit. Will > fix. > I haven't checked how much code it takes to track whether the local transaction wrote anything. But probably we can postpone this optimization. If it's easy to incorporate, it's good to have in the first set itself. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Jan 1, 2018 at 7:12 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >>> (1) >>> Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. >> >> Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if >> we did any writes on the local node which require atomic commit. Will >> fix. >> > > I haven't checked how much code it takes to track whether the local > transaction wrote anything. But probably we can postpone this > optimization. If it's easy to incorporate, it's good to have in the > first set itself. > Without tracking local writes, we always have to use two-phase commit even when the transaction writes data on only one foreign server. That would cause unnecessary performance degradation and transaction failures on existing systems. We can enable two-phase commit per foreign server with ALTER SERVER, but that affects other transactions. If we could configure it per transaction, perhaps it would be worth postponing. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
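If the setting were stored as an ordinary foreign-server option (an assumption; the patch may store it elsewhere), checking it at commit time could look like the following sketch, for a server configured with something like ALTER SERVER loopback OPTIONS (ADD two_phase_commit 'on'):

#include "postgres.h"
#include "commands/defrem.h"
#include "foreign/foreign.h"
#include "nodes/pg_list.h"

static bool
ServerUsesTwoPhaseCommit(Oid serverid)
{
    ForeignServer *server = GetForeignServer(serverid);
    ListCell   *lc;

    foreach(lc, server->options)
    {
        DefElem    *def = (DefElem *) lfirst(lc);

        if (strcmp(def->defname, "two_phase_commit") == 0)
            return defGetBoolean(def);
    }

    return false;               /* 2PC stays off unless explicitly enabled */
}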
On Wed, Dec 27, 2017 at 9:40 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > (1) > Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. If I understand correctly, XactLastRecEnd can be set by, for example, a HOT cleanup record, so that doesn't seem like a good thing to use. Whether we need to use 2PC across remote nodes seems like it shouldn't depend on whether a local SELECT statement happened to do a HOT cleanup or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 9, 2018 at 11:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Dec 27, 2017 at 9:40 PM, Tsunakawa, Takayuki > <tsunakawa.takay@jp.fujitsu.com> wrote: >> (1) >> Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine whether the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2PC is unnecessary. > > If I understand correctly, XactLastRecEnd can be set by, for example, > a HOT cleanup record, so that doesn't seem like a good thing to use. Yes, that's right. > Whether we need to use 2PC across remote nodes seems like it shouldn't > depend on whether a local SELECT statement happened to do a HOT > cleanup or not. So I think we need to check if the top transaction is invalid or not as well. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Jan 9, 2018 at 9:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> If I understand correctly, XactLastRecEnd can be set by, for example, >> a HOT cleanup record, so that doesn't seem like a good thing to use. > > Yes, that's right. > >> Whether we need to use 2PC across remote nodes seems like it shouldn't >> depend on whether a local SELECT statement happened to do a HOT >> cleanup or not. > > So I think we need to check if the top transaction is invalid or not as well. Even if you check both, it doesn't sound like it really does what you want. Won't you still end up partially dependent on whether a HOT cleanup happened, if not in quite the same way as before? How about defining a new bit in MyXactFlags for XACT_FLAGS_WROTENONTEMPREL? Just have heap_insert, heap_update, and heap_delete do something like: if (RelationNeedsWAL(relation)) MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; Overall, what's the status of this patch? Are we hung up on this issue only, or are there other things? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 8, 2018 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 9, 2018 at 9:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> If I understand correctly, XactLastRecEnd can be set by, for example, >>> a HOT cleanup record, so that doesn't seem like a good thing to use. >> >> Yes, that's right. >> >>> Whether we need to use 2PC across remote nodes seems like it shouldn't >>> depend on whether a local SELECT statement happened to do a HOT >>> cleanup or not. >> >> So I think we need to check if the top transaction is invalid or not as well. > > Even if you check both, it doesn't sound like it really does what you > want. Won't you still end up partially dependent on whether a HOT > cleanup happened, if not in quite the same way as before? How about > defining a new bit in MyXactFlags for XACT_FLAGS_WROTENONTEMPREL? > Just have heap_insert, heap_update, and heap_delete do something like: > > if (RelationNeedsWAL(relation)) > MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; Agreed. > > Overall, what's the status of this patch? Are we hung up on this > issue only, or are there other things? AFAIK there are no more technical issues in this patch so far other than this issue. The patch has tests and docs, and includes all the pieces needed to support atomic commit for distributed transactions: it introduces both the atomic commit ability for distributed transactions and some corresponding FDW APIs, and makes postgres_fdw support 2PC. I think this patch needs to be reviewed, especially the functionality of foreign transaction resolution, which was re-designed earlier. The previous patches don't apply cleanly to current HEAD, so I've fixed some issues. Attached is the latest patch set. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> Overall, what's the status of this patch? Are we hung up on this >> issue only, or are there other things? > > AFAIK there are no more technical issues in this patch so far other than > this issue. The patch has tests and docs, and includes all the pieces needed to > support atomic commit for distributed transactions: it introduces > both the atomic commit ability for distributed transactions and some > corresponding FDW APIs, and makes postgres_fdw support 2PC. I think > this patch needs to be reviewed, especially the functionality of > foreign transaction resolution, which was re-designed earlier. OK. I'm going to give 0002 a read-through now, but I think it would be a good thing if others also contributed to the review effort. There is a lot of code here, and there are a lot of other patches competing for attention. That said, here we go: In the documentation for pg_prepared_fdw_xacts, the first two columns have descriptions ending in a preposition. That's typically to be avoided in formal writing. The first one can be fixed by moving "in" before "which". The second can be fixed by changing "that" to "with which" and then dropping the trailing "with". The first three columns have descriptions ending in a period; the latter two do not. Make it consistent with whatever the surrounding style is, or at least internally consistent if the surrounding style varies. Also, some of the descriptions begin with "the" and others do not; again, seems better to be consistent and adhere to surrounding style. The documentation of the serverid column seems to make no sense. Possibly you mean "OID of the foreign server on which this foreign transaction is prepared"? As it is, you use "foreign server" twice, which is why I'm confused. The documentation of max_prepared_foreign_transactions seems a bit brief. I think that if I want to be able to use 2PC for N transactions each across K servers, this variable needs to be set to N*K, not just N. That's not the right way to word it here, but somehow you probably want to explain that a single local transaction can give rise to multiple foreign transactions and that this should be taken into consideration when setting a value for this variable. Maybe also include a link to where the user can find more information, which brings me to another point: there doesn't seem to be any general, user-facing explanation of this system. You explain the catalogs, the GUCs, the interface, etc. but there's nothing anywhere that explains the overall point of the whole thing, which seems pretty important. The closest thing you've got is a description for people writing FDWs, but we really need a place to explain the concept to *users*. One idea is to add a new chapter to the "Server Administration" section, maybe just after "Logical Replication" and before "Regression Tests". But I'm open to other ideas. It's important that the documentation of the various GUCs provide users with some clue as to how to set them. I notice this particularly for foreign_transaction_resolution_interval; off the top of my head, without having read the rest of this patch, I don't know why I shouldn't want this to be zero. But the others could use more explanation as well. It is unclear from reading the documentation for GetPreparedId why this should be the responsibility of the FDW, or how the FDW is supposed to guarantee uniqueness. PrepareForignTransaction is spelled wrong. Nearby typos: prepareing, tranasaction. 
A bit further down, "can _prepare" has an unwanted space in the middle. Various places in this section capitalize certain words in the middle of sentences which is not standard English style. For example, in "Every Foreign Data Wrapper is required..." the word "Every" is appropriately capitalized because it begins a sentence, but there's no reason to capitalize the others. Likewise for "...employs Two-phase commit protocol..." and other similar cases. EndForeignTransaction doesn't explain what sorts of things the FDW can legally do when called, or how this method is intended to be used. Those seem like important details. Especially, an FDW that tries to do arbitrary stuff during abort cleanup will probably cause big problems. The fdw-transactions section of the documentation seems to imply that henceforth every FDW must call FdwXactRegisterForeignServer, which I think is an unacceptable API break. It doesn't seem advisable to make this behavior depend on max_prepared_foreign_transactions. I think that it should be an server-level option: use 2PC for this server, or not? FDWs that don't have 2PC default to "no"; but others can be set to "yes" if the user wishes. But we shouldn't force that decision to be on a cluster-wide basis. + <xref linkend="functions-fdw-transaction-table"/> shows the functions + available for foreign transaction managements. management + Resolve a foreign transaction. This function search foreign transaction searches for + matching the criteria and resolve then. This function doesn't resolve critera->arguments, resolve->resolves, doesn't->won't + an entry of which transaction is in-progress, or that is locked by some a foreign transaction which is in progress, or one that is locked by some This doesn't seem like a great API contract. It would be nice to have the guarantee that, if the function returns without error, all transactions that were prepared before this function was run and which match the given arguments are now resolved. Skipping locked transactions removes that guarantee. + This function works the same as <function>pg_resolve_fdw_xact</function> + except it remove foreign transaction entry without resolving. Explain why that's useful. + <entry>OID of the database that the foreign transaction resolver process connects to</entry> to which the ... is connected + <entry>Time of last resolved a foreign transaction</entry> Time at which the process last resolved a foreign transaction + of foreign trasactions. The new wait events aren't documented. Spelling error. + * This module manages the transactions involving foreign servers. Remove this. Doesn't add any information. + * This comment summarises how the transaction manager handles transactions + * involving one or more foreign servers. This too. + * connection is identified by oid fo foreign server and user. fo -> of + * first phase doesn not succeed for whatever reason, the foreign servers doesn -> does But more generally: + * The commit is executed in two phases. In the first phase executed during + * pre-commit phase, transactions are prepared on all the foreign servers, + * which can participate in two-phase commit protocol. Transaction on other + * foreign servers are committed in the same phase. In the second phase, if + * first phase doesn not succeed for whatever reason, the foreign servers + * are asked to rollback respective prepared transactions or abort the + * transactions if they are not prepared. This process is executed by backend + * process that executed the first phase. 
If the first phase succeeds, the + * backend process registers ourselves to the queue in the shared memory and then + * ask the foreign transaction resolver process to resolve foreign transactions + * that are associated with the its transaction. After resolved all foreign + * transactions by foreign transaction resolve process the backend wakes up + * and resume to process. The only way this can be reliable, I think, is if we prepare all of the remote transactions before committing locally and commit them after committing locally. Otherwise, if we crash or fail before committing locally, our atomic commit is no longer atomic. I think the way this should work is: during pre-commit, we prepare the transaction everywhere. After committing or rolling back, we notify the resolver process and tell it to try to commit or roll back those transactions. If we ask it to commit, we also tell it to notify us when it's done, so that we can wait (interruptibly) for it to finish, and so that we're not trying to locally do work that might fail with an ERROR after already committed (that would confuse the client). If the user hits ^C, then we handle it like synchronous replication: we emit a WARNING saying that the transaction might not be remotely committed yet, and return success. I see you've got that logic in FdwXactWaitToBeResolved() so maybe this comment isn't completely in sync with the latest version of the patch, but I think there are some remaining ordering problems as well; see below. I think it is, generally, confusing to describe this process as having two phases. For one thing, two-phase commit has two phases, and they're not these two phases, but we're talking about them in a patch about two-phase commit. But also, they really aren't two phases. Saying something has two phases means that A happens and then B happens. But, at least as this is presently described here, B might not need to happen at all. So that's not really a phase. I think you need a different word here, like maybe "ways", unless I'm just misunderstanding what this whole thing is saying. + * * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions() + * have been modified to go through fdw_xact->inredo entries that have + * not made to disk yet. This doesn't seem to be true. I see no reference to these functions being modified elsewhere in the patch. Nor is it clear why they would need to be modified. For local 2PC, prepared transactions need to be included in all snapshots that are taken. Otherwise, if a local 2PC transaction were committed, concurrent transactions would see the effects of that transaction appear all at once, even though they hadn't gotten a new snapshot. That is the reason why we need StandbyRecoverPreparedTransactions() before opening for hot standby. But for FDW 2PC, even if we knew which foreign transactions were prepared but not yet committed, we have no mechanism for preventing those changes from being visible on the remote servers, nor do they have any effect on local visibility. So there's no need for this AFAICS. Similarly, this appears to me to be incorrect: + RecoveryRequiresIntParameter("max_prepared_foreign_transactions", + max_prepared_foreign_xacts, + ControlFile->max_prepared_foreign_xacts); I might be confused here, but it seems to me that the value of max_prepared_foreign_transactions is irrelevant to whether we can initialize Hot Standby, because, again, those remote xacts have no effect on our local snapshot. 
Rather, the problem is that we are *altogether unable to proceed with recovery* if this value is too low, regardless of whether we are doing Hot Standby or not. twophase.c handles that by just erroring out in PrepareRedoAdd() if we run out of slots, and insert_fdw_xact does the same thing (although that message doesn't follow style guidelines -- no space before a colon, please!). So it seems to me that you can just delete this code and the associated documentation mention; these concerns are irrelevant here and the actual failure case is otherwise handled. + * We save them to disk and alos set fdw_xact->ondisk to true. + * * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions() + errmsg("prepread foreign transactions are disabled"), + errmsg("out of foreign trasanction resolver slots"), More typos: alos, RecoverPreparedTrasactions, prepread, trasanction. +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> + +#include "postgres.h" Thou shalt not have any #includes before "postgres.h" +#include "miscadmin.h" +#include "funcapi.h" These should be in the main alphabetized list. If that doesn't work, then some header is broken. + if (fdw_conn->serverid == serverid && fdw_conn->userid == userid) + { + fdw_conn->modified |= modify; I suggest avoiding |= with Boolean. It might be harmless, but it would be just as easy to write this as if (existing conditions && !fdw_conn->modified) fdw_conn->modified = true, which avoids any assumption about the bit patterns. + max_hash_size = max_prepared_foreign_xacts; + init_hash_size = max_hash_size / 2; I think we decided that hash tables other than the lock manager should initialize with maximum size = initial size. See 7c797e7194d969f974abf579cacf30ffdccdbb95. + if (list_length(MyFdwConnections) < 1) How about if (MyFdwConnections == NIL)? This occurs multiple times in the patch. + if (list_length(MyFdwConnections) == 0) How about if (MyFdwConnections == NIL)? + if ((list_length(MyFdwConnections) > 1) || + (list_length(MyFdwConnections) == 1 && (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL))) + return true; I think this would be clearer written this way: int nserverswritten = list_length(MyFdwConnections); if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0) ++nserverswritten; return nserverswritten > 1; But that brings up another issue: why is MyFdwConnections named that way and why does it have those semantics? That is, why do we really need a list of every FDW connection? I think we really only need the ones that are 2PC-capable writes. If a non-2PC-capable foreign server is involved in the transaction, then we don't really need to keep it in a list. We just need to flag the transaction as not locally preparable i.e. clear TwoPhaseReady. I think that this could all be figured out in a much less onerous way: if we ever perform a modification of a foreign table, have nodeModifyTable.c either mark the transaction non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign server is not 2PC capable, or otherwise add the appropriate information to MyFdwConnections, which can then be renamed to indicate that it contains only information about preparable stuff. Then you don't need each individual FDW to be responsible for calling some new function; the core code just does the right thing automatically. 
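A sketch of the registration scheme just outlined, done once per modified foreign table from nodeModifyTable.c so that individual FDWs have nothing new to call; XACT_FLAGS_FDWNOPREPARE, ServerSupportsTwoPhaseCommit(), and FdwXactRegisterParticipant() are hypothetical names:

#include "postgres.h"
#include "access/xact.h"
#include "foreign/foreign.h"
#include "miscadmin.h"
#include "utils/rel.h"

static void
RegisterFdwXactParticipant(Relation rel)
{
    ForeignTable   *table = GetForeignTable(RelationGetRelid(rel));
    ForeignServer  *server = GetForeignServer(table->serverid);

    if (!ServerSupportsTwoPhaseCommit(server))  /* hypothetical check */
    {
        /* hypothetical flag: this transaction can no longer be prepared */
        MyXactFlags |= XACT_FLAGS_FDWNOPREPARE;
    }
    else
    {
        /* hypothetical: record a preparable participant for commit time */
        FdwXactRegisterParticipant(server->serverid, GetUserId());
    }
}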
+ if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid, + fdw_conn->umid, true)) + elog(WARNING, "could not commit transaction on server %s", + fdw_conn->servername); First, as I noted above, this assumes that the local transaction can't fail after this point, which is certainly false -- if nothing else, think about a plug pull. Second, it assumes that the right thing to do would be to throw a WARNING, which I think is false: if this is running in the pre-commit phase, it's not too late to switch to the abort path, and certainly if we make changes on only 1 server and the commit on that server fails, we should be rolling back, not committing with a warning. Third, if we did need to restrict ourselves to warnings here, that's probably impractical. This function needs to do network I/O, which is not a no-fail operation, and even if it works, the remote side can fail in any arbitrary way. +FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid, This is not a very clear function name. Apparently, the thing that is doing the registration is an FdwXact, and the thing being registered is also an FdwXact. + /* + * Between FdwXactRegisterFdwXact call above till this backend hears back + * from foreign server, the backend may abort the local transaction + * (say, because of a signal). During abort processing, it will send + * an ABORT message to the foreign server. If the foreign server has + * not prepared the transaction, the message will succeed. If the + * foreign server has prepared transaction, it will throw an error, + * which we will ignore and the prepared foreign transaction will be + * resolved by a foreign transaction resolver. + */ + if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid, + fdw_conn->umid, fdw_xact_id)) + { Again, I think this is an impractical API contract. It assumes that prepare_foreign_xact can't throw errors, which is likely to make for a lot of implementation problems on the FDW side -- you can't even palloc. You can't call any existing code you've already got that might throw an error for any reason. You definitely can't do a syscache lookup. You can't accept interrupts, even though you're doing network I/O that could hang. The only reason to structure it this way is to avoid having a transaction that we think is prepared in our local bookkeeping when, on the remote side, it really isn't. But that is an unavoidable problem, because the whole system could crash after the remote prepare has been done and before prepare_foreign_xact even returns control. Or we could fail after XLogInsert() and before XLogFlush(). AtEOXact_FdwXacts has the same problem. I think you can make this a lot cleaner if you make it part of the API contract that the resolver shouldn't fail when asked to roll back a transaction that was never prepared. Then it can work like this: just before we prepare the remote transaction, we note the information in shared memory and in the WAL. If that fails, we just abort. When the resolver sees that our xact is no longer running and did not commit, it concludes that we must have failed and tries to roll back all the remotely-prepared xacts. If they don't exist, then the PREPARE failed; if they do, then we roll 'em back. On the other hand, if all of the prepares succeed, then we commit. Seeing that our XID is no longer running and did commit, the resolver tries to commit those remotely-prepared xacts. 
In the commit case, we log a complaint if we don't find the xact, but in the abort case, it's got to be an expected outcome. If we really wanted to get paranoid about this, we could log a WAL record after finishing all the prepares -- or even each prepare -- saying "hey, after this point the remote xact should definitely exist!". And then the resolver could complain about a nonexistent remote xact when rolling back only if that record hasn't been issued yet. But to me that looks like overkill; the window between issuing that record and issuing the actual commit record would be very narrow, and in most cases they would be flushed together anyway. We could of course force the new record to be separately flushed first, but that's just making everything slower for no gain. Note that this requires that we not truncate away portions of clog that contain commit status information about no-longer-running transactions that have unresolved FDW 2PC xacts, or at least that we issue a WAL record updating the state of the fdw_xact so that it doesn't refer to that portion of clog any more - e.g. by setting the XID to either FrozenTransactionId or InvalidTransactionId, though that would be a problem since it would break the unique-XID assumption. I don't see the patch doing either of those things right now, although I may have missed it. Note that here again the differences between local 2PC and FDW 2PC really make a difference. Local 2PC has a PGPROC+PGXACT, so the regular snapshot-taking code suffices to prevent clog truncation, because the relevant XIDs are showing up in snapshots. The PrescanPreparedTransactions stuff only needs to nail things down until we reach consistency, and then the regular mechanisms take over. We have no such easy out for this system. + if (unlink(path)) + if (errno != ENOENT || giveWarning) Poor style. Use &&, not nested ifs. Oh, I guess you copied this from the twophase.c code; but let's fix it anyway. It's not exactly clear to me what the point of "locked" FdwXacts is, but whatever that point may be, how can remove_fdw_xact() get away with summarily releasing a lock that may be held by some other process? If we're the process that has the FdwXact locked, then we'll delete it from MyLockedFdwXacts, but if some other process has it locked, nothing will happen. If that's safe for some non-obvious reason, it at least needs a comment. I think this whole function could be written with a lot less nesting if you first write a loop to find the appropriate value for cnt, then error out if we end up with cnt >= FdwXactCtl->numFdwXacts, and then finally do all of the stuff that happens once we identify a match. That saves two levels of indentation for most of the function. The delayCkpt interlocking which twophase.c uses is absent here. That's important, because otherwise a checkpoint can happen between the time we write to WAL and the time we actually perform the on-disk operations. If a crash happens just before the operation is actually performed, then it never happens on the master but still happens on the standbys. Oops. +void +FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid) +{ + FdwXact fdw_xact; + + Assert(RecoveryInProgress()); + + fdw_xact = get_fdw_xact(xid, serverid, userid); + + if (fdw_xact) + { + /* Now we can clean up any files we already left */ + Assert(fdw_xact->inredo); + remove_fdw_xact(fdw_xact); + } + else + { + /* + * Entry could be on disk. Call with giveWarning = false + * since it can be expected during replay. 
+ */ + RemoveFdwXactFile(xid, serverid, userid, false); + } +} I hope this won't sound too harsh, but the phrase that comes to mind here is "spaghetti code". First, we look for a matching FdwXact in shared memory and, if we find one, do all of the cleanup inside remove_fdw_xact() which also removes it from the disk. Otherwise, we try to remove it from disk anyway. if (condition) do_a_and_b(); else do_b(); is not generally a good way to structure code. Moreover, it's not clear why we should be doing it like this in the first place. There's no similar logic in twophase.c; PrepareRedoRemove does nothing on disk if the state isn't found in memory. The fact that you've got this set up this way suggests that you haven't structured things so as to guarantee that the in-memory state is always accurate. If so, that should be fixed; if not, this code isn't needed anyway. + if (!fdwXactExitRegistered) + { + before_shmem_exit(AtProcExit_FdwXact, 0); + fdwXactExitRegistered = true; + } Sadly, this code has a latent hazard. If somebody ever calls this from inside a PG_ENSURE_ERROR_CLEANUP() block, then they can end up failing to unregister their handler, because of the limitations described in cancel_before_shmem_exit()'s comments. It's better to do this in FdwXactShmemInit(). + if (list_length(entries_to_resolve) == 0) Here again, just test against NIL. fdwxact_resolver.c is very light on comments. +static +void FdwXactRslvLoop(void) Not project style. There are other, similar instances. + need_twophase = TwoPhaseCommitRequired(); I think this nomenclature is going to cause confusion. We need to distinguish somehow between using remote 2PC for foreign transactions and using local 2PC. The TwoPhase naming is already well-established as referring to the latter, so I think we should name this some other way. + if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid)) + { + Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp); + ereport(ERROR, + (errmsg("server \"%s\" has unresolved prepared transactions on it", + NameStr(srvForm->srvname)))); + } I think if this happens, it would be more appropriate to just issue a WARNING and forget about those transactions. Blocking DROP is not nice, and shouldn't be done without a really good reason. + (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0"))); Bogus capitalization. +#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8) I am very much not impressed by this uncommented macro definition. You can probably guess the reason. :-) + ereport(WARNING, (errmsg("could not resolve dangling foreign transaction for xid %u, foreign server %u and user %d", + fdwxact->local_xid, fdwxact->serverid, fdwxact->userid))); Formatting is wrong. My ability to concentrate on this patch is just about exhausted for today so I think I'm going to have to stop here. But in general I would say this patch still needs a lot of work. As noted above, the concurrency, crash-safety, and error-handling issues don't seem to have been thought through carefully enough, and there are even a fairly large number of trivial spelling errors and coding and/or message style violations. Comments are lacking in some places where they are clearly needed. There seems to be a fair amount of work needed to ensure that each thing has exactly one name: not two, and not a shared name with something else, and that all of those names are clear. There are a few TODO items remaining in the code. 
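As an aside on the FDW_XACT_ID_LEN complaint above, one plausible commented form — assuming, and this is only a guess at the identifier format, identifiers like "fx_<xid>_<serverid>_<userid>" with each value printed as eight hex digits — would be:

/* "fx" + "_" + 8 hex digits (xid) + "_" + 8 (serverid) + "_" + 8 (userid) */
#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

static void
FdwXactBuildId(char *buf, TransactionId xid, Oid serverid, Oid userid)
{
    /* buf must have room for FDW_XACT_ID_LEN + 1 bytes */
    snprintf(buf, FDW_XACT_ID_LEN + 1, "fx_%08x_%08x_%08x",
             (unsigned int) xid, (unsigned int) serverid,
             (unsigned int) userid);
}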
I think that it is going to take a significant effort to get all of this cleaned up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
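Pulling together the commit ordering and resolver contract proposed in the review above, a minimal sketch of the resolver-side decision might look like this; FdwXactData and its resolve callback are invented stand-ins for the patch's structures, and the key points are deciding purely from the local commit status in clog and tolerating a missing remote transaction only in the rollback case:

#include "postgres.h"
#include "access/transam.h"

#define FDW_XACT_ID_MAXLEN 64           /* assumption for this sketch */

typedef struct FdwXactData
{
    TransactionId   local_xid;
    Oid             serverid;
    Oid             userid;
    char            fdwxact_id[FDW_XACT_ID_MAXLEN];
    /* hypothetical FDW callback: COMMIT/ROLLBACK PREPARED; false if not found */
    bool          (*resolve) (Oid serverid, Oid userid,
                              const char *fdwxact_id, bool is_commit);
} FdwXactData, *FdwXact;

static void
ResolveOneFdwXact(FdwXact fdwxact)
{
    /* The local commit record is the sole source of truth. */
    bool        commit = TransactionIdDidCommit(fdwxact->local_xid);
    bool        found;

    found = fdwxact->resolve(fdwxact->serverid, fdwxact->userid,
                             fdwxact->fdwxact_id, commit);

    if (!found && commit)
        ereport(WARNING,
                (errmsg("prepared foreign transaction \"%s\" was not found on the foreign server",
                        fdwxact->fdwxact_id)));
    /* In the rollback case, a missing remote xact means PREPARE never completed. */
}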
On Sat, Feb 10, 2018 at 4:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> Overall, what's the status of this patch? Are we hung up on this >>> issue only, or are there other things? >> >> AFAIK there are no more technical issues in this patch so far other than >> this issue. The patch has tests and docs, and includes all the pieces needed to >> support atomic commit for distributed transactions: it introduces >> both the atomic commit ability for distributed transactions and some >> corresponding FDW APIs, and makes postgres_fdw support 2PC. I think >> this patch needs to be reviewed, especially the functionality of >> foreign transaction resolution, which was re-designed earlier. > > OK. I'm going to give 0002 a read-through now, but I think it would > be a good thing if others also contributed to the review effort. > There is a lot of code here, and there are a lot of other patches > competing for attention. That said, here we go: I appreciate your review. I'll thoroughly update the whole patch based on your comments and suggestions. Here are my answers, questions, and understanding. > > The fdw-transactions section of the documentation seems to imply that > henceforth every FDW must call FdwXactRegisterForeignServer, which I > think is an unacceptable API break. > > It doesn't seem advisable to make this behavior depend on > max_prepared_foreign_transactions. I think that it should be a > server-level option: use 2PC for this server, or not? FDWs that don't > have 2PC default to "no"; but others can be set to "yes" if the user > wishes. But we shouldn't force that decision to be on a cluster-wide > basis. Since I've added a new option two_phase_commit to postgres_fdw, we need to ask the FDW whether the foreign server is 2PC-capable or not in order to register the foreign server information. That's why the patch asked FDWs to call FdwXactRegisterForeignServer. However, we can register the foreign server information automatically in the executor (e.g. at BeginDirectModify and at BeginForeignModify) if the foreign server itself has that information. We can add a two_phase_commit_enabled column to the pg_foreign_server system catalog, and that column is set to true if the foreign server is 2PC-capable (i.e. has the required functions) and the user wants to use it. > > But that brings up another issue: why is MyFdwConnections named that > way and why does it have those semantics? That is, why do we really > need a list of every FDW connection? I think we really only need the > ones that are 2PC-capable writes. If a non-2PC-capable foreign server > is involved in the transaction, then we don't really need to keep it in a > list. We just need to flag the transaction as not locally preparable > i.e. clear TwoPhaseReady. I think that this could all be figured out > in a much less onerous way: if we ever perform a modification of a > foreign table, have nodeModifyTable.c either mark the transaction > non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign > server is not 2PC capable, or otherwise add the appropriate > information to MyFdwConnections, which can then be renamed to indicate > that it contains only information about preparable stuff. Then you > don't need each individual FDW to be responsible for calling some > new function; the core code just does the right thing automatically. I could not understand this comment. Did you mean that foreign transactions on non-2PC-capable foreign servers should be ended in the same way as before (i.e. 
by XactCallback)? Currently, because there is no FDW API to end a foreign transaction, most FDWs use XactCallbacks to end the transaction. But once the new FDW APIs are introduced, I think it's better to use the FDW APIs to end transactions rather than XactCallbacks. Otherwise we end up having FDW APIs for 2PC (prepare and resolve) and an XactCallback for ending the transaction, which would be hard to understand. So I've changed the foreign transaction management so that the core code explicitly asks the FDW to end/prepare a foreign transaction instead of leaving that to individual FDWs. With the new FDW APIs, the core code can have the information about all foreign servers involved in the transaction and call each API at the appropriate timing. > > + if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid, > + fdw_conn->umid, true)) > + elog(WARNING, "could not commit transaction on server %s", > + fdw_conn->servername); > > First, as I noted above, this assumes that the local transaction can't > fail after this point, which is certainly false -- if nothing else, > think about a plug pull. Second, it assumes that the right thing to > do would be to throw a WARNING, which I think is false: if this is > running in the pre-commit phase, it's not too late to switch to the > abort path, and certainly if we make changes on only 1 server and the > commit on that server fails, we should be rolling back, not committing > with a warning. Third, if we did need to restrict ourselves to > warnings here, that's probably impractical. This function needs to do > network I/O, which is not a no-fail operation, and even if it works, > the remote side can fail in any arbitrary way. > > > + /* > + * Between FdwXactRegisterFdwXact call above till this > backend hears back > + * from foreign server, the backend may abort the local transaction > + * (say, because of a signal). During abort processing, it will send > + * an ABORT message to the foreign server. If the foreign server has > + * not prepared the transaction, the message will succeed. If the > + * foreign server has prepared transaction, it will throw an error, > + * which we will ignore and the prepared foreign transaction will be > + * resolved by a foreign transaction resolver. > + */ > + if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, > fdw_conn->userid, > + fdw_conn->umid, fdw_xact_id)) > + { > > Again, I think this is an impractical API contract. It assumes that > prepare_foreign_xact can't throw errors, which is likely to make for a > lot of implementation problems on the FDW side -- you can't even > palloc. You can't call any existing code you've already got that > might throw an error for any reason. You definitely can't do a > syscache lookup. You can't accept interrupts, even though you're > doing network I/O that could hang. The only reason to structure it > this way is to avoid having a transaction that we think is prepared in > our local bookkeeping when, on the remote side, it really isn't. But > that is an unavoidable problem, because the whole system could crash > after the remote prepare has been done and before prepare_foreign_xact > even returns control. Or we could fail after XLogInsert() and before > XLogFlush(). > > AtEOXact_FdwXacts has the same problem. > > I think you can make this a lot cleaner if you make it part of the > API contract that the resolver shouldn't fail when asked to roll back a > transaction that was never prepared. 
> Then it can work like this: just > before we prepare the remote transaction, we note the information in > shared memory and in the WAL. If that fails, we just abort. When the > resolver sees that our xact is no longer running and did not commit, > it concludes that we must have failed and tries to roll back all the > remotely-prepared xacts. If they don't exist, then the PREPARE > failed; if they do, then we roll 'em back. On the other hand, if all > of the prepares succeed, then we commit. Seeing that our XID is no > longer running and did commit, the resolver tries to commit those > remotely-prepared xacts. In the commit case, we log a complaint if we > don't find the xact, but in the abort case, it's got to be an expected > outcome. If we really wanted to get paranoid about this, we could log > a WAL record after finishing all the prepares -- or even each prepare > -- saying "hey, after this point the remote xact should definitely > exist!". And then the resolver could complain about a nonexistent remote > xact when rolling back only if that record hasn't been issued yet. > But to me that looks like overkill; the window between issuing that > record and issuing the actual commit record would be very narrow, and > in most cases they would be flushed together anyway. We could of > course force the new record to be separately flushed first, but that's > just making everything slower for no gain. In FdwXactResolveForeignTransaction(), the resolver decides the fate of the transaction by examining the status of the fdwxact entry and the state of the local transaction in clog. What I need to do is make that function log a complaint in the commit case if it cannot find the prepared transaction, and not do that in the abort case. Also, postgres_fdw doesn't raise an error even if we cannot find the prepared transaction on the foreign server, because it might have been resolved by another process. But this is currently the FDW's responsibility. I should move it to the resolver side. That is, the FDW can raise an error in the ordinary way, but the core code should catch and process it. > > Note that this requires that we not truncate away portions of clog > that contain commit status information about no-longer-running > transactions that have unresolved FDW 2PC xacts, or at least that we > issue a WAL record updating the state of the fdw_xact so that it > doesn't refer to that portion of clog any more - e.g. by setting the > XID to either FrozenTransactionId or InvalidTransactionId, though that > would be a problem since it would break the unique-XID assumption. I > don't see the patch doing either of those things right now, although I > may have missed it. Note that here again the differences between > local 2PC and FDW 2PC really make a difference. Local 2PC has a > PGPROC+PGXACT, so the regular snapshot-taking code suffices to prevent > clog truncation, because the relevant XIDs are showing up in > snapshots. The PrescanPreparedTransactions stuff only needs to nail > things down until we reach consistency, and then the regular > mechanisms take over. We have no such easy out for this system. > You're right. Perhaps we can deal with it by PrescanFdwXacts until we reach a consistent point, and then have vac_update_datfrozenxid check the local xids of unresolved fdwxacts to determine the new datfrozenxid. Since the local xids of unresolved fdwxacts are not relevant to vacuuming, we don't need to include them in snapshots, GetOldestXmin, etc. Also, we can hint that fdwxacts should be resolved when nearing wraparound. 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
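To make the resolver behavior described above concrete, here is a minimal C sketch of the per-entry decision, assuming each fdwxact entry records the local XID it belongs to. FdwXactEntry, resolve_fdwxact() and the missing_ok flag are illustrative names, not symbols from the actual patch; only the clog-probing calls are real PostgreSQL functions.

#include "postgres.h"
#include "access/transam.h"     /* TransactionIdDidCommit / TransactionIdDidAbort */

/* Illustrative stand-in for the patch's shared-memory fdwxact entry. */
typedef struct FdwXactEntry
{
    TransactionId local_xid;    /* local transaction this remote xact belongs to */
    /* ... server id, user id, foreign transaction identifier, etc. ... */
} FdwXactEntry;

/* Assumed worker helper that issues COMMIT/ROLLBACK PREPARED via the FDW. */
extern void resolve_fdwxact(FdwXactEntry *entry, bool commit, bool missing_ok);

static void
FdwXactResolveOne(FdwXactEntry *entry)
{
    TransactionId xid = entry->local_xid;

    if (TransactionIdDidCommit(xid))
    {
        /*
         * The local transaction committed, so every remote PREPARE must have
         * succeeded: a missing prepared transaction deserves a complaint.
         */
        resolve_fdwxact(entry, true, false);
    }
    else if (TransactionIdDidAbort(xid))
    {
        /*
         * The local transaction aborted, so the remote PREPARE may never have
         * completed; a missing prepared transaction is an expected outcome.
         */
        resolve_fdwxact(entry, false, true);
    }
    /* otherwise the local transaction is still in flight; retry later */
}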
On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The fdw-transactions section of the documentation seems to imply that henceforth every FDW must call FdwXactRegisterForeignServer, which I think is an unacceptable API break.
>>
>> It doesn't seem advisable to make this behavior depend on max_prepared_foreign_transactions. I think that it should be a server-level option: use 2PC for this server, or not? FDWs that don't have 2PC default to "no"; but others can be set to "yes" if the user wishes. But we shouldn't force that decision to be on a cluster-wide basis.
>
> Since I've added a new option two_phase_commit to postgres_fdw, we need to ask the FDW whether the foreign server is a 2PC-capable server or not in order to register the foreign server information. That's why the patch asked the FDW to call FdwXactRegisterForeignServer. However, we can register a foreign server's information automatically in the executor (e.g. at BeginDirectModify and at BeginForeignModify) if the foreign server itself has that information. We can add a two_phase_commit_enabled column to the pg_foreign_server system catalog and set that column to true if the foreign server is 2PC-capable (i.e. has the required functions) and the user wants to use it.

I don't see why this would need a new catalog column.

>> But that brings up another issue: why is MyFdwConnections named that way and why does it have those semantics? That is, why do we really need a list of every FDW connection? I think we really only need the ones that are 2PC-capable writes. If a non-2PC-capable foreign server is involved in the transaction, then we don't really need to keep it in a list. We just need to flag the transaction as not locally preparable, i.e. clear TwoPhaseReady. I think that this could all be figured out in a much less onerous way: if we ever perform a modification of a foreign table, have nodeModifyTable.c either mark the transaction non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign server is not 2PC capable, or otherwise add the appropriate information to MyFdwConnections, which can then be renamed to indicate that it contains only information about preparable stuff. Then you don't need each individual FDW to be responsible about calling some new function; the core code just does the right thing automatically.
>
> I could not get this comment. Did you mean that the foreign transaction on a non-2PC-capable foreign server should be ended in the same way as before (i.e. by XactCallback)?
>
> Currently, because there is no FDW API to end a foreign transaction, almost all FDWs use XactCallbacks to end their transactions. But once the new FDW APIs are introduced, I think it's better to use the FDW APIs to end transactions rather than XactCallbacks. Otherwise we end up having FDW APIs for 2PC (prepare and resolve) and an XactCallback for ending the transaction, which would be hard to understand. So I've changed the foreign transaction management so that the core code explicitly asks the FDW to end/prepare a foreign transaction instead of each FDW ending it on its own. With the new FDW APIs, the core code has the information about all foreign servers involved in the transaction and can call each API at the appropriate time.

Well, it's one thing to introduce a new API. It's another thing to require existing FDWs to be updated to use it.
There are a lot of existing FDWs out there, and I think that it is needlessly unfriendly to force them all to be updated for v11 (or whenever this gets committed) even if we think the new API is clearly better. FDWs that work today should continue working after this patch is committed. Separately, I think there's a question of whether the new API is in fact better -- I'm not sure I have a completely well-formed opinion about that yet.

> In FdwXactResolveForeignTransaction(), the resolver concludes the fate of a transaction by looking at the status of the fdwxact entry and the state of the local transaction in clog. What I need to do is make that function log a complaint in the commit case if it couldn't find the prepared transaction, and not do so in the abort case.

+1.

> Also, postgres_fdw doesn't raise an error even if we could not find the prepared transaction on the foreign server, because it might have been resolved by another process.

+1.

> But right now that is the FDW's responsibility; I should move it to the resolver side. That is, the FDW can raise an error in the ordinary way, and the core code should catch and process it.

I don't understand exactly what you mean here.

> You're right. Perhaps we can deal with it by running PrescanFdwXacts until we reach the consistent point, and then having vac_update_datfrozenxid check the local xids of unresolved fdwxacts to determine the new datfrozenxid. Since the local xids of unresolved fdwxacts would not be relevant to vacuuming, we don't need to include them in snapshots, GetOldestXmin, etc. Also, we can hint to resolve fdwxacts when near wraparound.

I agree with you about snapshots, but I'm not sure about GetOldestXmin.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
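For reference, the nodeModifyTable.c approach Robert suggests in the quoted text might look roughly like the sketch below: on the first write to a foreign table, either register the server for 2PC or flag the transaction as non-preparable. MyXactFlags is real (access/xact.h), but the flag bit, the capability test and the registration call are names proposed in the mail, assumed here rather than existing symbols.

#include "postgres.h"
#include "access/xact.h"        /* MyXactFlags */

#define XACT_FLAGS_FDWNOPREPARE (1U << 2)   /* assumed new flag bit */

/* assumed registration hook from the patch; NULL lets the core pick an id */
extern void FdwXactRegisterForeignServer(Oid serverid, Oid userid,
                                         const char *fdwxact_id);

static void
note_foreign_modify(Oid serverid, Oid userid, bool server_supports_2pc)
{
    if (server_supports_2pc)
        FdwXactRegisterForeignServer(serverid, userid, NULL);   /* core tracks it */
    else
        MyXactFlags |= XACT_FLAGS_FDWNOPREPARE;     /* PREPARE must now fail */
}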
On Wed, Feb 21, 2018 at 6:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> The fdw-transactions section of the documentation seems to imply that henceforth every FDW must call FdwXactRegisterForeignServer, which I think is an unacceptable API break.
>>>
>>> It doesn't seem advisable to make this behavior depend on max_prepared_foreign_transactions. I think that it should be a server-level option: use 2PC for this server, or not? FDWs that don't have 2PC default to "no"; but others can be set to "yes" if the user wishes. But we shouldn't force that decision to be on a cluster-wide basis.
>>
>> Since I've added a new option two_phase_commit to postgres_fdw, we need to ask the FDW whether the foreign server is a 2PC-capable server or not in order to register the foreign server information. That's why the patch asked the FDW to call FdwXactRegisterForeignServer. However, we can register a foreign server's information automatically in the executor (e.g. at BeginDirectModify and at BeginForeignModify) if the foreign server itself has that information. We can add a two_phase_commit_enabled column to the pg_foreign_server system catalog and set that column to true if the foreign server is 2PC-capable (i.e. has the required functions) and the user wants to use it.
>
> I don't see why this would need a new catalog column.

I might be missing your point. As for API breakage, this patch doesn't break any existing FDWs. All the new APIs I proposed are dedicated to 2PC. In other words, FDWs that work today can continue working after this patch gets committed, but if an FDW wants to support atomic commit then it should be updated to use the new APIs. The reason the call to FdwXactRegisterForeignServer is necessary is that the core code controls the foreign servers involved in the transaction, but whether each foreign server uses the 2PC option (two_phase_commit) is known only to the FDW code. We can eliminate the need to call FdwXactRegisterForeignServer by moving the 2PC option from FDW level to server level so as not to force FDWs to call the registration function. If we did that, the user could use the 2PC option as a server-level option.

>>> But that brings up another issue: why is MyFdwConnections named that way and why does it have those semantics? That is, why do we really need a list of every FDW connection? I think we really only need the ones that are 2PC-capable writes. If a non-2PC-capable foreign server is involved in the transaction, then we don't really need to keep it in a list. We just need to flag the transaction as not locally preparable, i.e. clear TwoPhaseReady. I think that this could all be figured out in a much less onerous way: if we ever perform a modification of a foreign table, have nodeModifyTable.c either mark the transaction non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign server is not 2PC capable, or otherwise add the appropriate information to MyFdwConnections, which can then be renamed to indicate that it contains only information about preparable stuff. Then you don't need each individual FDW to be responsible about calling some new function; the core code just does the right thing automatically.
>>
>> I could not get this comment. Did you mean that the foreign transaction on a non-2PC-capable foreign server should be ended in the same way as before (i.e. by XactCallback)?
>>
>> Currently, because there is no FDW API to end a foreign transaction, almost all FDWs use XactCallbacks to end their transactions. But once the new FDW APIs are introduced, I think it's better to use the FDW APIs to end transactions rather than XactCallbacks. Otherwise we end up having FDW APIs for 2PC (prepare and resolve) and an XactCallback for ending the transaction, which would be hard to understand. So I've changed the foreign transaction management so that the core code explicitly asks the FDW to end/prepare a foreign transaction instead of each FDW ending it on its own. With the new FDW APIs, the core code has the information about all foreign servers involved in the transaction and can call each API at the appropriate time.
>
> Well, it's one thing to introduce a new API. It's another thing to require existing FDWs to be updated to use it. There are a lot of existing FDWs out there, and I think that it is needlessly unfriendly to force them all to be updated for v11 (or whenever this gets committed) even if we think the new API is clearly better. FDWs that work today should continue working after this patch is committed.

Agreed.

> Separately, I think there's a question of whether the new API is in fact better -- I'm not sure I have a completely well-formed opinion about that yet.

I think one API should do one job. In terms of keeping the API simple, the current four APIs are not bad. AFAICS other database servers that support 2PC, such as MySQL, Oracle, etc., can be supported with this API. I'm thinking of removing the user mapping id from the arguments of three of the APIs (prepare, resolve and end), because the user mapping id can be derived from {serverid, userid}. Also, we can make the get-prepare-id API optional. That is, if an FDW doesn't define this API, the core code passes px_<random>_<serverid>_<userid> by default. A foreign server that has a short limit on the prepare-id length needs to provide this API.

>> In FdwXactResolveForeignTransaction(), the resolver concludes the fate of a transaction by looking at the status of the fdwxact entry and the state of the local transaction in clog. What I need to do is make that function log a complaint in the commit case if it couldn't find the prepared transaction, and not do so in the abort case.
>
> +1.
>
>> Also, postgres_fdw doesn't raise an error even if we could not find the prepared transaction on the foreign server, because it might have been resolved by another process.
>
> +1.
>
>> But right now that is the FDW's responsibility; I should move it to the resolver side. That is, the FDW can raise an error in the ordinary way, and the core code should catch and process it.
>
> I don't understand exactly what you mean here.

Hmm, I think I got confused. My understanding is that logging a complaint in the commit case, and not doing so in the abort case, when the prepared transaction doesn't exist is the core code's job. The API contract here is that the FDW raises an error with the ERRCODE_UNDEFINED_OBJECT error code if there is no such transaction. Since that is an expected case during abort, the core code of the fdwxact manager can regard the error as not an actual problem. So FDWs can basically raise an error if resolution fails for whatever reason. But postgres_fdw doesn't raise an error when the prepared transaction doesn't exist, because in PostgreSQL a prepared transaction can be ended by another process. The pseudo-code of the resolution part follows.
---
// Core code of foreign transaction resolution
PG_TRY();
{
    call API to resolve foreign transaction
}
PG_CATCH();
{
    if (sqlstate is ERRCODE_UNDEFINED_OBJECT)
    {
        if (we are committing)
            raise ERROR
        else    // we are aborting
            raise WARNING   // this is an expected result
    }
    else
        raise ERROR     // re-throw the error
}
PG_END_TRY();

// postgres_fdw code of prepared transaction resolution
do "COMMIT/ROLLBACK PREPARED"
if (failed to resolve)
{
    if (sqlstate is ERRCODE_UNDEFINED_OBJECT)
    {
        raise WARNING
        return true;    // regard it as succeeded
    }
    else
        raise ERROR     // failed to resolve
}
return true;
---

Or do you mean that FDWs should not raise an error if there is no such prepared transaction, and then the core code doesn't need to check the sqlstate in case of error?

>> You're right. Perhaps we can deal with it by running PrescanFdwXacts until we reach the consistent point, and then having vac_update_datfrozenxid check the local xids of unresolved fdwxacts to determine the new datfrozenxid. Since the local xids of unresolved fdwxacts would not be relevant to vacuuming, we don't need to include them in snapshots, GetOldestXmin, etc. Also, we can hint to resolve fdwxacts when near wraparound.
>
> I agree with you about snapshots, but I'm not sure about GetOldestXmin.

Hmm, although I've thought about what could go wrong if we don't consider the local xids of unresolved fdwxacts in GetOldestXmin, I could not find a problem. Could you share your concern if you have one? I'll try to find a problematic case based on it.

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
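In real C, the core-side catch in the pseudo-code above would presumably use PostgreSQL's standard PG_TRY()/CopyErrorData()/ReThrowError() pattern, roughly as sketched below; the entry type and the callback are illustrative assumptions. Note that Robert objects to this design in his reply further down: catching an error and continuing without a subtransaction is unsafe, and he proposes a return-value contract instead.

#include "postgres.h"

/* Illustrative entry type; the callback is assumed to issue
 * COMMIT PREPARED or ROLLBACK PREPARED on the remote side. */
typedef struct FdwXactEntry FdwXactEntry;
extern void fdw_resolve_callback(FdwXactEntry *entry, bool commit);

static void
resolve_one_foreign_xact(FdwXactEntry *entry, bool commit)
{
    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        fdw_resolve_callback(entry, commit);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* standard pattern: leave the error context before copying */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        if (edata->sqlerrcode == ERRCODE_UNDEFINED_OBJECT && !commit)
        {
            /* a missing prepared xact is expected on the abort path */
            ereport(WARNING,
                    (errmsg("prepared foreign transaction does not exist, assuming it was already resolved")));
            FreeErrorData(edata);
        }
        else
            ReThrowError(edata);    /* commit path, or any other failure */
    }
    PG_END_TRY();
}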
On 2/27/18 2:21 AM, Masahiko Sawada wrote: > > Hmm, although I've thought concern in case where we don't consider > local xids of un-resolved fdwxact in GetOldestXmin, I could not find > problem. Could you share your concern if you have? I'll try to find a > possibility based on it. It appears that this entry should be marked Waiting on Author so I have done that. I also think it may be time to move this patch to the next CF. Regards, -- -David david@pgmasters.net
On Thu, Mar 29, 2018 at 2:27 AM, David Steele <david@pgmasters.net> wrote: > On 2/27/18 2:21 AM, Masahiko Sawada wrote: >> >> Hmm, although I've thought concern in case where we don't consider >> local xids of un-resolved fdwxact in GetOldestXmin, I could not find >> problem. Could you share your concern if you have? I'll try to find a >> possibility based on it. > > It appears that this entry should be marked Waiting on Author so I have > done that. > > I also think it may be time to move this patch to the next CF. > I agree to move this patch to the next CF. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I might be missing your point. As for API breaking, this patch doesn't > break any existing FDWs. All new APIs I proposed are dedicated to 2PC. > In other words, FDWs that work today can continue working after this > patch gets committed, but if FDWs want to support atomic commit then > they should be updated to use new APIs. The reason why the calling of > FdwXactRegisterForeignServer is necessary is that the core code > controls the foreign servers that involved with the transaction but > the whether each foreign server uses 2PC option (two_phase_commit) is > known only on FDW code. We can eliminate the necessity of calling > FdwXactRegisterForeignServer by moving 2PC option from fdw-level to > server-level in order not to enforce calling the registering function > on FDWs. If we did that, the user can use the 2PC option as a > server-level option. Well, FdwXactRegisterForeignServer has a "bool two_phase_commit" argument. If it only needs to be called by FDWs that support 2PC, then that argument is unnecessary. If it needs to be called by all FDWs, then it breaks existing FDWs that don't call it. >>> But this is now responsible by FDW. I should change it to resolver >>> side. That is, FDW can raise error in ordinarily way but core code >>> should catch and process it. >> >> I don't understand exactly what you mean here. > > Hmm I think I got confused. My understanding is that logging a > complaint in commit case and not doing that in abort case if prepared > transaction doesn't exist is a core code's job. An API contract here > is that FDW raise an error with ERRCODE_UNDEFINED_OBJECT error code if > there is no such transaction. Since it's an expected case in abort > case for the fdwxact manager, the core code can regard the error as > not actual problem. In general, it's not safe to catch an error and continue unless you protect the code that throws the error by a sub-transaction. That means we shouldn't expect the FDW to throw an error when the prepared transaction isn't found and then just have the core code ignore the error. Instead the FDW should return normally and, if the core code needs to know whether the prepared transaction was found, then the FDW should indicate this through a return value, not an ERROR. > Or do you mean that FDWs should not raise an error if there is the > prepared transaction, and then core code doesn't need to check > sqlstate in case of error? Right. As noted above, that's unsafe, so we shouldn't do it. >>> You're right. Perhaps we can deal with it by PrescanFdwXacts until >>> reach consistent point, and then have vac_update_datfrozenxid check >>> local xids of un-resolved fdwxact to determine the new datfrozenxid. >>> Since the local xids of un-resolved fdwxacts would not be relevant >>> with vacuuming, we don't need to include it to snapshot and >>> GetOldestXmin etc. Also we hint to resolve fdwxact when near >>> wraparound. >> >> I agree with you about snapshots, but I'm not sure about GetOldestXmin. > > Hmm, although I've thought concern in case where we don't consider > local xids of un-resolved fdwxact in GetOldestXmin, I could not find > problem. Could you share your concern if you have? I'll try to find a > possibility based on it. I don't remember exactly what I was thinking when I wrote that, but I think the point is that GetOldestXmin() does a bunch of things other than control the threshold for VACUUM, and we'd need to study them all and look for problems. 
For example, it won't do for an XID to get reused while it's still associated with an unresolved fdwxact. We therefore certainly need such XIDs to hold back the cluster-wide threshold for clog truncation in some manner -- and maybe that involves GetOldestXmin(). Or maybe not. But anyway the point, broadly considered, is that GetOldestXmin() is used in various ways, and I don't know if we've thought through all of the consequences in regard to this new feature. Can I ask what your time frame is for updating this patch? Considering how much work appears to remain, if you want to get this committed to v12, it would be best to get started as early as possible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
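A sketch of the return-value contract Robert describes above, where a missing prepared transaction is reported through the result rather than an ERROR the core would have to catch; the enum, signature and names are illustrative guesses, not the patch's actual API.

#include "postgres.h"

typedef enum FdwXactResolveResult
{
    FDWXACT_RESOLVED,       /* COMMIT/ROLLBACK PREPARED succeeded */
    FDWXACT_NOT_FOUND       /* no such prepared transaction on the remote */
} FdwXactResolveResult;

typedef FdwXactResolveResult (*ResolvePreparedForeignTransaction_function)
            (Oid serverid, Oid userid, const char *fdwxact_id, bool commit);

/* Core-side use: no PG_TRY/PG_CATCH needed for the "not found" case. */
static void
resolve_entry(ResolvePreparedForeignTransaction_function resolve,
              Oid serverid, Oid userid, const char *fdwxact_id, bool commit)
{
    if (resolve(serverid, userid, fdwxact_id, commit) == FDWXACT_NOT_FOUND &&
        commit)
        ereport(WARNING,
                (errmsg("prepared foreign transaction for server %u not found",
                        serverid)));
    /* in the rollback case a missing prepared transaction is expected */
}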
Thank you for the comment. On Fri, May 11, 2018 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I might be missing your point. As for API breaking, this patch doesn't >> break any existing FDWs. All new APIs I proposed are dedicated to 2PC. >> In other words, FDWs that work today can continue working after this >> patch gets committed, but if FDWs want to support atomic commit then >> they should be updated to use new APIs. The reason why the calling of >> FdwXactRegisterForeignServer is necessary is that the core code >> controls the foreign servers that involved with the transaction but >> the whether each foreign server uses 2PC option (two_phase_commit) is >> known only on FDW code. We can eliminate the necessity of calling >> FdwXactRegisterForeignServer by moving 2PC option from fdw-level to >> server-level in order not to enforce calling the registering function >> on FDWs. If we did that, the user can use the 2PC option as a >> server-level option. > > Well, FdwXactRegisterForeignServer has a "bool two_phase_commit" > argument. If it only needs to be called by FDWs that support 2PC, > then that argument is unnecessary. If it needs to be called by all > FDWs, then it breaks existing FDWs that don't call it. > I understood now. For now since FdwXactRegisterForeignServer needs to be called by only FDWs that support 2PC, I will remove the argument. >>>> But this is now responsible by FDW. I should change it to resolver >>>> side. That is, FDW can raise error in ordinarily way but core code >>>> should catch and process it. >>> >>> I don't understand exactly what you mean here. >> >> Hmm I think I got confused. My understanding is that logging a >> complaint in commit case and not doing that in abort case if prepared >> transaction doesn't exist is a core code's job. An API contract here >> is that FDW raise an error with ERRCODE_UNDEFINED_OBJECT error code if >> there is no such transaction. Since it's an expected case in abort >> case for the fdwxact manager, the core code can regard the error as >> not actual problem. > > In general, it's not safe to catch an error and continue unless you > protect the code that throws the error by a sub-transaction. That > means we shouldn't expect the FDW to throw an error when the prepared > transaction isn't found and then just have the core code ignore the > error. Instead the FDW should return normally and, if the core code > needs to know whether the prepared transaction was found, then the FDW > should indicate this through a return value, not an ERROR. > >> Or do you mean that FDWs should not raise an error if there is the >> prepared transaction, and then core code doesn't need to check >> sqlstate in case of error? > > Right. As noted above, that's unsafe, so we shouldn't do it. Thank you. I will think the API contract again based on your suggestion. > >>>> You're right. Perhaps we can deal with it by PrescanFdwXacts until >>>> reach consistent point, and then have vac_update_datfrozenxid check >>>> local xids of un-resolved fdwxact to determine the new datfrozenxid. >>>> Since the local xids of un-resolved fdwxacts would not be relevant >>>> with vacuuming, we don't need to include it to snapshot and >>>> GetOldestXmin etc. Also we hint to resolve fdwxact when near >>>> wraparound. >>> >>> I agree with you about snapshots, but I'm not sure about GetOldestXmin. 
>> >> Hmm, although I've thought concern in case where we don't consider >> local xids of un-resolved fdwxact in GetOldestXmin, I could not find >> problem. Could you share your concern if you have? I'll try to find a >> possibility based on it. > > I don't remember exactly what I was thinking when I wrote that, but I > think the point is that GetOldestXmin() does a bunch of things other > than control the threshold for VACUUM, and we'd need to study them all > and look for problems. For example, it won't do for an XID to get > reused while it's still associated with an unresolved fdwxact. We > therefore certainly need such XIDs to hold back the cluster-wide > threshold for clog truncation in some manner -- and maybe that > involves GetOldestXmin(). Or maybe not. But anyway the point, > broadly considered, is that GetOldestXmin() is used in various ways, > and I don't know if we've thought through all of the consequences in > regard to this new feature. Okay, I'll have GetOldestXmin() consider the oldest local xid of un-resolved fdwxact as well in the next version patch for more safety, while considering more efficient ways. > > Can I ask what your time frame is for updating this patch? > Considering how much work appears to remain, if you want to get this > committed to v12, it would be best to get started as early as > possible. > I'll post an updated patch by PGCon at the latest, hopefully in the next week. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
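One way to picture the extra safety mentioned above is to clamp vacuum's cutoff by the oldest local XID still referenced by an unresolved fdwxact, much as replication slots hold back xmin. This is a hypothetical sketch: FdwXactGetOldestLocalXid() does not exist; only GetOldestXmin() and the transam helpers are real.

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"
#include "utils/rel.h"

extern TransactionId FdwXactGetOldestLocalXid(void);    /* hypothetical */

static TransactionId
GetVacuumCutoffXid(Relation rel)
{
    TransactionId cutoff = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
    TransactionId fdwxact_xmin = FdwXactGetOldestLocalXid();

    if (TransactionIdIsValid(fdwxact_xmin) &&
        TransactionIdPrecedes(fdwxact_xmin, cutoff))
        cutoff = fdwxact_xmin;  /* keep clog around for unresolved fdwxacts */

    return cutoff;
}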
On Fri, May 11, 2018 at 9:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for the comment.
>
> On Fri, May 11, 2018 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> I might be missing your point. As for API breakage, this patch doesn't break any existing FDWs. All the new APIs I proposed are dedicated to 2PC. In other words, FDWs that work today can continue working after this patch gets committed, but if an FDW wants to support atomic commit then it should be updated to use the new APIs. The reason the call to FdwXactRegisterForeignServer is necessary is that the core code controls the foreign servers involved in the transaction, but whether each foreign server uses the 2PC option (two_phase_commit) is known only to the FDW code. We can eliminate the need to call FdwXactRegisterForeignServer by moving the 2PC option from FDW level to server level so as not to force FDWs to call the registration function. If we did that, the user could use the 2PC option as a server-level option.
>>
>> Well, FdwXactRegisterForeignServer has a "bool two_phase_commit" argument. If it only needs to be called by FDWs that support 2PC, then that argument is unnecessary. If it needs to be called by all FDWs, then it breaks existing FDWs that don't call it.
>
> I understand now. For now, since FdwXactRegisterForeignServer needs to be called only by FDWs that support 2PC, I will remove the argument.

Regarding the API design, should we use 2PC for a distributed transaction if both two or more 2PC-capable foreign servers and a 2PC-non-capable foreign server are involved in it? Or should we end up with an error? The 2PC-non-capable server might be one that has 2PC functionality but just disables it, or one that doesn't have it at all. If we use 2PC anyway, transaction atomicity will be satisfied only among the 2PC-capable servers, which might be just a subset of the participants. If we don't, we use 1PC instead in such cases, so that whenever 2PC is used, transaction atomicity is satisfied among all participants of the transaction. The current patch takes the former (though it doesn't allow the PREPARE case), but I think we could also take the latter, because it makes little sense to the user if the transaction commits atomically among only some of the participants.

Also, regardless of which way we take, I think it would be better to manage not only 2PC transactions but also non-2PC transactions in the core, and to add a two_phase_commit argument. I think we can do that without breaking existing FDWs. Currently FDWs manage transactions using XactCallbacks, but the new APIs being added also manage transactions. I think it might be better if users use either one way or the other (XactCallbacks or the new APIs) for transaction management rather than a combination of both. Otherwise two bodies of transaction-management code will be required: code that manages foreign transactions using XactCallbacks for non-2PC transactions, and code that manages them using the new APIs for 2PC transactions. That would not be easy for FDW developers. So what I imagine for the new APIs is that if FDW developers use the new APIs they can use both 2PC and non-2PC transactions, but if they use XactCallbacks they can use only non-2PC transactions. Any thoughts?

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
RE: [HACKERS] Transactions involving multiple postgres foreign servers
From: "Tsunakawa, Takayuki"
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
> Regarding the API design, should we use 2PC for a distributed transaction if both two or more 2PC-capable foreign servers and a 2PC-non-capable foreign server are involved in it? Or should we end up with an error? The 2PC-non-capable server might be one that has 2PC functionality but just disables it, or one that doesn't have it at all.
>
> but I think we could also take the latter, because it makes little sense to the user if the transaction commits atomically among only some of the participants.

I'm for the latter. That is, a COMMIT or PREPARE TRANSACTION statement issued from an application reports an error. DBMS, particularly relational DBMS (and even more particularly Postgres?), place high value on data correctness. So I think transaction atomicity should be preserved, at least by default. If we preferred updatability and performance to data correctness, why don't we change the default value of synchronous_commit to off in favor of performance? On the other hand, if we want to allow 1PC commit when not all FDWs support 2PC, we can add a new GUC parameter like "allow_nonatomic_commit = on", just as synchronous_commit and fsync trade off data correctness against performance.

> Also, regardless of which way we take, I think it would be better to manage not only 2PC transactions but also non-2PC transactions in the core, and to add a two_phase_commit argument. I think we can do that without breaking existing FDWs. Currently FDWs manage transactions using XactCallbacks, but the new APIs being added also manage transactions. I think it might be better if users use either one way or the other (XactCallbacks or the new APIs) for transaction management rather than a combination of both. Otherwise two bodies of transaction-management code will be required: code that manages foreign transactions using XactCallbacks for non-2PC transactions, and code that manages them using the new APIs for 2PC transactions. That would not be easy for FDW developers. So what I imagine for the new APIs is that if FDW developers use the new APIs they can use both 2PC and non-2PC transactions, but if they use XactCallbacks they can use only non-2PC transactions. Any thoughts?

If we add new functions, can't we just add functions whose names are straightforward, like PrepareTransaction() and CommitTransaction()? FDWs without 2PC support return NULL for the function pointer of PrepareTransaction(). This is similar to XA: XA requires each RM to provide function pointers for xa_prepare() and xa_commit(). If we go this way, maybe we could leverage the artifacts of postgres_fdw to create an XA library for C/C++. I mean we put transaction control functions in the XA library, and postgres_fdw also uses it. i.e.:

postgres_fdw.so -> libxa.so -> libpq.so
       \-------------/

Regards
Takayuki Tsunakawa
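Tsunakawa-san's function-pointer idea might look roughly like this in C, with a NULL PrepareTransaction pointer meaning "no 2PC support"; the FdwXactRoutine shape and names here are illustrative, not an existing structure.

#include "postgres.h"

typedef struct FdwXactRoutine
{
    /* NULL when the FDW cannot prepare transactions */
    void        (*PrepareTransaction) (Oid serverid, Oid userid,
                                       const char *fdwxact_id);
    void        (*CommitTransaction) (Oid serverid, Oid userid,
                                      const char *fdwxact_id, bool prepared);
} FdwXactRoutine;

static void
prepare_foreign_xact(const FdwXactRoutine *routine, const char *servername,
                     Oid serverid, Oid userid, const char *fdwxact_id)
{
    if (routine->PrepareTransaction == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("foreign server \"%s\" does not support two-phase commit",
                        servername)));

    routine->PrepareTransaction(serverid, userid, fdwxact_id);
}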
On Fri, May 11, 2018 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> I might be missing your point. As for API breaking, this patch doesn't >> break any existing FDWs. All new APIs I proposed are dedicated to 2PC. >> In other words, FDWs that work today can continue working after this >> patch gets committed, but if FDWs want to support atomic commit then >> they should be updated to use new APIs. The reason why the calling of >> FdwXactRegisterForeignServer is necessary is that the core code >> controls the foreign servers that involved with the transaction but >> the whether each foreign server uses 2PC option (two_phase_commit) is >> known only on FDW code. We can eliminate the necessity of calling >> FdwXactRegisterForeignServer by moving 2PC option from fdw-level to >> server-level in order not to enforce calling the registering function >> on FDWs. If we did that, the user can use the 2PC option as a >> server-level option. > > Well, FdwXactRegisterForeignServer has a "bool two_phase_commit" > argument. If it only needs to be called by FDWs that support 2PC, > then that argument is unnecessary. An FDW may support 2PC but not a foreign server created using that FDW. Without that argument all the foreign servers created using a given FDW will need to support 2PC which may not be possible. > If it needs to be called by all > FDWs, then it breaks existing FDWs that don't call it. > That's true. By default FDWs, which do not want to use this facility, can just pass false without any need for further change. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, May 21, 2018 at 10:42 AM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
>> Regarding the API design, should we use 2PC for a distributed transaction if both two or more 2PC-capable foreign servers and a 2PC-non-capable foreign server are involved in it? Or should we end up with an error? The 2PC-non-capable server might be one that has 2PC functionality but just disables it, or one that doesn't have it at all.
>
>> but I think we could also take the latter, because it makes little sense to the user if the transaction commits atomically among only some of the participants.
>
> I'm for the latter. That is, a COMMIT or PREPARE TRANSACTION statement issued from an application reports an error.

I'm not sure that we should end up with an error in such a case, but if we want to, we can raise an error when the transaction tries to modify a 2PC-non-capable server after it has modified a 2PC-capable server.

> DBMS, particularly relational DBMS (and even more particularly Postgres?), place high value on data correctness. So I think transaction atomicity should be preserved, at least by default. If we preferred updatability and performance to data correctness, why don't we change the default value of synchronous_commit to off in favor of performance? On the other hand, if we want to allow 1PC commit when not all FDWs support 2PC, we can add a new GUC parameter like "allow_nonatomic_commit = on", just as synchronous_commit and fsync trade off data correctness against performance.

Honestly, I'm not sure we should use atomic commit by default at this point, because that would change the default behavior for existing users who run without 2PC. But I think controlling global transaction atomicity by a GUC parameter would be a good idea. For example, synchronous_commit = 'global' would make backends wait for the transaction to be resolved globally before returning to the user.

>> Also, regardless of which way we take, I think it would be better to manage not only 2PC transactions but also non-2PC transactions in the core, and to add a two_phase_commit argument. I think we can do that without breaking existing FDWs. Currently FDWs manage transactions using XactCallbacks, but the new APIs being added also manage transactions. I think it might be better if users use either one way or the other (XactCallbacks or the new APIs) for transaction management rather than a combination of both. Otherwise two bodies of transaction-management code will be required: code that manages foreign transactions using XactCallbacks for non-2PC transactions, and code that manages them using the new APIs for 2PC transactions. That would not be easy for FDW developers. So what I imagine for the new APIs is that if FDW developers use the new APIs they can use both 2PC and non-2PC transactions, but if they use XactCallbacks they can use only non-2PC transactions. Any thoughts?
>
> If we add new functions, can't we just add functions whose names are straightforward, like PrepareTransaction() and CommitTransaction()? FDWs without 2PC support return NULL for the function pointer of PrepareTransaction().
>
> This is similar to XA: XA requires each RM to provide function pointers for xa_prepare() and xa_commit(). If we go this way, maybe we could leverage the artifacts of postgres_fdw to create an XA library for C/C++. I mean we put transaction control functions in the XA library, and postgres_fdw also uses it.
i.e.:
>
> postgres_fdw.so -> libxa.so -> libpq.so
>        \-------------/

I might not be understanding your comment correctly, but the current patch is implemented in that way. The patch introduces new FDW APIs: PrepareForeignTransaction, EndForeignTransaction, ResolvePreparedForeignTransaction and GetPrepareId. The postgres core calls each API at the appropriate time while managing each foreign transaction. FDWs that don't support 2PC set those function pointers to NULL.

Also, regarding the current API design, it might not fit databases other than PostgreSQL. For example, in MySQL we have to start an xa transaction explicitly using XA START, whereas PostgreSQL can prepare a transaction that was started by BEGIN TRANSACTION. So in MySQL a global transaction id is required at the beginning of the xa transaction. And we have to execute XA END before we prepare the transaction or commit it in one phase. So it would be better to define the APIs according to X/Open XA in order to make them more general.

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
RE: [HACKERS] Transactions involving multiple postgres foreign servers
From: "Tsunakawa, Takayuki"
From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]
>> I'm for the latter. That is, a COMMIT or PREPARE TRANSACTION statement issued from an application reports an error.
>
> I'm not sure that we should end up with an error in such a case, but if we want to, we can raise an error when the transaction tries to modify a 2PC-non-capable server after it has modified a 2PC-capable server.

Such early reporting would be better, but I wonder if we can handle the opposite order: updating data on a 2PC-capable server after a 2PC-non-capable server. If that's not easy or efficient, I think it's enough to report the error at COMMIT and PREPARE TRANSACTION, just like we report "ERROR: cannot PREPARE a transaction that has operated on temporary tables" at PREPARE TRANSACTION.

> Honestly, I'm not sure we should use atomic commit by default at this point, because that would change the default behavior for existing users who run without 2PC. But I think controlling global transaction atomicity by a GUC parameter would be a good idea. For example, synchronous_commit = 'global' would make backends wait for the transaction to be resolved globally before returning to the user.

Regarding the incompatibility of default behavior, Postgres has a history of pursuing correctness and less confusion, such as the handling of backslash characters in strings and the automatic type casts below.

Non-character data types are no longer automatically cast to TEXT (Peter, Tom)
Previously, if a non-character value was supplied to an operator or function that requires text input, it was automatically cast to text, for most (though not all) built-in data types. This no longer happens: an explicit cast to text is now required for all non-character-string types. ... The reason for the change is that these automatic casts too often caused surprising behavior.

> I might not be understanding your comment correctly, but the current patch is implemented in that way. The patch introduces new FDW APIs: PrepareForeignTransaction, EndForeignTransaction, ResolvePreparedForeignTransaction and GetPrepareId. The postgres core calls each API at the appropriate time while managing each foreign transaction. FDWs that don't support 2PC set those function pointers to NULL.

Ouch, you are right.

> Also, regarding the current API design, it might not fit databases other than PostgreSQL. For example, in MySQL we have to start an xa transaction explicitly using XA START, whereas PostgreSQL can prepare a transaction that was started by BEGIN TRANSACTION. So in MySQL a global transaction id is required at the beginning of the xa transaction. And we have to execute XA END before we prepare the transaction or commit it in one phase. So it would be better to define the APIs according to X/Open XA in order to make them more general.

I thought of:
* Put the functions that implement xa_prepare(), xa_commit() and xa_rollback() in libxa.so or libpq.so.
* PrepareForeignTransaction and EndForeignTransaction for postgres_fdw call them.

I meant just code reuse for Postgres. But this is my simple intuition, so don't mind.

Regards
Takayuki Tsunakawa
On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Regarding the API design, should we use 2PC for a distributed transaction if both two or more 2PC-capable foreign servers and a 2PC-non-capable foreign server are involved in it? Or should we end up with an error? The 2PC-non-capable server might be one that has 2PC functionality but just disables it, or one that doesn't have it at all.

It seems to me that this is functionality that many people will not want to use. First, doing a PREPARE and then a COMMIT for each FDW write transaction is bound to be more expensive than just doing a COMMIT. Second, because the default value of max_prepared_transactions is 0, this can only work at all if special configuration has been done on the remote side. Because of the second point in particular, it seems to me that the default for this new feature must be "off". It would make no sense to ship a default configuration of PostgreSQL that doesn't work with the default configuration of postgres_fdw, and I do not think we want to change the default value of max_prepared_transactions. It was changed from 5 to 0 a number of years back for good reason.

So, I think the question could be broadened a bit: how do you enable this feature if you want it, and what happens if you want it but it's not available for your choice of FDW? One possible enabling method is a GUC (e.g. foreign_twophase_commit). It could be true/false, with true meaning use PREPARE for all FDW writes and fail if that's not supported, or it could be three-valued, like require/prefer/disable, with require throwing an error if PREPARE support is not available and prefer using PREPARE where available but without failing when it isn't available. Another possibility could be to make it an FDW option, possibly capable of being set at multiple levels (e.g. server or foreign table). If any FDW involved in the transaction demands distributed 2PC semantics then the whole transaction must have those semantics or it fails. I was previously leaning toward the latter approach, but I guess now the former approach is sounding better. I'm not totally certain I know what's best here.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
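The three-valued GUC Robert floats would presumably be wired up like any other enum GUC; below is a sketch under that assumption. The GUC name, the enum and the registration site are illustrative; only config_enum_entry and DefineCustomEnumVariable are existing PostgreSQL facilities.

#include "postgres.h"
#include "utils/guc.h"

typedef enum
{
    FOREIGN_TWOPHASE_COMMIT_DISABLED,   /* never use PREPARE on foreign servers */
    FOREIGN_TWOPHASE_COMMIT_PREFER,     /* use PREPARE where supported */
    FOREIGN_TWOPHASE_COMMIT_REQUIRED    /* use PREPARE or raise an ERROR */
} ForeignTwophaseCommitLevel;

static int  foreign_twophase_commit = FOREIGN_TWOPHASE_COMMIT_DISABLED;

static const struct config_enum_entry foreign_twophase_commit_options[] = {
    {"disabled", FOREIGN_TWOPHASE_COMMIT_DISABLED, false},
    {"prefer", FOREIGN_TWOPHASE_COMMIT_PREFER, false},
    {"required", FOREIGN_TWOPHASE_COMMIT_REQUIRED, false},
    {NULL, 0, false}
};

static void
define_foreign_twophase_commit_guc(void)
{
    DefineCustomEnumVariable("foreign_twophase_commit",
                             "Uses two-phase commit for transactions on foreign servers.",
                             NULL,
                             &foreign_twophase_commit,
                             FOREIGN_TWOPHASE_COMMIT_DISABLED,
                             foreign_twophase_commit_options,
                             PGC_USERSET,
                             0, NULL, NULL, NULL);
}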
On Sat, May 26, 2018 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Regarding the API design, should we use 2PC for a distributed transaction if both two or more 2PC-capable foreign servers and a 2PC-non-capable foreign server are involved in it? Or should we end up with an error? The 2PC-non-capable server might be one that has 2PC functionality but just disables it, or one that doesn't have it at all.
>
> It seems to me that this is functionality that many people will not want to use. First, doing a PREPARE and then a COMMIT for each FDW write transaction is bound to be more expensive than just doing a COMMIT. Second, because the default value of max_prepared_transactions is 0, this can only work at all if special configuration has been done on the remote side. Because of the second point in particular, it seems to me that the default for this new feature must be "off". It would make no sense to ship a default configuration of PostgreSQL that doesn't work with the default configuration of postgres_fdw, and I do not think we want to change the default value of max_prepared_transactions. It was changed from 5 to 0 a number of years back for good reason.

I'm not sure that many people will not want to use this feature; it seems to me that there are many people who don't want to use a database that lacks transaction atomicity. But I agree that this feature should not be enabled by default, just as we disable 2PC by default.

> So, I think the question could be broadened a bit: how do you enable this feature if you want it, and what happens if you want it but it's not available for your choice of FDW? One possible enabling method is a GUC (e.g. foreign_twophase_commit). It could be true/false, with true meaning use PREPARE for all FDW writes and fail if that's not supported, or it could be three-valued, like require/prefer/disable, with require throwing an error if PREPARE support is not available and prefer using PREPARE where available but without failing when it isn't available. Another possibility could be to make it an FDW option, possibly capable of being set at multiple levels (e.g. server or foreign table). If any FDW involved in the transaction demands distributed 2PC semantics then the whole transaction must have those semantics or it fails. I was previously leaning toward the latter approach, but I guess now the former approach is sounding better. I'm not totally certain I know what's best here.

I agree that the former is better. That way, we can also control the parameter at transaction level. If we allowed the 'prefer' behavior, we would need to track not only 2PC-capable foreign servers but also 2PC-non-capable foreign servers, which would require all FDWs to call the registration function. So I think a two-valued parameter would be better.

BTW, sorry for being late in submitting the updated patch. I'll post the updated patch this week, but I'd like to share the new API design beforehand. The APIs I'd like to add are four functions and one registration function: PrepareForeignTransaction, CommitForeignTransaction, RollbackForeignTransaction, IsTwoPhaseCommitEnabled and FdwXactRegisterForeignServer. All FDWs that want to support atomic commit have to support all of these APIs and to call the registration function when a foreign transaction opens. The transaction processing sequence with atomic commit will be as follows.

1. FDW begins a transaction on a 2PC-capable foreign server.
2. FDW registers the foreign server with or without a foreign transaction identifier by calling FdwXactRegisterForeignServer().
   * The foreign transaction identifier passed in can be NULL. If it's NULL, the core code constructs one like 'fx_<4 random chars>_<serverid>_<userid>'.
   * Providing the foreign transaction identifier at the beginning of the transaction is useful because some DBMSs such as Oracle Database or MySQL require a transaction identifier at the beginning of an XA transaction.
   * Registering the foreign transaction guarantees that the transaction is controlled by the core and that the APIs are called at the appropriate times.
3. Perform 1 and 2 whenever the distributed transaction opens a transaction on a 2PC-capable foreign server.
   * When the distributed transaction modifies a foreign server, we mark it as 'modified'. This mark is used at commit time to check whether it's necessary to use 2PC.
   * At the same time, we also check whether the foreign server enables 2PC by calling IsTwoPhaseCommitEnabled(). If an FDW disables 2PC or doesn't provide that function, we set XACT_FLAGS_FDWNOPREPARE, because we need to remember that we wrote to a 2PC-non-capable foreign server.
   * When the distributed transaction modifies a temp table locally, we set XACT_FLAGS_WROTENONTEMREL. This is likewise used at commit time to check whether it's necessary to use 2PC.
4. During pre-commit, we prepare all foreign transactions, if 2PC is required, by calling PrepareForeignTransaction().
   * If we don't need to use 2PC, we commit all foreign transactions by calling CommitForeignTransaction() with 'prepared' == false.
   * If the transaction raises an error during or before pre-commit for whatever reason, we roll the foreign transactions back by calling RollbackForeignTransaction(). In the rollback case we may call RollbackForeignTransaction() with 'prepared' == true even though the corresponding foreign transaction might not exist. This is an API contract.
5. Local commit.
6. Launch a foreign transaction resolver process and wait for it to resolve all foreign transactions.
   * The foreign transactions are resolved according to the status of the local transaction by calling CommitForeignTransaction() or RollbackForeignTransaction() with 'prepared' == true.
7. After resolving all foreign transactions, the resolver process wakes up the waiting backend process.

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
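To summarize the proposal, here is a condensed and entirely illustrative C sketch of the pre-commit step (4) in the sequence above, using the callback names listed in the mail; the participant structure and the driver are guesses at the shape of the patch, not its actual code.

#include "postgres.h"
#include "nodes/pg_list.h"

/* Callback set named in the mail; signatures are illustrative guesses. */
typedef struct FdwXactRoutine
{
    void        (*PrepareForeignTransaction) (Oid serverid, Oid userid,
                                              const char *fdwxact_id);
    void        (*CommitForeignTransaction) (Oid serverid, Oid userid,
                                             const char *fdwxact_id,
                                             bool prepared);
    void        (*RollbackForeignTransaction) (Oid serverid, Oid userid,
                                               const char *fdwxact_id,
                                               bool prepared);
    bool        (*IsTwoPhaseCommitEnabled) (Oid serverid);
} FdwXactRoutine;

/* Illustrative per-server entry, collected by FdwXactRegisterForeignServer(). */
typedef struct FdwXactParticipant
{
    Oid         serverid;
    Oid         userid;
    char       *fdwxact_id;         /* assigned at registration time */
    FdwXactRoutine *routine;
} FdwXactParticipant;

static List *MyFdwXactParticipants = NIL;   /* filled during the transaction */

/* Sketch of step 4: prepare (or directly commit) every participant. */
static void
AtPreCommit_FdwXacts(bool need_twophase)
{
    ListCell   *lc;

    foreach(lc, MyFdwXactParticipants)
    {
        FdwXactParticipant *p = (FdwXactParticipant *) lfirst(lc);

        if (!need_twophase)
        {
            /* single phase: commit directly, 'prepared' == false */
            p->routine->CommitForeignTransaction(p->serverid, p->userid,
                                                 p->fdwxact_id, false);
        }
        else
        {
            /* any ERROR thrown here switches us onto the abort path */
            p->routine->PrepareForeignTransaction(p->serverid, p->userid,
                                                  p->fdwxact_id);
        }
    }

    /*
     * Steps 5-7 happen elsewhere: the local transaction commits, a resolver
     * process calls CommitForeignTransaction()/RollbackForeignTransaction()
     * with 'prepared' == true for each entry, then wakes this backend.
     */
}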