Thread: Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Jun 5, 2018 at 7:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, May 26, 2018 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Regarding the API design, should we use 2PC for a distributed
>>> transaction if both 2PC-capable and 2PC-non-capable foreign
>>> servers are involved in it?  Or should we end up with an error?
>>> A 2PC-non-capable server might be one that has 2PC functionality
>>> but disables it, or one that doesn't have it at all.
>>
>> It seems to me that this is functionality that many people will not
>> want to use.  First, doing a PREPARE and then a COMMIT for each FDW
>> write transaction is bound to be more expensive than just doing a
>> COMMIT.  Second, because the default value of
>> max_prepared_transactions is 0, this can only work at all if special
>> configuration has been done on the remote side.  Because of the second
>> point in particular, it seems to me that the default for this new
>> feature must be "off".  It would make no sense to ship a default configuration
>> of PostgreSQL that doesn't work with the default configuration of
>> postgres_fdw, and I do not think we want to change the default value
>> of max_prepared_transactions.  It was changed from 5 to 0 a number of
>> years back for good reason.
>
> I'm not sure that many people will not want to use this feature,
> because it seems to me that there are many people who don't want to
> use a database that lacks transaction atomicity. But I agree
> that this feature should not be enabled by default, just as we
> disable 2PC by default.
>
>>
>> So, I think the question could be broadened a bit: how do you enable this
>> feature if you want it, and what happens if you want it but it's not
>> available for your choice of FDW?  One possible enabling method is a
>> GUC (e.g. foreign_twophase_commit).  It could be true/false, with true
>> meaning use PREPARE for all FDW writes and fail if that's not
>> supported, or it could be three-valued, like require/prefer/disable,
>> with require throwing an error if PREPARE support is not available and
>> prefer using PREPARE where available but without failing when it isn't
>> available.  Another possibility could be to make it an FDW option,
>> possibly capable of being set at multiple levels (e.g. server or
>> foreign table).  If any FDW involved in the transaction demands
>> distributed 2PC semantics then the whole transaction must have those
>> semantics or it fails.  I was previously leaning toward the latter
>> approach, but I guess now the former approach is sounding better.  I'm
>> not totally certain I know what's best here.
>>
>
> I agree that the former is better. That way, we can also control the
> parameter at the transaction level. If we allowed the 'prefer'
> behavior, we would need to manage not only 2PC-capable foreign
> servers but also 2PC-non-capable ones, which would require all FDWs
> to call the registration function. So I think a two-valued parameter
> would be better.
>
> BTW, sorry for being late in submitting the updated patch. I'll post
> the updated patch this week, but I'd like to share the new API design
> beforehand.

Attached updated patches.

I've changed the new APIs to five functions plus one registration
function, because having the rollback API callable by both the backend
process and the resolver process was not a good design. The latest
version of the patches incorporates all the comments I got, except for
user-facing documentation of the overall behavior. I'm still
considering what to document there; I'll write it while the code
patches are being reviewed. The basic design of the new patches is
almost the same as in the previous mail I sent.

I introduced five new FDW APIs: PrepareForeignTransaction,
CommitForeignTransaction, RollbackForeignTransaction,
ResolveForeignTransaction and IsTwophaseCommitEnabled.
ResolveForeignTransaction is normally called by the resolver process,
whereas the other four functions are called by the backend process. I
also introduced a registration function,
FdwXactRegisterForeignTransaction. An FDW that wishes to support
atomic commit is required to call this function when a transaction is
opened on the foreign server. Registered foreign transactions are
controlled by the foreign transaction manager in Postgres core, which
calls the APIs at the appropriate times. This means the foreign
transaction manager controls only foreign servers that are capable of
2PC; for 2PC-non-capable foreign servers, the FDW must use a
XactCallback to control the foreign transaction. 2PC is used at commit
when the distributed transaction has modified data on two or more
servers, including the local server, and the user has requested it via
the foreign_twophase_commit GUC parameter. All foreign transactions
are prepared during pre-commit, and then the transaction commits
locally. After committing locally, the backend waits for a resolver
process to resolve all prepared foreign transactions. The waiting
backend is released (that is, the prompt is returned to the client)
either when all foreign transactions are resolved or when the user
cancels the wait. If 2PC is not required, a foreign transaction is
committed during the pre-commit phase of the local transaction.
IsTwophaseCommitEnabled is called whenever the transaction begins to
modify data on a foreign server; this is required to track whether the
transaction has modified data on a foreign server that doesn't support
2PC or has it disabled.
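
For a postgres_fdw participant, the flow above maps onto PostgreSQL's
existing two-phase commit commands. The following is only a hedged
sketch; the GID shown is an illustrative placeholder, not the
identifier scheme the patch actually generates:

```sql
-- During the local pre-commit phase, PrepareForeignTransaction issues a
-- PREPARE on each registered, 2PC-capable foreign server:
PREPARE TRANSACTION 'fdw_xact_example_gid';

-- The coordinator then commits locally.  Afterwards, the resolver process
-- finishes each prepared participant via ResolveForeignTransaction:
COMMIT PREPARED 'fdw_xact_example_gid';

-- If the local transaction had aborted instead:
ROLLBACK PREPARED 'fdw_xact_example_gid';
```

Note that PREPARE TRANSACTION succeeds on the foreign server only if
its max_prepared_transactions is set above zero.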

Atomic commit among multiple foreign servers is crash-safe. If the
coordinator server crashes during an atomic commit, the foreign
transaction participants and their statuses are recovered during WAL
apply. Recovered foreign transactions are in-doubt (a.k.a. dangling)
transactions; if the database has such transactions, a resolver
process periodically tries to resolve them.

I'll register this patch in the next CF. Feedback is very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, Jun 11, 2018 at 1:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, Jun 5, 2018 at 7:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, May 26, 2018 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> Regarding the API design, should we use 2PC for a distributed
>>>> transaction if both 2PC-capable and 2PC-non-capable foreign
>>>> servers are involved in it?  Or should we end up with an error?
>>>> A 2PC-non-capable server might be one that has 2PC functionality
>>>> but disables it, or one that doesn't have it at all.
>>>
>>> It seems to me that this is functionality that many people will not
>>> want to use.  First, doing a PREPARE and then a COMMIT for each FDW
>>> write transaction is bound to be more expensive than just doing a
>>> COMMIT.  Second, because the default value of
>>> max_prepared_transactions is 0, this can only work at all if special
>>> configuration has been done on the remote side.  Because of the second
>>> point in particular, it seems to me that the default for this new
>>> feature must be "off".  It would make no sense to ship a default configuration
>>> of PostgreSQL that doesn't work with the default configuration of
>>> postgres_fdw, and I do not think we want to change the default value
>>> of max_prepared_transactions.  It was changed from 5 to 0 a number of
>>> years back for good reason.
>>
>> I'm not sure that many people will not want to use this feature,
>> because it seems to me that there are many people who don't want to
>> use a database that lacks transaction atomicity. But I agree
>> that this feature should not be enabled by default, just as we
>> disable 2PC by default.
>>
>>>
>>> So, I think the question could be broadened a bit: how do you enable this
>>> feature if you want it, and what happens if you want it but it's not
>>> available for your choice of FDW?  One possible enabling method is a
>>> GUC (e.g. foreign_twophase_commit).  It could be true/false, with true
>>> meaning use PREPARE for all FDW writes and fail if that's not
>>> supported, or it could be three-valued, like require/prefer/disable,
>>> with require throwing an error if PREPARE support is not available and
>>> prefer using PREPARE where available but without failing when it isn't
>>> available.  Another possibility could be to make it an FDW option,
>>> possibly capable of being set at multiple levels (e.g. server or
>>> foreign table).  If any FDW involved in the transaction demands
>>> distributed 2PC semantics then the whole transaction must have those
>>> semantics or it fails.  I was previously leaning toward the latter
>>> approach, but I guess now the former approach is sounding better.  I'm
>>> not totally certain I know what's best here.
>>>
>>
>> I agree that the former is better. That way, we can also control the
>> parameter at the transaction level. If we allowed the 'prefer'
>> behavior, we would need to manage not only 2PC-capable foreign
>> servers but also 2PC-non-capable ones, which would require all FDWs
>> to call the registration function. So I think a two-valued parameter
>> would be better.
>>
>> BTW, sorry for being late in submitting the updated patch. I'll post
>> the updated patch this week, but I'd like to share the new API design
>> beforehand.
>
> Attached updated patches.
>
> I've changed the new APIs to five functions plus one registration
> function, because having the rollback API callable by both the backend
> process and the resolver process was not a good design. The latest
> version of the patches incorporates all the comments I got, except for
> user-facing documentation of the overall behavior. I'm still
> considering what to document there; I'll write it while the code
> patches are being reviewed. The basic design of the new patches is
> almost the same as in the previous mail I sent.
>
> I introduced five new FDW APIs: PrepareForeignTransaction,
> CommitForeignTransaction, RollbackForeignTransaction,
> ResolveForeignTransaction and IsTwophaseCommitEnabled.
> ResolveForeignTransaction is normally called by the resolver process,
> whereas the other four functions are called by the backend process. I
> also introduced a registration function,
> FdwXactRegisterForeignTransaction. An FDW that wishes to support
> atomic commit is required to call this function when a transaction is
> opened on the foreign server. Registered foreign transactions are
> controlled by the foreign transaction manager in Postgres core, which
> calls the APIs at the appropriate times. This means the foreign
> transaction manager controls only foreign servers that are capable of
> 2PC; for 2PC-non-capable foreign servers, the FDW must use a
> XactCallback to control the foreign transaction. 2PC is used at commit
> when the distributed transaction has modified data on two or more
> servers, including the local server, and the user has requested it via
> the foreign_twophase_commit GUC parameter. All foreign transactions
> are prepared during pre-commit, and then the transaction commits
> locally. After committing locally, the backend waits for a resolver
> process to resolve all prepared foreign transactions. The waiting
> backend is released (that is, the prompt is returned to the client)
> either when all foreign transactions are resolved or when the user
> cancels the wait. If 2PC is not required, a foreign transaction is
> committed during the pre-commit phase of the local transaction.
> IsTwophaseCommitEnabled is called whenever the transaction begins to
> modify data on a foreign server; this is required to track whether the
> transaction has modified data on a foreign server that doesn't support
> 2PC or has it disabled.
>
> Atomic commit among multiple foreign servers is crash-safe. If the
> coordinator server crashes during an atomic commit, the foreign
> transaction participants and their statuses are recovered during WAL
> apply. Recovered foreign transactions are in-doubt (a.k.a. dangling)
> transactions; if the database has such transactions, a resolver
> process periodically tries to resolve them.
>
> I'll register this patch in the next CF. Feedback is very welcome.
>

I attached the updated version patch as the previous versions conflict
with the current HEAD.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Michael Paquier
Date:
On Fri, Aug 03, 2018 at 05:52:24PM +0900, Masahiko Sawada wrote:
> I attached the updated version patch as the previous versions conflict
> with the current HEAD.

Please note that the latest patch set does not apply anymore, so this
patch is moved to next CF, waiting on author.
--
Michael

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Oct 2, 2018 at 3:10 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 03, 2018 at 05:52:24PM +0900, Masahiko Sawada wrote:
> > I attached the updated version patch as the previous versions conflict
> > with the current HEAD.
>
> Please note that the latest patch set does not apply anymore, so this
> patch is moved to next CF, waiting on author.

Thank you! Attached the latest version patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Chris Travers
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, failed

I am hoping I am not out of order in writing this before the commitfest starts.  The patch is big and long, and so I
wanted to start on this while traffic is slow.
 

I find this patch quite welcome and very close to a minimum viable version.  The few significant limitations can be
resolved later.  One thing I may have missed in the documentation is a discussion of the limits of the current approach.
I think this would be important to document because the caveats of the current approach are significant, but the people
who need it will have the knowledge to work with issues if they come up.
 

The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes through
global transactions sequentially and does not move on to the next until the previous one is resolved.  This means that
if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A with
foreign servers C and D, and server B goes down at the wrong moment, it does not look like the background worker will
detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until
this is resolved.  Also, if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad
transaction to allow processing to continue, plus manual resolution (this is good and necessary because in this area
there is no ability to have perfect recoverability without occasional administrative action).  I would really like to
see more documentation of failure cases and the appropriate administrative action.  Otherwise this is, I think, a
minimum viable addition, and I think we want it.
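
For illustration, the manual-resolution path mentioned above might look roughly like this from an administrator session.  Both identifiers below (pg_foreign_xacts, pg_remove_foreign_xact) are hypothetical stand-ins I am using for discussion, not names confirmed from the patch:

```sql
-- Hypothetical names for illustration only:
SELECT * FROM pg_foreign_xacts;   -- inspect in-doubt foreign transactions

-- After resolving the transaction by hand on the foreign server
-- (COMMIT PREPARED / ROLLBACK PREPARED there), remove the local entry
-- so the resolver can move on:
SELECT pg_remove_foreign_xact('<transaction id>', '<server oid>');
```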
 

It is possible I missed that in the documentation.  If so, my objection stands aside.  If it is welcome, I am happy to
take a first crack at such docs.
 

To my mind that's the only blocker in the code (but see below).  I can say without a doubt that I would expect we would
use this feature once available.
 

------------------

Testing however failed.

make installcheck-world fails with errors like the following:

 -- Modify foreign server and raise an error
  BEGIN;
  INSERT INTO ft7_twophase VALUES(8);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft8_twophase VALUES(NULL); -- violation
! ERROR:  current transaction is aborted, commands ignored until end of transaction block
  ROLLBACK;
  SELECT * FROM ft7_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  -- Rollback foreign transaction that involves both 2PC-capable
  -- and 2PC-non-capable foreign servers.
  BEGIN;
  INSERT INTO ft8_twophase VALUES(7);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft9_not_twophase VALUES(7);
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
  ROLLBACK;
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.

make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new
installation. I think the test cases will have to handle that sort of setup.
 

make check in the contrib directory passes.

For reasons of test failures, I am setting this back to waiting on author.

------------------
I had a few other thoughts that I figure are worth sharing with the community on this patch, with the idea that once it
is in place, this may open up more options for collaboration in the area of federated and distributed storage generally.
I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling
part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a
classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API).

The new status of this patch is: Waiting on Author

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Chris Travers
Date:


On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, failed

Also one really minor point:  I think this is a typo (maX vs max)?

(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0")));


 

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Chris Travers
Date:


On Wed, Oct 3, 2018 at 9:56 AM Chris Travers <chris.travers@adjust.com> wrote:


On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:

(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0")));


Two more critical notes here which I think are small blockers.

The error message above references a config variable that does not exist.

The correct name of the config parameter is max_foreign_transaction_resolvers

 Setting that along with the following to 10 caused the tests to pass, but again it fails on default configs:

max_prepared_foreign_transactions, max_prepared_transactions
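
Putting that together, the settings that made the suite pass for me were roughly as follows (a restart is required; the value 10 is simply what I tested with, not a recommendation):

```sql
-- In postgresql.conf, or via ALTER SYSTEM followed by a server restart:
ALTER SYSTEM SET max_prepared_transactions = 10;
ALTER SYSTEM SET max_prepared_foreign_transactions = 10;
ALTER SYSTEM SET max_foreign_transaction_resolvers = 10;
```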


 

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin




Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Chris Travers
Date:


On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, failed

I am hoping I am not out of order in writing this before the commitfest starts.  The patch is big and long, and so I wanted to start on this while traffic is slow.

I find this patch quite welcome and very close to a minimum viable version.  The few significant limitations can be resolved later.  One thing I may have missed in the documentation is a discussion of the limits of the current approach.  I think this would be important to document because the caveats of the current approach are significant, but the people who need it will have the knowledge to work with issues if they come up.

The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes through global transactions sequentially and does not move on to the next until the previous one is resolved.  This means that if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A with foreign servers C and D, and server B goes down at the wrong moment, it does not look like the background worker will detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until this is resolved.  Also, if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad transaction to allow processing to continue, plus manual resolution (this is good and necessary because in this area there is no ability to have perfect recoverability without occasional administrative action).  I would really like to see more documentation of failure cases and the appropriate administrative action.  Otherwise this is, I think, a minimum viable addition, and I think we want it.

It is possible I missed that in the documentation.  If so, my objection stands aside.  If it is welcome, I am happy to take a first crack at such docs.

After further testing I am pretty sure I misread the patch.  It looks like one can have multiple resolvers which can, in fact, work through a queue together solving this problem.  So the objection above is not valid and I withdraw that objection.  I will re-review the docs in light of the experience.
 

To my mind that's the only blocker in the code (but see below).  I can say without a doubt that I would expect we would use this feature once available.

------------------

Testing however failed.

make installcheck-world fails with errors like the following:

 -- Modify foreign server and raise an error
  BEGIN;
  INSERT INTO ft7_twophase VALUES(8);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft8_twophase VALUES(NULL); -- violation
! ERROR:  current transaction is aborted, commands ignored until end of transaction block
  ROLLBACK;
  SELECT * FROM ft7_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  -- Rollback foreign transaction that involves both 2PC-capable
  -- and 2PC-non-capable foreign servers.
  BEGIN;
  INSERT INTO ft8_twophase VALUES(7);
+ ERROR:  prepread foreign transactions are disabled
+ HINT:  Set max_prepared_foreign_transactions to a nonzero value.
  INSERT INTO ft9_not_twophase VALUES(7);
+ ERROR:  current transaction is aborted, commands ignored until end of transaction block
  ROLLBACK;
  SELECT * FROM ft8_twophase;
! ERROR:  prepread foreign transactions are disabled
! HINT:  Set max_prepared_foreign_transactions to a nonzero value.

make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new installation.  I think the test cases will have to handle that sort of setup.

make check in the contrib directory passes.

For reasons of test failures, I am setting this back to waiting on author.

------------------
I had a few other thoughts that I figure are worth sharing with the community on this patch, with the idea that once it is in place, this may open up more options for collaboration in the area of federated and distributed storage generally.  I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API).

The new status of this patch is: Waiting on Author


--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Oct 3, 2018 at 6:02 PM Chris Travers <chris.travers@adjust.com> wrote:
>
>
>
> On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:
>>
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, failed
>> Implements feature:       not tested
>> Spec compliant:           not tested
>> Documentation:            tested, failed
>>
>> I am hoping I am not out of order in writing this before the commitfest starts.  The patch is big and long, and so I wanted to start on this while traffic is slow.
>>
>> I find this patch quite welcome and very close to a minimum viable version.  The few significant limitations can be resolved later.  One thing I may have missed in the documentation is a discussion of the limits of the current approach.  I think this would be important to document because the caveats of the current approach are significant, but the people who need it will have the knowledge to work with issues if they come up.
>>
>> The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes through global transactions sequentially and does not move on to the next until the previous one is resolved.  This means that if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A with foreign servers C and D, and server B goes down at the wrong moment, it does not look like the background worker will detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until this is resolved.  Also, if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad transaction to allow processing to continue, plus manual resolution (this is good and necessary because in this area there is no ability to have perfect recoverability without occasional administrative action).  I would really like to see more documentation of failure cases and the appropriate administrative action.  Otherwise this is, I think, a minimum viable addition, and I think we want it.
>>
>> It is possible I missed that in the documentation.  If so, my objection stands aside.  If it is welcome, I am happy to take a first crack at such docs.
>

Thank you for reviewing the patch!

>
> After further testing I am pretty sure I misread the patch.  It looks like one can have multiple resolvers which can, in fact, work through a queue together, solving this problem.  So the objection above is not valid and I withdraw that objection.  I will re-review the docs in light of the experience.

Actually, the patch doesn't solve this problem; the foreign transaction
resolver processes distributed transactions sequentially. But since
one resolver process is responsible for one database, a backend
connecting to another database can still complete its distributed
transaction. I understood your concern and agree that this problem
should be solved. I'll address it in the next patch.

>
>>
>>
>> To my mind that's the only blocker in the code (but see below).  I can say without a doubt that I would expect we would use this feature once available.
>>
>> ------------------
>>
>> Testing however failed.
>>
>> make installcheck-world fails with errors like the following:
>>
>>  -- Modify foreign server and raise an error
>>   BEGIN;
>>   INSERT INTO ft7_twophase VALUES(8);
>> + ERROR:  prepread foreign transactions are disabled
>> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   INSERT INTO ft8_twophase VALUES(NULL); -- violation
>> ! ERROR:  current transaction is aborted, commands ignored until end of transaction block
>>   ROLLBACK;
>>   SELECT * FROM ft7_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   SELECT * FROM ft8_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   -- Rollback foreign transaction that involves both 2PC-capable
>>   -- and 2PC-non-capable foreign servers.
>>   BEGIN;
>>   INSERT INTO ft8_twophase VALUES(7);
>> + ERROR:  prepread foreign transactions are disabled
>> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>   INSERT INTO ft9_not_twophase VALUES(7);
>> + ERROR:  current transaction is aborted, commands ignored until end of transaction block
>>   ROLLBACK;
>>   SELECT * FROM ft8_twophase;
>> ! ERROR:  prepread foreign transactions are disabled
>> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
>>
>> make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new installation.  I think the test cases will have to handle that sort of setup.

'make installcheck' is a regression test mode that runs the tests
against an existing installation. If the installation has the atomic
commit feature disabled (e.g. max_prepared_foreign_transactions is 0),
the tests will fail, because the feature is disabled by default.

>>
>> make check in the contrib directory passes.
>>
>> For reasons of test failures, I am setting this back to waiting on author.
>>
>> ------------------
>> I had a few other thoughts that I figure are worth sharing with the community on this patch, with the idea that once it is in place, this may open up more options for collaboration in the area of federated and distributed storage generally.  I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API).
>>
>> The new status of this patch is: Waiting on Author

Also, I'll update the docs in the next patch, which I'll post this week.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 3, 2018 at 6:02 PM Chris Travers <chris.travers@adjust.com> wrote:
> >
> >
> >
> > On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:
> >>
> >> The following review has been posted through the commitfest application:
> >> make installcheck-world:  tested, failed
> >> Implements feature:       not tested
> >> Spec compliant:           not tested
> >> Documentation:            tested, failed
> >>
> >> I am hoping I am not out of order in writing this before the commitfest starts.  The patch is big and long and so
wanted to start on this while traffic is slow.
> >>
> >> I find this patch quite welcome and very close to a minimum viable version.  The few significant limitations can
be resolved later.  One thing I may have missed in the documentation is a discussion of the limits of the current
approach.  I think this would be important to document because the caveats of the current approach are significant, but
the people who need it will have the knowledge to work with issues if they come up.
> >>
> >> The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes
through global transactions sequentially and does not move on to the next until the previous one is resolved.  This
means that if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A
with foreign servers C and D, if server B goes down at the wrong moment, the background worker does not look like it
will detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until
this is resolved.  Also if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad
transaction to allow processing to continue and manual resolution (this is good and necessary because in this area there
is no ability to have perfect recoverability without occasional administrative action).  I would really like to see more
documentation of failure cases and appropriate administrative action at present.  Otherwise this is I think a minimum
viable addition and I think we want it.
> >>
> >> It is possible I missed that in the documentation.  If so, my objection stands aside.  If it is welcome I am happy
to take a first crack at such docs.
> >
>
> Thank you for reviewing the patch!
>
> >
> > After further testing I am pretty sure I misread the patch.  It looks like one can have multiple resolvers which
can, in fact, work through a queue together solving this problem.  So the objection above is not valid and I withdraw
that objection.  I will re-review the docs in light of the experience.
>
> Actually the patch doesn't solve this problem; the foreign transaction
> resolver processes distributed transactions sequentially. But since
> one resolver process is responsible for one database, a backend
> connecting to another database can complete its distributed
> transaction. I understood your concern and agree that this problem
> should be solved. I'll address it in the next patch.
>
> >
> >>
> >>
> >> To my mind that's the only blocker in the code (but see below).  I can say without a doubt that I would expect we
would use this feature once available.
> >>
> >> ------------------
> >>
> >> Testing however failed.
> >>
> >> make installcheck-world fails with errors like the following:
> >>
> >>  -- Modify foreign server and raise an error
> >>   BEGIN;
> >>   INSERT INTO ft7_twophase VALUES(8);
> >> + ERROR:  prepread foreign transactions are disabled
> >> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
> >>   INSERT INTO ft8_twophase VALUES(NULL); -- violation
> >> ! ERROR:  current transaction is aborted, commands ignored until end of transaction block
> >>   ROLLBACK;
> >>   SELECT * FROM ft7_twophase;
> >> ! ERROR:  prepread foreign transactions are disabled
> >> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
> >>   SELECT * FROM ft8_twophase;
> >> ! ERROR:  prepread foreign transactions are disabled
> >> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
> >>   -- Rollback foreign transaction that involves both 2PC-capable
> >>   -- and 2PC-non-capable foreign servers.
> >>   BEGIN;
> >>   INSERT INTO ft8_twophase VALUES(7);
> >> + ERROR:  prepread foreign transactions are disabled
> >> + HINT:  Set max_prepared_foreign_transactions to a nonzero value.
> >>   INSERT INTO ft9_not_twophase VALUES(7);
> >> + ERROR:  current transaction is aborted, commands ignored until end of transaction block
> >>   ROLLBACK;
> >>   SELECT * FROM ft8_twophase;
> >> ! ERROR:  prepread foreign transactions are disabled
> >> ! HINT:  Set max_prepared_foreign_transactions to a nonzero value.
> >>
> >> make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a
new installation.  I think the test cases will have to handle that sort of setup.
>
> 'make installcheck' is a regression test mode that runs the tests
> against an existing installation. If that installation has the atomic
> commit feature disabled (e.g. max_prepared_foreign_transactions is 0),
> the test will fail because the feature is disabled by default.
>
> >>
> >> make check in the contrib directory passes.
> >>
> >> For reasons of test failures, I am setting this back to waiting on author.
> >>
> >> ------------------
> >> I had a few other thoughts that I figure are worth sharing with the community on this patch with the idea that
once it is in place, this may open up more options for collaboration in the area of federated and distributed storage
generally. I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic
handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this
looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API).
> >>
> >> The new status of this patch is: Waiting on Author
>
> Also, I'll update the docs in the next patch, which I'll post this week.
>

Attached is the updated version of the patches. The changes from the
previous version are:

* Enabled processing of subsequent distributed transactions even when a
previous distributed transaction keeps failing due to a participant
error.
To implement this, I've split the waiting queue into two queues: the
active queue and the retry queue. Each backend first inserts itself
into the active queue and sets its state to FDW_XACT_WAITING. Once the
resolver process fails to resolve a distributed transaction, it moves
the backend's entry from the active queue to the retry queue and
changes its state to FDW_XACT_WAITING_RETRY. Entries in the active
queue are processed at each commit, whereas entries in the retry queue
are processed at intervals of
foreign_transaction_resolution_retry_interval.

* Updated the docs, adding a new section "Distributed Transaction" in
Chapter 33 to explain the concept to users

* Moved atomic commit codes into src/backend/access/fdwxact directory.

* Some bug fixes.

Please review them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro HORIGUCHI
Date:
Hello.

# It took a long time to come here..

At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com>
> On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
...
> * Updated docs, added the new section "Distributed Transaction" at
> Chapter 33 to explain the concept to users
> 
> * Moved atomic commit codes into src/backend/access/fdwxact directory.
> 
> * Some bug fixes.
> 
> Please reivew them.

I have some comments, with apologies in advance for possible
duplicates of or conflicts with others' comments so far.

0001:

This sets XACT_FLAG_WROTENONTEMPREL when a RELPERSISTENCE_PERMANENT
relation is modified. Isn't it also needed when UNLOGGED tables are
modified? It may be better to have a dedicated classification
macro or function.

The flag is handled in heapam.c. I suppose that it should be done
in the upper layer considering the coming pluggable storage.
(X_F_ACCESSEDTEMPREL is set in heapam, but..)


0002:

The name FdwXactParticipantsForAC doesn't sound good to me. How
about FdwXactAtomicCommitParticipants?

Well, as the file comment of fdwxact.c says,
FdwXactRegisterTransaction is called from the FDW driver and
F_X_MarkForeignTransactionModified is called from the executor. I
think that we should clarify who is responsible for the whole
sequence. Since the state of local tables matters, I suppose the
executor is. Couldn't we do the whole thing on the executor
side?  I'm not sure, but I feel that
F_X_RegisterForeignTransaction can be a part of
F_X_MarkForeignTransactionModified.  The callers of
MarkForeignTransactionModified can find whether the table is
involved in 2pc via the IsTwoPhaseCommitEnabled interface.


>     if (foreign_twophase_commit == true &&
>         ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0))
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("cannot COMMIT a distributed transaction that has operated on foreign server that doesn't
support atomic commit")));

The error is emitted when the GUC is turned off in a
transaction that has called MarkTransactionModified. I think that the
number of the variables' possible states should be reduced for
simplicity. For example in this case, once foreign_twophase_commit
is checked in a transaction, subsequent changes to it
should be ignored for the rest of the transaction.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Hello.
>
> # It took a long time to come here..
>
> At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com>
> > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> ...
> > * Updated docs, added the new section "Distributed Transaction" at
> > Chapter 33 to explain the concept to users
> >
> > * Moved atomic commit codes into src/backend/access/fdwxact directory.
> >
> > * Some bug fixes.
> >
> > Please reivew them.
>
> I have some comments, with apologies in advance for possible
> duplicates of or conflicts with others' comments so far.

Thank you so much for reviewing this patch!

>
> 0001:
>
> This sets XACT_FLAG_WROTENONTEMPREL when a RELPERSISTENCE_PERMANENT
> relation is modified. Isn't it also needed when UNLOGGED tables are
> modified? It may be better to have a dedicated classification
> macro or function.

I think even if we do atomic commit when modifying an UNLOGGED
table and a remote table, the data will get inconsistent if the local
server crashes. For example, if the local server crashes after
preparing the transaction on the foreign server but before the local
commit, we will lose all data of the local UNLOGGED table whereas the
modification of the remote table is rolled back. In the case of
persistent tables, data consistency is preserved. So I think keeping
data consistency between remote data and a local UNLOGGED table is
difficult and want to leave it as a restriction for now. Am I missing something?

>
> The flag is handled in heapam.c. I suppose that it should be done
> in the upper layer considering coming pluggable storage.
> (X_F_ACCESSEDTEMPREL is set in heapam, but..)
>

Yeah, or we can set the flag after heap_insert in ExecInsert.

>
> 0002:
>
> The name FdwXactParticipantsForAC doesn't sound good to me. How
> about FdwXactAtomicCommitParticipants?

+1, will fix it.

>
> Well, as the file comment of fdwxact.c says,
> FdwXactRegisterTransaction is called from the FDW driver and
> F_X_MarkForeignTransactionModified is called from the executor. I
> think that we should clarify who is responsible for the whole
> sequence. Since the state of local tables matters, I suppose the
> executor is. Couldn't we do the whole thing on the executor
> side?  I'm not sure, but I feel that
> F_X_RegisterForeignTransaction can be a part of
> F_X_MarkForeignTransactionModified.  The callers of
> MarkForeignTransactionModified can find whether the table is
> involved in 2pc via the IsTwoPhaseCommitEnabled interface.

Indeed. We can register foreign servers in the executor while FDWs
don't need to register anything. I will remove the registration
function so that FDW developers don't need to call it and only need to
provide the atomic commit APIs.

>
>
> >       if (foreign_twophase_commit == true &&
> >               ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0))
> >               ereport(ERROR,
> >                               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >                                errmsg("cannot COMMIT a distributed transaction that has operated on foreign server
that doesn't support atomic commit")));
>
> The error is emitted when the GUC is turned off in a
> transaction that has called MarkTransactionModified. I think that the
> number of the variables' possible states should be reduced for
> simplicity. For example in this case, once foreign_twophase_commit
> is checked in a transaction, subsequent changes to it
> should be ignored for the rest of the transaction.
>

I might not have understood your comment correctly, but since
foreign_twophase_commit is a PGC_USERSET parameter, I think we need to
check it at commit time. Also, we need to keep participant servers
even when foreign_twophase_commit is off if both
max_prepared_foreign_xacts and max_foreign_xact_resolvers are > 0.

I will post the updated patch this week.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> ...
>

Attached is the updated version of the patches.

Based on the review comments from Horiguchi-san, I've changed the
atomic commit API so that FDW developers who wish to support atomic
commit don't need to call the register function. The atomic commit
APIs are as follows:

* GetPrepareId
* PrepareForeignTransaction
* CommitForeignTransaction
* RollbackForeignTransaction
* ResolveForeignTransaction
* IsTwophaseCommitEnabled

All APIs except for GetPrepareId are required for atomic commit.

Also, I've changed the foreign_twophase_commit parameter to an enum
parameter based on the suggestion from Robert[1]. Valid values are
'required', 'prefer' and 'disabled' (default). When set to either
'required' or 'prefer', atomic commit will be used. The difference is
that 'required' requires *all* modified servers to be able to use 2PC,
whereas 'prefer' uses 2PC only where available. So with 'required', if
any written participant disables 2PC or doesn't support the atomic
commit API, the transaction fails. IOW, with 'required' we commit only
when data consistency among all participants can be preserved.

Please review the patches.

[1] https://www.postgresql.org/message-id/CA%2BTgmob4EqxbaMp0e--jUKYT44RL4xBXkPMxF9EEAD%2ByBGAdxw%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> ...
>

Since the previous patch conflicts with current HEAD, attached is an
updated set of patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, Oct 29, 2018 at 6:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> ...
>
> Since the previous patch conflicts with current HEAD, attached is an
> updated set of patches.
>

Rebased and fixed a few bugs.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Thu, Nov 15, 2018 at 7:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 29, 2018 at 6:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI
> > > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > >
> > > > > Hello.
> > > > >
> > > > > # It took a long time to come here..
> > > > >
> > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com>
> > > > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > ...
> > > > > > * Updated docs, added the new section "Distributed Transaction" at
> > > > > > Chapter 33 to explain the concept to users
> > > > > >
> > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory.
> > > > > >
> > > > > > * Some bug fixes.
> > > > > >
> > > > > > Please review them.
> > > > >
> > > > > I have some comments, with apologize in advance for possible
> > > > > duplicate or conflict with others' comments so far.
> > > >
> > > > Thank you so much for reviewing this patch!
> > > >
> > > > >
> > > > > 0001:
> > > > >
> > > > > This sets XACT_FLAG_WROTENONTEMPREL when a RELPERSISTENT_PERMANENT
> > > > > relation is modified. Isn't it needed when UNLOGGED tables are
> > > > > modified? It may be better that we have a dedicated classification
> > > > > macro or function.
> > > >
> > > > I think even if we do atomic commit for modifying an UNLOGGED
> > > > table and a remote table, the data will get inconsistent if the local
> > > > server crashes. For example, if the local server crashes after
> > > > preparing the transaction on the foreign server but before the local
> > > > commit, we will lose all the data of the local UNLOGGED table whereas
> > > > the modification of the remote table is rolled back. In the case of
> > > > persistent tables, data consistency is preserved. So I think keeping
> > > > data consistency between remote data and a local unlogged table is
> > > > difficult and want to leave it as a restriction for now. Am I missing
> > > > something?
> > > >
> > > > >
> > > > > The flag is handled in heapam.c. I suppose that it should be done
> > > > > in the upper layer considering coming pluggable storage.
> > > > > (X_F_ACCESSEDTEMPREL is set in heapam, but..)
> > > > >
> > > >
> > > > Yeah, or we can set the flag after heap_insert in ExecInsert.
> > > >
> > > > >
> > > > > 0002:
> > > > >
> > > > > The name FdwXactParticipantsForAC doesn't sound good to me. How
> > > > > about FdwXactAtomicCommitParticipants?
> > > >
> > > > +1, will fix it.
> > > >
> > > > >
> > > > > Well, as the file comment of fdwxact.c says,
> > > > > FdwXactRegisterTransaction is called from the FDW driver and
> > > > > F_X_MarkForeignTransactionModified is called from the executor. I
> > > > > think that we should clarify who is responsible for the whole
> > > > > sequence. Since the state of local tables affects it, I suppose the
> > > > > executor is. Couldn't we do the whole thing within the executor
> > > > > side?  I'm not sure but I feel that
> > > > > F_X_RegisterForeignTransaction can be a part of
> > > > > F_X_MarkForeignTransactionModified.  The callers of
> > > > > MarkForeignTransactionModified can find whether the table is
> > > > > involved in 2PC via the IsTwoPhaseCommitEnabled interface.
> > > >
> > > > Indeed. The executor can register foreign servers while FDWs don't
> > > > need to register anything. I will remove the registration function so
> > > > that FDW developers don't need to call the register function but only
> > > > need to provide atomic commit APIs.
> > > >
> > > > >
> > > > >
> > > > > >       if (foreign_twophase_commit == true &&
> > > > > >               ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) )
> > > > > >               ereport(ERROR,
> > > > > >                               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > > > > >                                errmsg("cannot COMMIT a distributed transaction that has operated on foreign server that doesn't support atomic commit")));
> > > > >
> > > > > The error is emitted when the GUC is turned off in a
> > > > > transaction where MarkForeignTransactionModified was called. I think
> > > > > that the number of the variables' possible states should be reduced
> > > > > for simplicity. For example in this case, once foreign_twophase_commit
> > > > > is checked in a transaction, subsequent changes in the
> > > > > transaction should be ignored during the transaction.
> > > > >
> > > >
> > > > I might not have gotten your comment correctly, but since
> > > > foreign_twophase_commit is a PGC_USERSET parameter, I think we need to
> > > > check it at commit time. Also, we need to keep participant servers even
> > > > when foreign_twophase_commit is off if both max_prepared_foreign_xacts
> > > > and max_foreign_xact_resolvers are > 0.
> > > >
> > > > I will post the updated patch this week.
> > > >
> > >
> > > Attached the updated version patches.
> > >
> > > Based on the review comment from Horiguchi-san, I've changed the
> > > atomic commit API so that FDW developers who wish to support atomic
> > > commit don't need to call the register function. The atomic commit
> > > APIs are the following:
> > >
> > > * GetPrepareId
> > > * PrepareForeignTransaction
> > > * CommitForeignTransaction
> > > * RollbackForeignTransaction
> > > * ResolveForeignTransaction
> > > * IsTwophaseCommitEnabled
> > >
> > > All APIs except for GetPrepareId are required for atomic commit.
> > >
> > > Also, I've changed the foreign_twophase_commit parameter to an enum
> > > parameter based on the suggestion from Robert[1]. Valid values are
> > > 'required', 'prefer' and 'disabled' (default). When set to either
> > > 'required' or 'prefer' the atomic commit will be used. The difference
> > > is that when set to 'required' we require *all* modified servers to be
> > > able to use 2PC, whereas with 'prefer' we require 2PC where available.
> > > So with 'required', if any of the written participants disables 2PC or
> > > doesn't support the atomic commit API, the transaction fails. IOW,
> > > with 'required' we can commit only when data consistency among all
> > > participants can be preserved.
> > >
> > > Please review the patches.
> > >
> >
> > Since the previous patch conflicts with current HEAD, attached is an
> > updated set of patches.
> >
>
> Rebased and fixed a few bugs.
>

I got feedback regarding the transaction management FDW APIs at the Japan
PostgreSQL Developer Meetup[1] and am considering changing these APIs
to make them consistent with the XA interface[2] (xa_prepare(),
xa_commit() and xa_rollback()) as follows[3].

* FdwXactResult PrepareForeignTransaction(FdwXactState *state, int flags)
* FdwXactResult CommitForeignTransaction(FdwXactState *state, int flags)
* FdwXactResult RollbackForeignTransaction(FdwXactState *state, int flags)
* char *GetPrepareId(TransactionId xid, Oid serverid, Oid userid, int *prep_id_len)

The flags argument sets various settings; currently it would contain only
FDW_XACT_FLAG_ONEPHASE, which requires the FDW to commit in one phase
(i.e. without preparation). The *state argument would contain the
information necessary to identify the transaction: serverid, userid,
usermappingid and prepared id. The GetPrepareId API is optional. Also,
I've removed the two_phase_commit parameter from the postgres_fdw options
because we can disable the use of the two-phase commit protocol for
distributed transactions using the distributed_atomic_commit GUC parameter.
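To make the proposed calling convention concrete, here is a minimal sketch (not the patch's actual code) of how an FDW might implement CommitForeignTransaction under this interface. The FdwXactResult and FdwXactState definitions and the struct fields are stand-in assumptions; only the signatures and FDW_XACT_FLAG_ONEPHASE come from the proposal above.

```c
#include <stdio.h>
#include <assert.h>

/*
 * Hypothetical stand-ins for the types in the proposal above; the real
 * definitions would live in the server's fdwxact headers.
 */
typedef unsigned int Oid;

typedef enum FdwXactResult
{
    FDWXACT_RESULT_OK,
    FDWXACT_RESULT_ERROR
} FdwXactResult;

typedef struct FdwXactState
{
    Oid         serverid;       /* foreign server */
    Oid         userid;         /* user */
    Oid         usermappingid;  /* user mapping */
    const char *prepared_id;    /* identifier from GetPrepareId(), if any */
} FdwXactState;

#define FDW_XACT_FLAG_ONEPHASE 0x01    /* commit without a prior prepare */

/*
 * Sketch of an FDW callback: with ONEPHASE set, issue a plain COMMIT on
 * the remote connection; otherwise finish the previously prepared
 * transaction identified by state->prepared_id.
 */
static FdwXactResult
CommitForeignTransaction(FdwXactState *state, int flags)
{
    char        sql[256];

    if (flags & FDW_XACT_FLAG_ONEPHASE)
        snprintf(sql, sizeof(sql), "COMMIT");
    else
        snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", state->prepared_id);

    /* A real FDW would send `sql` over its remote connection here. */
    (void) sql;
    return FDWXACT_RESULT_OK;
}
```

The flag decides only which remote command is issued; everything else (which server, which user mapping, which prepared-transaction identifier) comes from the state struct that the core transaction manager passes in.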

Foreign transactions whose FDW provides both the CommitForeignTransaction
API and the RollbackForeignTransaction API will be managed by the global
transaction manager automatically. In addition, if the FDW also
provides the PrepareForeignTransaction API, it will take part in the
two-phase commit protocol as a participant. So the existing FDWs that
don't provide the transaction management FDW APIs can continue to work
as before even after this patch gets committed.

The one point I'm concerned about in this API design is that since
both the CommitForeignTransaction API and the RollbackForeignTransaction
API will be used by two different kinds of processes (backend and
transaction resolver processes), it might be hard for FDW developers
to understand them correctly.

I'd like to define new APIs so that FDW developers don't get confused.
Feedback is very welcome.

[1] https://wiki.postgresql.org/wiki/Japan_PostgreSQL_Developer_Meetup
[2] https://en.wikipedia.org/wiki/X/Open_XA
[3] The current API design I'm proposing has 6 APIs: Prepare, Commit,
Rollback, Resolve, IsTwoPhaseEnabled and GetPrepareId. These APIs
are divided based on who executes them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Ildar Musin
Date:
Hello,

The patch needs a rebase as it doesn't apply to the current master. I applied it
to an older commit to test it. It worked fine so far.

I found one bug though which would cause the resolver to finish by timeout even
though there are unresolved foreign transactions in the list. The
`fdw_xact_exists()` function expects the database id as the first argument and
the xid as the second. But everywhere it is called, the arguments are specified
in the opposite order (xid first, then dbid).  Also the function declaration in
the header doesn't match its definition.

There are some other things I found.
* In `FdwXactResolveAllDanglingTransactions()` variable `n_resolved` is
  declared as bool but used as integer.
* In fdwxact.c's module comment there are `FdwXactRegisterForeignTransaction()`
  and `FdwXactMarkForeignTransactionModified()` functions mentioned that are
  not there anymore.
* In documentation (storage.sgml) there is no mention of `pg_fdw_xact`
  directory.

Couple of stylistic notes.
* In `FdwXactCtlData struct` there are both camel case and snake case naming
  used.
* In `get_fdw_xacts()` `xid != InvalidTransactionId` can be replaced with
  `TransactionIdIsValid(xid)`.
* In `generate_fdw_xact_identifier()` the `fx` prefix could be a part of format
  string instead of being processed by `sprintf` as an extra argument.

I'll continue looking into the patch. Thanks!




Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Tue, Jan 29, 2019 at 5:47 PM Ildar Musin <ildar@adjust.com> wrote:
>
> Hello,
>
> The patch needs rebase as it doesn't apply to the current master. I applied it
> to the older commit to test it. It worked fine so far.

Thank you for testing the patch!

>
> I found one bug though which would cause resolver to finish by timeout even
> though there are unresolved foreign transactions in the list. The
> `fdw_xact_exists()` function expects database id as the first argument and xid
> as the second. But everywhere it is called arguments specified in the different
> order (xid first, then dbid).  Also function declaration in header doesn't
> match its definition.

Will fix.

>
> There are some other things I found.
> * In `FdwXactResolveAllDanglingTransactions()` variable `n_resolved` is
>   declared as bool but used as integer.
> * In fdwxact.c's module comment there are `FdwXactRegisterForeignTransaction()`
>   and `FdwXactMarkForeignTransactionModified()` functions mentioned that are
>   not there anymore.
> * In documentation (storage.sgml) there is no mention of `pg_fdw_xact`
>   directory.
>
> Couple of stylistic notes.
> * In `FdwXactCtlData struct` there are both camel case and snake case naming
>   used.
> * In `get_fdw_xacts()` `xid != InvalidTransactionId` can be replaced with
>   `TransactionIdIsValid(xid)`.
> * In `generate_fdw_xact_identifier()` the `fx` prefix could be a part of format
>   string instead of being processed by `sprintf` as an extra argument.
>

I'll incorporate them at the next patch set.

> I'll continue looking into the patch. Thanks!

Thanks. Actually I'm updating the patch set, changing the API interface as
I proposed before and improving the documentation and README. I'll submit
the latest patch next week.


--
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Michael Paquier
Date:
On Thu, Jan 31, 2019 at 11:09:09AM +0100, Masahiko Sawada wrote:
> Thanks. Actually I'm updating the patch set, changing API interface as
> I proposed before and improving the document and README. I'll submit
> the latest patch next week.

Cool, I have moved the patch to next CF.
--
Michael

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Thu, Jan 31, 2019 at 7:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 5:47 PM Ildar Musin <ildar@adjust.com> wrote:
> >
> > Hello,
> >
> > The patch needs rebase as it doesn't apply to the current master. I applied it
> > to the older commit to test it. It worked fine so far.
>
> Thank you for testing the patch!
>
> >
> > I found one bug though which would cause resolver to finish by timeout even
> > though there are unresolved foreign transactions in the list. The
> > `fdw_xact_exists()` function expects database id as the first argument and xid
> > as the second. But everywhere it is called arguments specified in the different
> > order (xid first, then dbid).  Also function declaration in header doesn't
> > match its definition.
>
> Will fix.
>
> >
> > There are some other things I found.
> > * In `FdwXactResolveAllDanglingTransactions()` variable `n_resolved` is
> >   declared as bool but used as integer.
> > * In fdwxact.c's module comment there are `FdwXactRegisterForeignTransaction()`
> >   and `FdwXactMarkForeignTransactionModified()` functions mentioned that are
> >   not there anymore.
> > * In documentation (storage.sgml) there is no mention of `pg_fdw_xact`
> >   directory.
> >
> > Couple of stylistic notes.
> > * In `FdwXactCtlData struct` there are both camel case and snake case naming
> >   used.
> > * In `get_fdw_xacts()` `xid != InvalidTransactionId` can be replaced with
> >   `TransactionIdIsValid(xid)`.
> > * In `generate_fdw_xact_identifier()` the `fx` prefix could be a part of format
> >   string instead of being processed by `sprintf` as an extra argument.
> >
>
> I'll incorporate them at the next patch set.
>
> > I'll continue looking into the patch. Thanks!
>
> Thanks. Actually I'm updating the patch set, changing API interface as
> I proposed before and improving the document and README. I'll submit
> the latest patch next week.
>

Sorry for the very late reply. Attached are the updated version patches.

The basic mechanism has not changed since the previous version,
but the updated patch uses a single wait queue instead of the
two queues (active and retry) that were used previously.

Every backend process has a timestamp in PGPROC
(fdwXactNextResolutionTs), which is the time at which it expects to be
processed by the foreign resolver process. Entries in the wait queue are
ordered by their timestamps. The wait queue and timestamp are used
after a backend process has prepared all transactions on foreign servers
and waits for all of them to be resolved.

Backend processes that are committing/aborting a distributed
transaction insert themselves into the wait queue
(FdwXactRslvCtl->fdwxact_queue) with the current timestamp, and then
request launching a new resolver process if one is not launched yet. If
there is already a resolver connected to the same database, they just set
its latch. The wait queue is protected by the LWLock
FdwXactResolutionLock. Then the backend sleeps until either the user
requests a cancel (presses Ctrl-C) or it is woken up by the resolver process.

The foreign resolver process continues to poll the wait queue, checking
whether there is any waiter on the database that the resolver process
connects to. If there is a waiter, it fetches it and checks its
timestamp. If the current timestamp is past the waiter's timestamp, the
resolver process starts to resolve all foreign transactions. Usually
backend processes insert themselves into the wait queue first and then
wake up the resolver; since both use the same wall clock, the resolver
can fetch the waiter just inserted. Once all foreign transactions are
resolved, the resolver process deletes the backend entry from the wait
queue, and then wakes up the waiting backend.

On failure during foreign transaction resolution, while the backend is
still sleeping, the resolver process removes the backend and re-inserts
it with a new timestamp (its timestamp plus
foreign_transaction_resolution_interval) at the appropriate position in
the wait queue. This mechanism ensures that a distributed transaction is
resolved as soon as the waiter is inserted, while ensuring that the
resolver can retry resolving the failed foreign transactions at an
interval of foreign_transaction_resolution_interval.
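The queue manipulation described above can be sketched with an ordinary sorted linked list. This is a single-process simplification: the real queue lives in shared memory and is protected by FdwXactResolutionLock, and the Waiter type and helper names here are illustrative only.

```c
#include <stddef.h>
#include <assert.h>

/* Illustrative waiter entry, ordered by its resolution timestamp. */
typedef struct Waiter
{
    long        resolution_ts;  /* analogous to fdwXactNextResolutionTs */
    struct Waiter *next;
} Waiter;

/* Insert keeping the list sorted by timestamp, earliest first. */
static void
queue_insert(Waiter **headp, Waiter *w)
{
    while (*headp && (*headp)->resolution_ts <= w->resolution_ts)
        headp = &(*headp)->next;
    w->next = *headp;
    *headp = w;
}

/*
 * Retry path: unlink the failed waiter, advance its timestamp by the
 * retry interval, and re-insert it at its new sorted position so that
 * other waiters are still served first.
 */
static void
queue_requeue(Waiter **headp, Waiter *w, long retry_interval)
{
    Waiter    **p = headp;

    while (*p && *p != w)
        p = &(*p)->next;
    if (*p == w)
        *p = w->next;           /* unlink */
    w->resolution_ts += retry_interval;
    queue_insert(headp, w);
}
```

Because the queue is kept sorted, the resolver only ever has to look at the head entry to decide whether any waiter is due, and a requeued waiter naturally falls behind waiters that were inserted in the meantime.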

For handling in-doubt transactions, I've removed the automatic foreign
transaction resolution code from the first version of the patch since
it's not an essential feature and we can add it later. Therefore the user
needs to resolve unresolved foreign transactions manually using the
pg_resolve_fdwxacts() function in three cases: when the foreign server
crashed or we lost connectivity to it while preparing the foreign
transaction, when the coordinator node crashed while preparing/resolving
the foreign transaction, and when the user canceled resolving the
foreign transaction.
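What manual resolution has to decide can be sketched as the standard two-phase-commit rule (this is the general rule, not code from the patch, and the enum and function names below are hypothetical): a prepared foreign transaction left in doubt is committed only if the coordinator-side transaction is known to have committed, rolled back if it is known to have aborted, and retried later otherwise.

```c
#include <assert.h>

/*
 * Hypothetical outcome of the coordinator's local transaction, as it
 * would be determined from the commit log during recovery.
 */
typedef enum
{
    LOCAL_XACT_COMMITTED,
    LOCAL_XACT_ABORTED,
    LOCAL_XACT_UNKNOWN
} LocalXactStatus;

/* Hypothetical action to take on the in-doubt prepared transaction. */
typedef enum
{
    RESOLVE_COMMIT,
    RESOLVE_ROLLBACK,
    RESOLVE_RETRY_LATER
} ResolveAction;

/*
 * Standard in-doubt resolution rule: follow the coordinator's outcome;
 * if it is unknown (e.g. the coordinator is unreachable), keep the
 * prepared transaction and retry later.
 */
static ResolveAction
decide_resolution(LocalXactStatus coordinator_status)
{
    switch (coordinator_status)
    {
        case LOCAL_XACT_COMMITTED:
            return RESOLVE_COMMIT;
        case LOCAL_XACT_ABORTED:
            return RESOLVE_ROLLBACK;
        default:
            return RESOLVE_RETRY_LATER;
    }
}
```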

Foreign transaction resolver processes exit if they have had no foreign
transaction to resolve for longer than
foreign_transaction_resolver_timeout. Since we cannot drop a database
while a resolver process is connected to it, we can stop the resolver by
calling the pg_stop_fdwxact_resolver() function.

The comment at the top of fdwxact.c describes the locking mechanism
and recovery, and src/backend/fdwxact/README describes the state
transitions of FdwXact.

Also the wiki page[1] describes how to use this feature with some examples.

[1] https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Thomas Munro
Date:
On Wed, Apr 17, 2019 at 10:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Sorry for the very late. Attached updated version patches.

Hello Sawada-san,

Can we please have a fresh rebase?

Thanks,

-- 
Thomas Munro
https://enterprisedb.com



Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Apr 17, 2019 at 10:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Sorry for the very late. Attached updated version patches.
>
> Hello Sawada-san,
>
> Can we please have a fresh rebase?
>

Thank you for the notice. Attached rebased patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Alvaro Herrera
Date:
Hello Sawada-san,

On 2019-Jul-02, Masahiko Sawada wrote:

> On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:

> > Can we please have a fresh rebase?
> 
> Thank you for the notice. Attached rebased patches.

... and again?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Wed, Sep 4, 2019 at 7:36 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Hello Sawada-san,
>
> On 2019-Jul-02, Masahiko Sawada wrote:
>
> > On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> > > Can we please have a fresh rebase?
> >
> > Thank you for the notice. Attached rebased patches.
>
> ... and again?
>

Thank you for the notice. I've attached rebased patch set.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Wed, Sep 4, 2019 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 4, 2019 at 7:36 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > Hello Sawada-san,
> >
> > On 2019-Jul-02, Masahiko Sawada wrote:
> >
> > > On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > > > Can we please have a fresh rebase?
> > >
> > > Thank you for the notice. Attached rebased patches.
> >
> > ... and again?
> >
>
> Thank you for the notice. I've attached rebased patch set.

I forgot to include some new header files. Attached the updated patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Michael Paquier
Date:
On Wed, Sep 04, 2019 at 12:44:20PM +0900, Masahiko Sawada wrote:
> I forgot to include some new header files. Attached the updated patches.

No reviews since and the patch does not apply anymore.  I am moving it
to next CF, waiting on author.
--
Michael

Attachment

Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
Hello.

This is the rebased (and a bit fixed) version of the patch. It
applies on the master HEAD and passes all provided tests.

I took over this work from Sawada-san. I'll begin by reviewing the
current patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 733f1e413ef2b2fe1d3ecba41eb4cd8e355ab826 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 16:59:47 +0900
Subject: [PATCH v26 1/5] Keep track of writing on non-temporary relation

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 src/backend/executor/nodeModifyTable.c | 12 ++++++++++++
 src/include/access/xact.h              |  6 ++++++
 2 files changed, 18 insertions(+)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e3eb9d7b90..cd91f9c8a8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -587,6 +587,10 @@ ExecInsert(ModifyTableState *mtstate,
                                estate->es_output_cid,
                                0, NULL);
 
+            /* Make note that we've written to a non-temporary relation */
+            if (RelationNeedsWAL(resultRelationDesc))
+                MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
             /* insert index entries for tuple */
             if (resultRelInfo->ri_NumIndices > 0)
                 recheckIndexes = ExecInsertIndexTuples(slot, estate, false, NULL,
@@ -938,6 +942,10 @@ ldelete:;
     if (tupleDeleted)
         *tupleDeleted = true;
 
+    /* Make note that we've written to a non-temporary relation */
+    if (RelationNeedsWAL(resultRelationDesc))
+        MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
     /*
      * If this delete is the result of a partition key update that moved the
      * tuple to a new partition, put this row into the transition OLD TABLE,
@@ -1447,6 +1455,10 @@ lreplace:;
             recheckIndexes = ExecInsertIndexTuples(slot, estate, false, NULL, NIL);
     }
 
+    /* Make note that we've written to a non-temporary relation */
+    if (RelationNeedsWAL(resultRelationDesc))
+        MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
     if (canSetTag)
         (estate->es_processed)++;
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 9d2899dea1..cb5c4935d2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -102,6 +102,12 @@ extern int    MyXactFlags;
  */
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK    (1U << 1)
 
+/*
+ * XACT_FLAGS_WROTENONTEMPREL - set when we write data to a non-temporary
+ * relation.
+ */
+#define XACT_FLAGS_WROTENONTEMPREL                (1U << 2)
+
 /*
  *    start- and end-of-transaction callbacks for dynamically loaded modules
  */
-- 
2.23.0

From d21c72a7db85c2211504f60fca8d39c0bd0ee5a6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 17:00:50 +0900
Subject: [PATCH v26 2/5] Support atomic commit among multiple foreign servers.

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 src/backend/access/Makefile                   |    2 +-
 src/backend/access/fdwxact/Makefile           |   17 +
 src/backend/access/fdwxact/README             |  130 +
 src/backend/access/fdwxact/fdwxact.c          | 2816 +++++++++++++++++
 src/backend/access/fdwxact/launcher.c         |  644 ++++
 src/backend/access/fdwxact/resolver.c         |  344 ++
 src/backend/access/rmgrdesc/Makefile          |    1 +
 src/backend/access/rmgrdesc/fdwxactdesc.c     |   58 +
 src/backend/access/rmgrdesc/xlogdesc.c        |    6 +-
 src/backend/access/transam/rmgr.c             |    1 +
 src/backend/access/transam/twophase.c         |   42 +
 src/backend/access/transam/xact.c             |   27 +-
 src/backend/access/transam/xlog.c             |   34 +-
 src/backend/catalog/system_views.sql          |   11 +
 src/backend/commands/copy.c                   |    6 +
 src/backend/commands/foreigncmds.c            |   30 +
 src/backend/executor/execPartition.c          |    8 +
 src/backend/executor/nodeForeignscan.c        |   24 +
 src/backend/executor/nodeModifyTable.c        |   18 +
 src/backend/foreign/foreign.c                 |   57 +
 src/backend/postmaster/bgworker.c             |    8 +
 src/backend/postmaster/pgstat.c               |   20 +
 src/backend/postmaster/postmaster.c           |   15 +-
 src/backend/replication/logical/decode.c      |    1 +
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/ipc/procarray.c           |   46 +
 src/backend/storage/lmgr/lwlocknames.txt      |    3 +
 src/backend/storage/lmgr/proc.c               |    8 +
 src/backend/tcop/postgres.c                   |   14 +
 src/backend/utils/misc/guc.c                  |   82 +
 src/backend/utils/misc/postgresql.conf.sample |   16 +
 src/backend/utils/probes.d                    |    2 +
 src/bin/initdb/initdb.c                       |    1 +
 src/bin/pg_controldata/pg_controldata.c       |    2 +
 src/bin/pg_resetwal/pg_resetwal.c             |    2 +
 src/bin/pg_waldump/fdwxactdesc.c              |    1 +
 src/bin/pg_waldump/rmgrdesc.c                 |    1 +
 src/include/access/fdwxact.h                  |  165 +
 src/include/access/fdwxact_launcher.h         |   29 +
 src/include/access/fdwxact_resolver.h         |   23 +
 src/include/access/fdwxact_xlog.h             |   54 +
 src/include/access/resolver_internal.h        |   66 +
 src/include/access/rmgrlist.h                 |    1 +
 src/include/access/twophase.h                 |    1 +
 src/include/access/xact.h                     |    7 +
 src/include/access/xlog_internal.h            |    1 +
 src/include/catalog/pg_control.h              |    1 +
 src/include/catalog/pg_proc.dat               |   29 +
 src/include/foreign/fdwapi.h                  |   12 +
 src/include/foreign/foreign.h                 |    1 +
 src/include/pgstat.h                          |    9 +-
 src/include/storage/proc.h                    |   11 +
 src/include/storage/procarray.h               |    5 +
 src/include/utils/guc_tables.h                |    3 +
 src/test/regress/expected/rules.out           |   13 +
 55 files changed, 4917 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/fdwxact/Makefile
 create mode 100644 src/backend/access/fdwxact/README
 create mode 100644 src/backend/access/fdwxact/fdwxact.c
 create mode 100644 src/backend/access/fdwxact/launcher.c
 create mode 100644 src/backend/access/fdwxact/resolver.c
 create mode 100644 src/backend/access/rmgrdesc/fdwxactdesc.c
 create mode 120000 src/bin/pg_waldump/fdwxactdesc.c
 create mode 100644 src/include/access/fdwxact.h
 create mode 100644 src/include/access/fdwxact_launcher.h
 create mode 100644 src/include/access/fdwxact_resolver.h
 create mode 100644 src/include/access/fdwxact_xlog.h
 create mode 100644 src/include/access/resolver_internal.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 0880e0a8bb..49480dd039 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS        = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-              table tablesample transam
+              table tablesample transam fdwxact
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/fdwxact/Makefile b/src/backend/access/fdwxact/Makefile
new file mode 100644
index 0000000000..0207a66fb4
--- /dev/null
+++ b/src/backend/access/fdwxact/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/fdwxact
+#
+# IDENTIFICATION
+#    src/backend/access/fdwxact/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/fdwxact
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = fdwxact.o resolver.o launcher.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/fdwxact/README b/src/backend/access/fdwxact/README
new file mode 100644
index 0000000000..46ccb7eeae
--- /dev/null
+++ b/src/backend/access/fdwxact/README
@@ -0,0 +1,130 @@
+src/backend/access/fdwxact/README
+
+Atomic Commit for Distributed Transactions
+===========================================
+
+The atomic commit feature enables us to either commit on all of the involved
+foreign servers or on none of them. This ensures that the data is always left
+in a consistent state in terms of the federated database.
+
+
+Commit Sequence of Global Transactions
+--------------------------------------
+
+We employ the two-phase commit protocol to commit among all foreign
+servers atomically. The sequence of a distributed transaction commit consists
+of the following four steps:
+
+1. Foreign Server Registration
+During executor node initialization, accessed foreign servers are registered
+to the list FdwXactAtomicCommitParticipants, which is maintained by
+PostgreSQL's global transaction manager (GTM), as distributed transaction
+participants. The registered foreign transactions are tracked until the end
+of the transaction.
+
+2. Pre-Commit phase (1st phase of two-phase commit)
+We record the corresponding WAL indicating that the foreign server is involved
+with the current transaction before preparing all foreign transactions.
+Thus, in case we lose connectivity to the foreign server or crash ourselves,
+we will remember that we might have prepared a transaction on the foreign
+server, and try to resolve it when connectivity is restored or after crash
+recovery.
+
+Two-phase commit is required only if the transaction modified two or more
+servers including the local node. Otherwise, we can commit them at this step
+by calling the CommitForeignTransaction() API; no further operation is needed.
+
+After that, we prepare all foreign transactions by calling the
+PrepareForeignTransaction() API. If we fail on any of them we switch to
+rollback; at this point some participants might be prepared whereas
+others are not. The former foreign transactions need to be resolved
+manually using pg_resolve_foreign_xact() and the latter end the transaction
+in one phase by calling the RollbackForeignTransaction() API.
+
+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.
+
+4. Post-Commit Phase (2nd phase of two-phase commit)
+The steps so far are done by the backend process committing the transaction,
+but this resolution step (commit or rollback) is done by the foreign
+transaction resolver process. The backend process inserts itself into the
+wait queue and then wakes up the resolver process (or requests to launch a
+new one if necessary). The resolver process dequeues the waiter and fetches
+the distributed transaction information that the backend is waiting for. Once
+all foreign transactions are committed or rolled back, it wakes up the waiter.
+
+
+API Contract With Transaction Management Callback Functions
+-----------------------------------------------------------
+
+The core GTM manages the status of individual foreign transactions and calls
+transaction management callback functions according to that status. The
+callback functions PrepareForeignTransaction, CommitForeignTransaction and
+RollbackForeignTransaction are responsible for PREPARE, COMMIT and
+ROLLBACK of the transaction on the foreign server, respectively.
+FdwXactRslvState->flags may contain FDWXACT_FLAG_ONEPHASE, meaning the FDW can
+commit or rollback the foreign transaction in one phase. On failure while
+processing a foreign transaction, the FDW needs to raise an error. However,
+the FDW must accept an ERRCODE_UNDEFINED_OBJECT error while committing or
+rolling back a foreign transaction, because there is a race condition: the
+coordinator could crash between the completion of the resolution and the
+writing of the WAL record removing the FdwXact entry.
+
+
+Foreign Transactions Status
+----------------------------
+
+Every foreign transaction has an FdwXact entry. When preparing a foreign
+transaction, an FdwXact entry whose status starts at FDWXACT_STATUS_INITIAL
+is created with WAL logging. The status changes to FDWXACT_STATUS_PREPARED
+after the foreign transaction is prepared, and it changes to
+FDWXACT_STATUS_PREPARING, FDWXACT_STATUS_COMMITTING and FDWXACT_STATUS_ABORTING
+before the foreign transaction is prepared, committed and aborted by FDW
+callback functions, respectively (*1). The status then changes to
+FDWXACT_STATUS_RESOLVED once the foreign transaction is resolved, and
+the corresponding FdwXact entry is removed with WAL logging. If we fail while
+processing a foreign transaction (e.g. preparing, committing or aborting) the
+status changes back to the previous status. Therefore, the statuses
+FDWXACT_STATUS_xxxING appear only while the foreign transaction is being
+processed by an FDW callback function.
+
+FdwXact entries recovered during recovery are marked as in-doubt if the
+corresponding local transaction is not a prepared transaction. Their initial
+status is FDWXACT_STATUS_PREPARED (*2). Because the foreign transaction was
+being processed, we cannot know its exact status, so we regard it as PREPARED
+for safety.
+
+The foreign transaction status transition is illustrated by the following graph
+describing the FdwXact->status:
+
+ +----------------------------------------------------+
+ |                      INVALID                       |
+ +----------------------------------------------------+
+    |                      |                       |
+    |                      v                       |
+    |           +---------------------+            |
+    |           |       INITIAL       |            |
+    |           +---------------------+            |
+   (*2)                    |                      (*2)
+    |                      v                       |
+    |           +---------------------+            |
+    |           |    PREPARING(*1)    |            |
+    |           +---------------------+            |
+    |                      |                       |
+    v                      v                       v
+ +----------------------------------------------------+
+ |                      PREPARED                      |
+ +----------------------------------------------------+
+           |                               |
+           v                               v
+ +--------------------+          +--------------------+
+ |   COMMITTING(*1)   |          |    ABORTING(*1)    |
+ +--------------------+          +--------------------+
+           |                               |
+           v                               v
+ +----------------------------------------------------+
+ |                      RESOLVED                      |
+ +----------------------------------------------------+
+
+(*1) Statuses that appear only while being processed by the FDW
+(*2) Paths for recovered FdwXact entries
diff --git a/src/backend/access/fdwxact/fdwxact.c b/src/backend/access/fdwxact/fdwxact.c
new file mode 100644
index 0000000000..058a416f81
--- /dev/null
+++ b/src/backend/access/fdwxact/fdwxact.c
@@ -0,0 +1,2816 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact.c
+ *        PostgreSQL global transaction manager for foreign servers.
+ *
+ * To achieve commit among all foreign servers atomically, we employ the
+ * two-phase commit protocol, which is a type of atomic commitment
+ * protocol (ACP). The basic strategy is that we prepare all of the remote
+ * transactions before committing locally and commit them after committing
+ * locally.
+ *
+ * During executor node initialization, executor nodes can register a foreign
+ * server by calling either RegisterFdwXactByRelId() or
+ * RegisterFdwXactByServerId() to have it participate in a group for global
+ * commit. A foreign server is registered if its FDW has both the
+ * CommitForeignTransaction and RollbackForeignTransaction APIs. Registered
+ * participant servers are identified by the OIDs of the foreign server and user.
+ *
+ * During pre-commit of the local transaction, we prepare the transaction on
+ * every foreign server. And after committing or rolling back locally,
+ * we notify the resolver process and tell it to commit or rollback those
+ * transactions. If we ask it to commit, we also tell it to notify us when
+ * it's done, so that we can wait interruptibly for it to finish, and so
+ * that we're not trying to locally do work that might fail after the foreign
+ * transactions are committed.
+ *
+ * The best-performing way to manage the waiting backends is to have a
+ * queue of waiting backends, so that we can avoid searching through all
+ * foreign transactions each time we receive a request. We have one queue
+ * whose elements are ordered by the timestamp at which they expect to be
+ * processed. Before waiting for foreign transactions to be resolved, the
+ * backend enqueues itself with the timestamp at which it expects to be
+ * processed. Similarly, if it fails to resolve them, it enqueues again with
+ * a new timestamp (its timestamp + foreign_xact_resolution_interval).
+ *
+ * If a network failure or server crash occurs, or the user stops waiting,
+ * prepared foreign transactions are left in the in-doubt state (aka in-doubt
+ * transactions). Foreign transactions in the in-doubt state are not resolved
+ * automatically, so they must be processed manually using the
+ * pg_resolve_fdwxact() function.
+ *
+ * The two-phase commit protocol is required if the transaction modified two
+ * or more servers including itself. Otherwise, all foreign transactions are
+ * committed or rolled back during pre-commit.
+ *
+ * LOCKING
+ *
+ * Whenever a foreign transaction is processed by an FDW, the corresponding
+ * FdwXact entry is updated. In order to protect the entry from concurrent
+ * removal, we need to hold a lock on the entry or a lock on the entire global
+ * array. However, we don't want to hold the lock while the FDW is processing
+ * the foreign transaction, which may take an unpredictable time. To avoid
+ * this, the in-memory data of foreign transactions follows a locking model
+ * based on four linked concepts:
+ *
+ * * A foreign transaction's status variable is switched using the LWLock
+ *   FdwXactLock, which needs to be held in exclusive mode when updating the
+ *   status, while readers need to hold it in shared mode when looking at the
+ *   status.
+ * * A process that is going to update an FdwXact entry cannot process a
+ *   foreign transaction that is being resolved.
+ * * So setting the status to FDWXACT_STATUS_PREPARING,
+ *   FDWXACT_STATUS_COMMITTING or FDWXACT_STATUS_ABORTING, which are the
+ *   foreign transaction in-progress states, means owning the FdwXact entry,
+ *   which protects it from being updated or removed by concurrent writers.
+ * * Individual fields are protected by a mutex, and only the backend owning
+ *   the foreign transaction is authorized to update the fields of its own
+ *   entry.
+ *
+ * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a
+ * process who is going to call transaction callback functions needs to change
+ * the status to the corresponding status above while holding FdwXactLock in
+ * exclusive mode, and call callback function after releasing the lock.
+ *
+ * RECOVERY
+ *
+ * During WAL replay and replication, FdwXactCtl also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdwxact records happens by the following rules:
+ *
+ * * At the beginning of recovery, pg_fdwxact is scanned once, filling FdwXact
+ *   with entries marked with fdwxact->inredo and fdwxact->ondisk. FdwXact file
+ *   data older than the XID horizon of the redo position are discarded.
+ * * On PREPARE redo, the foreign transaction is added to FdwXactCtl->fdwxacts.
+ *   We set fdwxact->inredo to true for such entries.
+ * * On Checkpoint we iterate through FdwXactCtl->fdwxacts entries that
+ *   have fdwxact->inredo set and are behind the redo_horizon. We save
+ *   them to disk and then set fdwxact->ondisk to true.
+ * * On resolution we delete the entry from FdwXactCtl->fdwxacts. If
+ *   fdwxact->ondisk is true, the corresponding entry from the disk is
+ *   additionally deleted.
+ * * RecoverFdwXacts() and PrescanFdwXacts() have been modified to go through
+ *   fdwxact->inredo entries that have not made it to disk.
+ *
+ * These replay rules are borrowed from twophase.c
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *    src/backend/access/fdwxact/fdwxact.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/fdwxact_launcher.h"
+#include "access/fdwxact_xlog.h"
+#include "access/resolver_internal.h"
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/fdwapi.h"
+#include "foreign/foreign.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "parser/parsetree.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/ps_status.h"
+#include "utils/rel.h"
+#include "utils/snapmgr.h"
+
+/* Atomic commit is enabled by configuration */
+#define IsForeignTwophaseCommitEnabled() \
+    (max_prepared_foreign_xacts > 0 && \
+     max_foreign_xact_resolvers > 0)
+
+/* Foreign twophase commit is enabled and requested by user */
+#define IsForeignTwophaseCommitRequested() \
+    (IsForeignTwophaseCommitEnabled() && \
+     (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED))
+
+/* Check whether the FdwXactParticipant is capable of two-phase commit */
+#define IsSeverCapableOfTwophaseCommit(fdw_part) \
+    (((FdwXactParticipant *)(fdw_part))->prepare_foreign_xact_fn != NULL)
+
+/* Check whether the FdwXact is being resolved */
+#define FdwXactIsBeingResolved(fx) \
+    (((((FdwXact)(fx))->status) == FDWXACT_STATUS_PREPARING) || \
+     ((((FdwXact)(fx))->status) == FDWXACT_STATUS_COMMITTING) || \
+     ((((FdwXact)(fx))->status) == FDWXACT_STATUS_ABORTING))
+
+/*
+ * Structure to bundle a foreign transaction participant. This struct
+ * is created at the beginning of execution for each foreign server and
+ * is used until the end of the transaction, where we cannot look at syscaches.
+ * Therefore, this is allocated in the TopTransactionContext.
+ */
+typedef struct FdwXactParticipant
+{
+    /*
+     * Pointer to a FdwXact entry in the global array. NULL if the entry
+     * is not inserted yet but this is registered as a participant.
+     */
+    FdwXact        fdwxact;
+
+    /* Foreign server and user mapping info, passed to callback routines */
+    ForeignServer    *server;
+    UserMapping        *usermapping;
+
+    /* Transaction identifier used for PREPARE */
+    char            *fdwxact_id;
+
+    /* true if modified the data on the server */
+    bool            modified;
+
+    /* Callbacks for foreign transaction */
+    PrepareForeignTransaction_function    prepare_foreign_xact_fn;
+    CommitForeignTransaction_function    commit_foreign_xact_fn;
+    RollbackForeignTransaction_function    rollback_foreign_xact_fn;
+    GetPrepareId_function                get_prepareid_fn;
+} FdwXactParticipant;
+
+/*
+ * List of foreign transaction participants for atomic commit. This list
+ * contains only foreign servers that provide transaction management callbacks,
+ * that is, CommitForeignTransaction and RollbackForeignTransaction.
+ */
+static List *FdwXactParticipants = NIL;
+static bool ForeignTwophaseCommitIsRequired = false;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDWXACTS_DIR "pg_fdwxact"
+
+/*
+ * The name of a foreign prepared transaction file consists of the 8-byte
+ * database oid, xid, foreign server oid and user oid, separated by '_'.
+ *
+ * Since an FdwXact state file is created per foreign transaction in a
+ * distributed transaction and the xid of an unresolved distributed
+ * transaction is never reused, the name is sufficient to ensure
+ * uniqueness.
+ */
+#define FDWXACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8 + 1 + 8)
+#define FdwXactFilePath(path, dbid, xid, serverid, userid)    \
+    snprintf(path, MAXPGPATH, FDWXACTS_DIR "/%08X_%08X_%08X_%08X", \
+             dbid, xid, serverid, userid)
+
+/* Guc parameters */
+int    max_prepared_foreign_xacts = 0;
+int    max_foreign_xact_resolvers = 0;
+int foreign_twophase_commit = FOREIGN_TWOPHASE_COMMIT_DISABLED;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+static FdwXact FdwXactInsertFdwXactEntry(TransactionId xid,
+                                         FdwXactParticipant *fdw_part);
+static void FdwXactPrepareForeignTransactions(void);
+static void FdwXactOnePhaseEndForeignTransaction(FdwXactParticipant *fdw_part,
+                                                 bool for_commit);
+static void FdwXactResolveForeignTransaction(FdwXact fdwxact,
+                                             FdwXactRslvState *state,
+                                             FdwXactStatus fallback_status);
+static void FdwXactComputeRequiredXmin(void);
+static void FdwXactCancelWait(void);
+static void FdwXactRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn);
+static void FdwXactRedoRemove(Oid dbid, TransactionId xid, Oid serverid,
+                              Oid userid, bool give_warnings);
+static void FdwXactQueueInsert(PGPROC *waiter);
+static void AtProcExit_FdwXact(int code, Datum arg);
+static void ForgetAllFdwXactParticipants(void);
+static char *ReadFdwXactFile(Oid dbid, TransactionId xid, Oid serverid,
+                             Oid userid);
+static void RemoveFdwXactFile(Oid dbid, TransactionId xid, Oid serverid,
+                              Oid userid, bool giveWarning);
+static void RecreateFdwXactFile(Oid dbid, TransactionId xid, Oid serverid,
+                                Oid userid,    void *content, int len);
+static void XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len);
+static char *ProcessFdwXactBuffer(Oid dbid, TransactionId local_xid,
+                                  Oid serverid, Oid userid,
+                                  XLogRecPtr insert_start_lsn,
+                                  bool from_disk);
+static void FdwXactDetermineTransactionFate(FdwXact fdwxact, bool need_lock);
+static bool is_foreign_twophase_commit_required(void);
+static void register_fdwxact(Oid serverid, Oid userid, bool modified);
+static List *get_fdwxacts(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+                          bool including_indoubts, bool include_in_progress,
+                          bool need_lock);
+static FdwXact get_all_fdwxacts(int *num_p);
+static FdwXact insert_fdwxact(Oid dbid, TransactionId xid, Oid serverid,
+                              Oid userid, Oid umid, char *fdwxact_id);
+static char *get_fdwxact_identifier(FdwXactParticipant *fdw_part,
+                                    TransactionId xid);
+static void remove_fdwxact(FdwXact fdwxact);
+static FdwXact get_fdwxact_to_resolve(Oid dbid, TransactionId xid);
+static FdwXactRslvState *create_fdwxact_state(void);
+
+#ifdef USE_ASSERT_CHECKING
+static bool FdwXactQueueIsOrderedByTimestamp(void);
+#endif
+
+/*
+ * Remember an accessed foreign transaction. Both RegisterFdwXactByRelId and
+ * RegisterFdwXactByServerId are called by the executor during initialization.
+ */
+void
+RegisterFdwXactByRelId(Oid relid, bool modified)
+{
+    Relation        rel;
+    Oid                serverid;
+    Oid                userid;
+
+    rel = relation_open(relid, NoLock);
+    serverid = GetForeignServerIdByRelId(relid);
+    userid = rel->rd_rel->relowner ? rel->rd_rel->relowner : GetUserId();
+    relation_close(rel, NoLock);
+
+    register_fdwxact(serverid, userid, modified);
+}
+
+void
+RegisterFdwXactByServerId(Oid serverid, bool modified)
+{
+    register_fdwxact(serverid, GetUserId(), modified);
+}
+
+/*
+ * Register the foreign transaction identified by the given arguments as
+ * a participant of the transaction.
+ *
+ * The foreign transaction is identified by the given server id and user id.
+ * Registered foreign transactions are managed by the global transaction
+ * manager until the end of the transaction.
+ */
+static void
+register_fdwxact(Oid serverid, Oid userid, bool modified)
+{
+    FdwXactParticipant    *fdw_part;
+    ForeignServer         *foreign_server;
+    UserMapping            *user_mapping;
+    MemoryContext        old_ctx;
+    FdwRoutine            *routine;
+    ListCell               *lc;
+
+    foreach(lc, FdwXactParticipants)
+    {
+        FdwXactParticipant    *fdw_part = (FdwXactParticipant *) lfirst(lc);
+
+        if (fdw_part->server->serverid == serverid &&
+            fdw_part->usermapping->userid == userid)
+        {
+            /* The foreign server is already registered, return */
+            fdw_part->modified |= modified;
+            return;
+        }
+    }
+
+    /*
+     * The participant's information is also needed at the end of a transaction,
+     * when system caches are not available. Save it in TopTransactionContext
+     * so that it can live until the end of the transaction.
+     */
+    old_ctx = MemoryContextSwitchTo(TopTransactionContext);
+    routine = GetFdwRoutineByServerId(serverid);
+
+    /*
+     * Don't register foreign server if it doesn't provide both commit and
+     * rollback transaction management callbacks.
+     */
+    if (!routine->CommitForeignTransaction ||
+        !routine->RollbackForeignTransaction)
+    {
+        MyXactFlags |= XACT_FLAGS_FDWNOPREPARE;
+        pfree(routine);
+        return;
+    }
+
+    /*
+     * Remember we touched the foreign server that is not capable of two-phase
+     * commit.
+     */
+    if (!routine->PrepareForeignTransaction)
+        MyXactFlags |= XACT_FLAGS_FDWNOPREPARE;
+
+    foreign_server = GetForeignServer(serverid);
+    user_mapping = GetUserMapping(userid, serverid);
+
+
+    fdw_part = (FdwXactParticipant *) palloc(sizeof(FdwXactParticipant));
+
+    fdw_part->fdwxact_id = NULL;
+    fdw_part->server = foreign_server;
+    fdw_part->usermapping = user_mapping;
+    fdw_part->fdwxact = NULL;
+    fdw_part->modified = modified;
+    fdw_part->prepare_foreign_xact_fn = routine->PrepareForeignTransaction;
+    fdw_part->commit_foreign_xact_fn = routine->CommitForeignTransaction;
+    fdw_part->rollback_foreign_xact_fn = routine->RollbackForeignTransaction;
+    fdw_part->get_prepareid_fn = routine->GetPrepareId;
+
+    /* Add to the participants list */
+    FdwXactParticipants = lappend(FdwXactParticipants, fdw_part);
+
+    /* Revert back the context */
+    MemoryContextSwitchTo(old_ctx);
+}
+
+/*
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+Size
+FdwXactShmemSize(void)
+{
+    Size        size;
+
+    /* Size for foreign transaction information array */
+    size = offsetof(FdwXactCtlData, fdwxacts);
+    size = add_size(size, mul_size(max_prepared_foreign_xacts,
+                                   sizeof(FdwXact)));
+    size = MAXALIGN(size);
+    size = add_size(size, mul_size(max_prepared_foreign_xacts,
+                                   sizeof(FdwXactData)));
+
+    return size;
+}
+
+/*
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in the definition of the
+ * FdwXactCtlData structure.
+ */
+void
+FdwXactShmemInit(void)
+{
+    bool        found;
+
+    if (!fdwXactExitRegistered)
+    {
+        before_shmem_exit(AtProcExit_FdwXact, 0);
+        fdwXactExitRegistered = true;
+    }
+
+    FdwXactCtl = ShmemInitStruct("Foreign transactions table",
+                                 FdwXactShmemSize(),
+                                 &found);
+    if (!IsUnderPostmaster)
+    {
+        FdwXact        fdwxacts;
+        int            cnt;
+
+        Assert(!found);
+        FdwXactCtl->free_fdwxacts = NULL;
+        FdwXactCtl->num_fdwxacts = 0;
+
+        /* Initialize the linked list of free FDW transactions */
+        fdwxacts = (FdwXact)
+            ((char *) FdwXactCtl +
+             MAXALIGN(offsetof(FdwXactCtlData, fdwxacts) +
+                      sizeof(FdwXact) * max_prepared_foreign_xacts));
+        for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+        {
+            fdwxacts[cnt].status = FDWXACT_STATUS_INVALID;
+            fdwxacts[cnt].fdwxact_free_next = FdwXactCtl->free_fdwxacts;
+            FdwXactCtl->free_fdwxacts = &fdwxacts[cnt];
+            SpinLockInit(&(fdwxacts[cnt].mutex));
+        }
+    }
+    else
+    {
+        Assert(FdwXactCtl);
+        Assert(found);
+    }
+}
+
+/*
+ * Prepare all foreign transactions if foreign two-phase commit is required.
+ * The behavior depends on the value of foreign_twophase_commit: with
+ * 'required' we strictly require the FDWs of all modified foreign servers
+ * to support the two-phase commit protocol and ask them to prepare their
+ * foreign transactions; with 'prefer' we ask only the foreign servers
+ * capable of two-phase commit to prepare their foreign transactions and ask
+ * the other servers to commit; with 'disabled' we ask all foreign servers
+ * to commit their foreign transactions in one phase. If we fail to commit
+ * any of them, we switch to aborting.
+ *
+ * Note that non-modified foreign transactions can always be committed
+ * without preparation.
+ */
+void
+PreCommit_FdwXacts(void)
+{
+    bool        need_twophase_commit;
+    ListCell    *lc = NULL;
+
+    /* If there are no foreign servers involved, we have no business here */
+    if (FdwXactParticipants == NIL)
+        return;
+
+    /*
+     * In 'required' mode, all modified foreign servers must be capable of
+     * the two-phase commit protocol.
+     */
+    if (foreign_twophase_commit == FOREIGN_TWOPHASE_COMMIT_REQUIRED &&
+        (MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("cannot COMMIT a distributed transaction that has operated on a foreign server that doesn't support atomic commit")));
+
+    /*
+     * Check if we need to use foreign twophase commit. It's always false
+     * if foreign twophase commit is disabled.
+     */
+    need_twophase_commit = is_foreign_twophase_commit_required();
+
+    /*
+     * First, try to commit foreign transactions in one phase where possible.
+     */
+    foreach(lc, FdwXactParticipants)
+    {
+        FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
+        bool    commit = false;
+
+        /* Can commit in one-phase if two-phase commit is not required */
+        if (!need_twophase_commit)
+            commit = true;
+
+        /* A non-modified foreign transaction can always be committed in one phase */
+        if (!fdw_part->modified)
+            commit = true;
+
+        /*
+         * In the 'prefer' case, transactions on servers that are not capable
+         * of two-phase commit can be committed in one phase.
+         */
+        if (foreign_twophase_commit == FOREIGN_TWOPHASE_COMMIT_PREFER &&
+            !IsSeverCapableOfTwophaseCommit(fdw_part))
+            commit = true;
+
+        if (commit)
+        {
+            /* Commit the foreign transaction in one-phase */
+            FdwXactOnePhaseEndForeignTransaction(fdw_part, true);
+
+            /* Delete it from the participant list */
+            FdwXactParticipants = foreach_delete_current(FdwXactParticipants,
+                                                         lc);
+            continue;
+        }
+    }
+
+    /* All done if we committed all foreign transactions */
+    if (FdwXactParticipants == NIL)
+        return;
+
+    /*
+     * Second, if only one transaction remains in the participant list and
+     * we didn't modify any local data, we can commit it without
+     * preparation.
+     */
+    if (list_length(FdwXactParticipants) == 1 &&
+        (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) == 0)
+    {
+        /* Commit the foreign transaction in one-phase */
+        FdwXactOnePhaseEndForeignTransaction(linitial(FdwXactParticipants),
+                                             true);
+
+        /* All foreign transactions have been committed */
+        list_free(FdwXactParticipants);
+        return;
+    }
+
+    /*
+     * Finally, prepare foreign transactions. Note that we keep
+     * FdwXactParticipants until the end of transaction.
+     */
+    FdwXactPrepareForeignTransactions();
+}
+
+/*
+ * Insert FdwXact entries and prepare foreign transactions. Before inserting
+ * an FdwXact entry we call the GetPrepareId callback to get a transaction
+ * identifier from the FDW.
+ *
+ * We can still switch to rollback here. If any error occurs, we roll back
+ * non-prepared foreign transactions and leave the others to the resolver.
+ */
+static void
+FdwXactPrepareForeignTransactions(void)
+{
+    ListCell        *lcell;
+    TransactionId    xid;
+
+    if (FdwXactParticipants == NIL)
+        return;
+
+    /* Parameter check */
+    if (max_prepared_foreign_xacts == 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("prepared foreign transactions are disabled"),
+                 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+    if (max_foreign_xact_resolvers == 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("prepared foreign transactions are disabled"),
+                 errhint("Set max_foreign_transaction_resolvers to a nonzero value.")));
+
+    xid = GetTopTransactionId();
+
+    /* Loop over the foreign connections */
+    foreach(lcell, FdwXactParticipants)
+    {
+        FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lcell);
+        FdwXactRslvState     *state;
+        FdwXact        fdwxact;
+
+        fdw_part->fdwxact_id = get_fdwxact_identifier(fdw_part, xid);
+
+        Assert(fdw_part->fdwxact_id);
+
+        /*
+         * Insert the foreign transaction entry with the FDWXACT_STATUS_PREPARING
+         * status. Registration persists this information to disk and to WAL
+         * (which also relays it to standbys). Thus, in case we lose
+         * connectivity to the foreign server or crash ourselves, we will
+         * remember that we might have a prepared transaction on the foreign
+         * server and try to resolve it when connectivity is restored or
+         * after crash recovery.
+         *
+         * If we prepare the transaction on the foreign server before persisting
+         * the information to the disk and crash in-between these two steps,
+         * we will forget that we prepared the transaction on the foreign server
+         * and will not be able to resolve it after the crash. Hence persist
+         * first then prepare.
+         */
+        fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
+
+        state = create_fdwxact_state();
+        state->server = fdw_part->server;
+        state->usermapping = fdw_part->usermapping;
+        state->fdwxact_id = pstrdup(fdw_part->fdwxact_id);
+
+        /* Update the status */
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+        Assert(fdwxact->status == FDWXACT_STATUS_INITIAL);
+        fdwxact->status = FDWXACT_STATUS_PREPARING;
+        LWLockRelease(FdwXactLock);
+
+        /*
+         * Prepare the foreign transaction.
+         *
+         * Between FdwXactInsertFdwXactEntry call till this backend hears
+         * acknowledge from foreign server, the backend may abort the local
+         * transaction (say, because of a signal).
+         *
+         * During abort processing, we might try to resolve a never-prepared
+         * transaction, and get an error. This is fine as long as the FDW
+         * provides us unique prepared transaction identifiers.
+         */
+        PG_TRY();
+        {
+            fdw_part->prepare_foreign_xact_fn(state);
+        }
+        PG_CATCH();
+        {
+            /* failed, back to the initial state */
+            LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+            fdwxact->status = FDWXACT_STATUS_INITIAL;
+            LWLockRelease(FdwXactLock);
+
+            PG_RE_THROW();
+        }
+        PG_END_TRY();
+
+        /* succeeded, update status */
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+        fdwxact->status = FDWXACT_STATUS_PREPARED;
+        LWLockRelease(FdwXactLock);
+    }
+}
+
+/*
+ * One-phase commit or rollback the given foreign transaction participant.
+ */
+static void
+FdwXactOnePhaseEndForeignTransaction(FdwXactParticipant *fdw_part,
+                                     bool for_commit)
+{
+    FdwXactRslvState *state;
+
+    Assert(fdw_part->commit_foreign_xact_fn);
+    Assert(fdw_part->rollback_foreign_xact_fn);
+
+    state = create_fdwxact_state();
+    state->server = fdw_part->server;
+    state->usermapping = fdw_part->usermapping;
+    state->flags = FDWXACT_FLAG_ONEPHASE;
+
+    /*
+     * Commit or rollback the foreign transaction in one phase. Since we
+     * didn't insert an FdwXact entry for this transaction, we don't need to
+     * care about failures; on failure we switch to rollback.
+     */
+    if (for_commit)
+        fdw_part->commit_foreign_xact_fn(state);
+    else
+        fdw_part->rollback_foreign_xact_fn(state);
+}
+
+/*
+ * This function is used to create a new foreign transaction entry before an
+ * FDW prepares and commits/rolls back. The function adds the entry to WAL,
+ * and it will be persisted to disk under the pg_fdwxact directory at
+ * checkpoint time.
+ */
+static FdwXact
+FdwXactInsertFdwXactEntry(TransactionId xid, FdwXactParticipant *fdw_part)
+{
+    FdwXact                fdwxact;
+    FdwXactOnDiskData    *fdwxact_file_data;
+    MemoryContext        old_context;
+    int                    data_len;
+
+    old_context = MemoryContextSwitchTo(TopTransactionContext);
+
+    /*
+     * Enter the foreign transaction in the shared memory structure.
+     */
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+    fdwxact = insert_fdwxact(MyDatabaseId, xid, fdw_part->server->serverid,
+                            fdw_part->usermapping->userid,
+                            fdw_part->usermapping->umid, fdw_part->fdwxact_id);
+    fdwxact->status = FDWXACT_STATUS_INITIAL;
+    fdwxact->held_by = MyBackendId;
+    LWLockRelease(FdwXactLock);
+
+    fdw_part->fdwxact = fdwxact;
+    MemoryContextSwitchTo(old_context);
+
+    /*
+     * Prepare to write the entry to a file. Also add xlog entry. The contents
+     * of the xlog record are the same as what is written to the file.
+     */
+    data_len = offsetof(FdwXactOnDiskData, fdwxact_id);
+    data_len = data_len + strlen(fdw_part->fdwxact_id) + 1;
+    data_len = MAXALIGN(data_len);
+    fdwxact_file_data = (FdwXactOnDiskData *) palloc0(data_len);
+    fdwxact_file_data->dbid = MyDatabaseId;
+    fdwxact_file_data->local_xid = xid;
+    fdwxact_file_data->serverid = fdw_part->server->serverid;
+    fdwxact_file_data->userid = fdw_part->usermapping->userid;
+    fdwxact_file_data->umid = fdw_part->usermapping->umid;
+    memcpy(fdwxact_file_data->fdwxact_id, fdw_part->fdwxact_id,
+           strlen(fdw_part->fdwxact_id) + 1);
+
+    /* See note in RecordTransactionCommit */
+    MyPgXact->delayChkpt = true;
+
+    START_CRIT_SECTION();
+
+    /* Add the entry in the xlog and save LSN for checkpointer */
+    XLogBeginInsert();
+    XLogRegisterData((char *) fdwxact_file_data, data_len);
+    fdwxact->insert_end_lsn = XLogInsert(RM_FDWXACT_ID, XLOG_FDWXACT_INSERT);
+    XLogFlush(fdwxact->insert_end_lsn);
+
+    /* If we crash now, we have prepared: WAL replay will fix things */
+
+    /* Store record's start location to read that later on CheckPoint */
+    fdwxact->insert_start_lsn = ProcLastRecPtr;
+
+    /* File is written completely, checkpoint can proceed with syncing */
+    fdwxact->valid = true;
+
+    /* Checkpoint can process now */
+    MyPgXact->delayChkpt = false;
+
+    END_CRIT_SECTION();
+
+    pfree(fdwxact_file_data);
+    return fdwxact;
+}
+
+/*
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, into the shared memory array. Caller
+ * must hold FdwXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FdwXact
+insert_fdwxact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+                Oid umid, char *fdwxact_id)
+{
+    int i;
+    FdwXact fdwxact;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+    /* Check for duplicated foreign transaction entry */
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        fdwxact = FdwXactCtl->fdwxacts[i];
+        if (fdwxact->dbid == dbid &&
+            fdwxact->local_xid == xid &&
+            fdwxact->serverid == serverid &&
+            fdwxact->userid == userid)
+            ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
+                            errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
+                                   xid, serverid, userid)));
+    }
+
+    /*
+     * Get the next free foreign transaction entry. Raise an error if there are
+     * none left.
+     */
+    if (!FdwXactCtl->free_fdwxacts)
+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_OUT_OF_MEMORY),
+                 errmsg("maximum number of foreign transactions reached"),
+                 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+                         max_prepared_foreign_xacts)));
+    }
+    fdwxact = FdwXactCtl->free_fdwxacts;
+    FdwXactCtl->free_fdwxacts = fdwxact->fdwxact_free_next;
+
+    /* Insert the entry to shared memory array */
+    Assert(FdwXactCtl->num_fdwxacts < max_prepared_foreign_xacts);
+    FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts++] = fdwxact;
+
+    fdwxact->held_by = InvalidBackendId;
+    fdwxact->dbid = dbid;
+    fdwxact->local_xid = xid;
+    fdwxact->serverid = serverid;
+    fdwxact->userid = userid;
+    fdwxact->umid = umid;
+    fdwxact->insert_start_lsn = InvalidXLogRecPtr;
+    fdwxact->insert_end_lsn = InvalidXLogRecPtr;
+    fdwxact->valid = false;
+    fdwxact->ondisk = false;
+    fdwxact->inredo = false;
+    fdwxact->indoubt = false;
+    memcpy(fdwxact->fdwxact_id, fdwxact_id, strlen(fdwxact_id) + 1);
+
+    return fdwxact;
+}
+
+/*
+ * Remove the foreign prepared transaction entry from shared memory.
+ * Caller must hold FdwXactLock in exclusive mode.
+ */
+static void
+remove_fdwxact(FdwXact fdwxact)
+{
+    int i;
+
+    Assert(fdwxact != NULL);
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+    if (FdwXactIsBeingResolved(fdwxact))
+        elog(ERROR, "cannot remove fdwxact entry that is being resolved");
+
+    /* Search the slot where this entry resided */
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        if (FdwXactCtl->fdwxacts[i] == fdwxact)
+            break;
+    }
+
+    /* We did not find the given entry in the array */
+    if (i >= FdwXactCtl->num_fdwxacts)
+        ereport(ERROR,
+                (errmsg("could not remove a foreign transaction entry"),
+                 errdetail("failed to find entry for xid %u, foreign server %u, and user %u",
+                           fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));
+
+    elog(DEBUG2, "remove fdwxact entry id %s, xid %u db %u user %u",
+         fdwxact->fdwxact_id, fdwxact->local_xid, fdwxact->dbid,
+         fdwxact->userid);
+
+    /* Remove the entry from active array */
+    FdwXactCtl->num_fdwxacts--;
+    FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
+
+    /* Put it back into free list */
+    fdwxact->fdwxact_free_next = FdwXactCtl->free_fdwxacts;
+    FdwXactCtl->free_fdwxacts = fdwxact;
+
+    /* Reset information */
+    fdwxact->status = FDWXACT_STATUS_INVALID;
+    fdwxact->held_by = InvalidBackendId;
+    fdwxact->indoubt = false;
+
+    if (!RecoveryInProgress())
+    {
+        xl_fdwxact_remove record;
+        XLogRecPtr    recptr;
+
+        /* Fill up the log record before releasing the entry */
+        record.serverid = fdwxact->serverid;
+        record.dbid = fdwxact->dbid;
+        record.xid = fdwxact->local_xid;
+        record.userid = fdwxact->userid;
+
+        /*
+         * Now writing FdwXact state data to WAL. We have to set delayChkpt
+         * here, otherwise a checkpoint starting immediately after the
+         * WAL record is inserted could complete without fsync'ing our
+         * state file.  (This is essentially the same kind of race condition
+         * as the COMMIT-to-clog-write case that RecordTransactionCommit
+         * uses delayChkpt for; see notes there.)
+         */
+        START_CRIT_SECTION();
+
+        MyPgXact->delayChkpt = true;
+
+        /*
+         * Log that we are removing the foreign transaction entry and
+         * remove the file from the disk as well.
+         */
+        XLogBeginInsert();
+        XLogRegisterData((char *) &record, sizeof(xl_fdwxact_remove));
+        recptr = XLogInsert(RM_FDWXACT_ID, XLOG_FDWXACT_REMOVE);
+        XLogFlush(recptr);
+
+        /*
+         * Now we can mark ourselves as out of the commit critical section: a
+         * checkpoint starting after this will certainly see the fdwxact as a
+         * candidate for fsyncing.
+         */
+        MyPgXact->delayChkpt = false;
+
+        END_CRIT_SECTION();
+    }
+}
+
+/*
+ * Return true, and set ForeignTwophaseCommitIsRequired, if the current
+ * transaction modified data on two or more participants, counting the
+ * foreign servers in FdwXactParticipants and the local server itself.
+ */
+static bool
+is_foreign_twophase_commit_required(void)
+{
+    ListCell*    lc;
+    int            nserverswritten = 0;
+
+    if (!IsForeignTwophaseCommitRequested())
+        return false;
+
+    foreach(lc, FdwXactParticipants)
+    {
+        FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
+
+        if (fdw_part->modified)
+            nserverswritten++;
+    }
+
+    if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
+        ++nserverswritten;
+
+    /*
+     * Atomic commit is required if we modified data on two or more
+     * participants.
+     */
+    if (nserverswritten <= 1)
+        return false;
+
+    ForeignTwophaseCommitIsRequired = true;
+    return true;
+}
+
+bool
+FdwXactIsForeignTwophaseCommitRequired(void)
+{
+    return ForeignTwophaseCommitIsRequired;
+}
+
+/*
+ * Compute the oldest xmin across all unresolved foreign transactions
+ * and store it in the ProcArray.
+ */
+static void
+FdwXactComputeRequiredXmin(void)
+{
+    int    i;
+    TransactionId agg_xmin = InvalidTransactionId;
+
+    Assert(FdwXactCtl != NULL);
+
+    LWLockAcquire(FdwXactLock, LW_SHARED);
+
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        FdwXact fdwxact = FdwXactCtl->fdwxacts[i];
+
+        if (!fdwxact->valid)
+            continue;
+
+        Assert(TransactionIdIsValid(fdwxact->local_xid));
+
+        if (!TransactionIdIsValid(agg_xmin) ||
+            TransactionIdPrecedes(fdwxact->local_xid, agg_xmin))
+            agg_xmin = fdwxact->local_xid;
+    }
+
+    LWLockRelease(FdwXactLock);
+
+    ProcArraySetFdwXactUnresolvedXmin(agg_xmin);
+}
+
+/*
+ * Mark my foreign transaction participants as in-doubt and clear
+ * the FdwXactParticipants list.
+ *
+ * If we leave any foreign transactions, update the oldest xmin of unresolved
+ * transactions so that the local transaction IDs of in-doubt transactions
+ * are not truncated.
+ */
+static void
+ForgetAllFdwXactParticipants(void)
+{
+    ListCell *cell;
+    int        n_lefts = 0;
+
+    if (FdwXactParticipants == NIL)
+        return;
+
+    foreach(cell, FdwXactParticipants)
+    {
+        FdwXactParticipant    *fdw_part = (FdwXactParticipant *) lfirst(cell);
+        FdwXact fdwxact = fdw_part->fdwxact;
+
+        /* Nothing to do if we didn't register an FdwXact entry yet */
+        if (!fdw_part->fdwxact)
+            continue;
+
+        /*
+         * There is a race condition: the resolver process could remove an
+         * FdwXact entry in FdwXactParticipants and another backend could
+         * reuse it before we forget it. So we need to check that the entry
+         * is still associated with this transaction.
+         */
+        SpinLockAcquire(&fdwxact->mutex);
+        if (fdwxact->held_by == MyBackendId)
+        {
+            fdwxact->held_by = InvalidBackendId;
+            fdwxact->indoubt = true;
+            n_lefts++;
+        }
+        SpinLockRelease(&fdwxact->mutex);
+    }
+
+    /*
+     * If we left any FdwXact entries, update the oldest local transaction of
+     * unresolved distributed transactions and hand the entries over to the
+     * foreign transaction resolver.
+     */
+    if (n_lefts > 0)
+    {
+        elog(DEBUG1, "left %d foreign transactions in in-doubt status", n_lefts);
+        FdwXactComputeRequiredXmin();
+    }
+
+    FdwXactParticipants = NIL;
+}
+
+/*
+ * When the process exits, forget all the entries.
+ */
+static void
+AtProcExit_FdwXact(int code, Datum arg)
+{
+    ForgetAllFdwXactParticipants();
+}
+
+void
+FdwXactCleanupAtProcExit(void)
+{
+    if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+    {
+        LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+        SHMQueueDelete(&(MyProc->fdwXactLinks));
+        LWLockRelease(FdwXactResolutionLock);
+    }
+}
+
+/*
+ * Wait for the foreign transaction to be resolved.
+ *
+ * Initially backends start in state FDWXACT_NOT_WAITING and then change
+ * that state to FDWXACT_WAITING before adding ourselves to the wait queue.
+ * During FdwXactResolveForeignTransaction an fdwxact resolver changes the
+ * state to FDWXACT_WAIT_COMPLETE once all foreign transactions are resolved.
+ * This backend then resets its state to FDWXACT_NOT_WAITING.
+ * If a resolver fails to resolve the waiting transaction it moves us to
+ * the retry queue.
+ *
+ * This function is inspired by SyncRepWaitForLSN.
+ */
+void
+FdwXactWaitToBeResolved(TransactionId wait_xid, bool is_commit)
+{
+    char        *new_status = NULL;
+    const char    *old_status;
+
+    Assert(FdwXactCtl != NULL);
+    Assert(TransactionIdIsValid(wait_xid));
+    Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+    Assert(MyProc->fdwXactState == FDWXACT_NOT_WAITING);
+
+    /* Quick exit if atomic commit is not requested */
+    if (!IsForeignTwophaseCommitRequested())
+        return;
+
+    /*
+     * Also, exit if the transaction itself has no foreign transaction
+     * participants.
+     */
+    if (FdwXactParticipants == NIL && wait_xid == MyPgXact->xid)
+        return;
+
+    /* Set backend status and enqueue itself to the active queue */
+    LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+    MyProc->fdwXactState = FDWXACT_WAITING;
+    MyProc->fdwXactWaitXid = wait_xid;
+    MyProc->fdwXactNextResolutionTs = GetCurrentTransactionStopTimestamp();
+    FdwXactQueueInsert(MyProc);
+    Assert(FdwXactQueueIsOrderedByTimestamp());
+    LWLockRelease(FdwXactResolutionLock);
+
+    /* Launch a resolver process if not yet, or wake up */
+    FdwXactLaunchOrWakeupResolver();
+
+    /*
+     * Alter ps display to show waiting for foreign transaction
+     * resolution.
+     */
+    if (update_process_title)
+    {
+        int len;
+
+        old_status = get_ps_display(&len);
+        new_status = (char *) palloc(len + 34 + 1);
+        memcpy(new_status, old_status, len);
+        sprintf(new_status + len, " waiting for resolution %u", wait_xid);
+        set_ps_display(new_status, false);
+        new_status[len] = '\0';    /* truncate off "waiting ..." */
+    }
+
+    /* Wait for all foreign transactions to be resolved */
+    for (;;)
+    {
+        /* Must reset the latch before testing state */
+        ResetLatch(MyLatch);
+
+        /*
+         * Acquiring the lock is not needed, the latch ensures proper
+         * barriers. If it looks like we're done, we must really be done,
+         * because once the resolver changes the state to FDWXACT_WAIT_COMPLETE,
+         * it will never update it again, so we can't be seeing a stale value
+         * in that case.
+         */
+        if (MyProc->fdwXactState == FDWXACT_WAIT_COMPLETE)
+            break;
+
+        /*
+         * If a wait for foreign transaction resolution is pending, we can
+         * neither acknowledge the commit nor raise ERROR or FATAL.  The latter
+         * would lead the client to believe that the distributed transaction
+         * aborted, which is not true: it's already committed locally. The
+         * former is no good either: the client has requested committing a
+         * distributed transaction, and is entitled to assume that an
+         * acknowledged commit is also a commit on all foreign servers, which
+         * might not be true. So in this case we issue a WARNING (which some
+         * clients may be able to interpret) and shut off further output. We
+         * do NOT reset ProcDiePending, so that the process will die after
+         * the commit is cleaned up.
+         * cleaned up.
+         */
+        if (ProcDiePending)
+        {
+            ereport(WARNING,
+                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                     errmsg("canceling the wait for resolving foreign transaction and terminating connection due to administrator command"),
+                     errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+            whereToSendOutput = DestNone;
+            FdwXactCancelWait();
+            break;
+        }
+
+        /*
+         * If a query cancel interrupt arrives we just terminate the wait with
+         * a suitable warning. The foreign transactions can be orphaned, but
+         * the foreign xact resolver can pick them up and try to resolve them
+         * later.
+         */
+        if (QueryCancelPending)
+        {
+            QueryCancelPending = false;
+            ereport(WARNING,
+                    (errmsg("canceling wait for resolving foreign transaction due to user request"),
+                     errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+            FdwXactCancelWait();
+            break;
+        }
+
+        /*
+         * If the postmaster dies, we'll probably never get an
+         * acknowledgement, because all the resolver processes will exit. So
+         * just bail out.
+         */
+        if (!PostmasterIsAlive())
+        {
+            ProcDiePending = true;
+            whereToSendOutput = DestNone;
+            FdwXactCancelWait();
+            break;
+        }
+
+        /*
+         * Wait on latch.  Any condition that should wake us up will set the
+         * latch, so no need for timeout.
+         */
+        WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+                  WAIT_EVENT_FDWXACT_RESOLUTION);
+    }
+
+    pg_read_barrier();
+
+    Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+    MyProc->fdwXactState = FDWXACT_NOT_WAITING;
+
+    if (new_status)
+    {
+        set_ps_display(new_status, false);
+        pfree(new_status);
+    }
+}
+
+/*
+ * Return true if there is at least one backend in the wait queue connected
+ * to the given database. The caller must hold FdwXactResolutionLock.
+ */
+bool
+FdwXactWaiterExists(Oid dbid)
+{
+    PGPROC *proc;
+
+    Assert(LWLockHeldByMeInMode(FdwXactResolutionLock, LW_SHARED));
+
+    proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                   &(FdwXactRslvCtl->fdwxact_queue),
+                                   offsetof(PGPROC, fdwXactLinks));
+
+    while (proc)
+    {
+        if (proc->databaseId == dbid)
+            return true;
+
+        proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                       &(proc->fdwXactLinks),
+                                       offsetof(PGPROC, fdwXactLinks));
+    }
+
+    return false;
+}
+
+/*
+ * Insert the waiter into the wait queue in fdwXactNextResolutionTs order.
+ */
+static void
+FdwXactQueueInsert(PGPROC *waiter)
+{
+    PGPROC *proc;
+
+    Assert(LWLockHeldByMeInMode(FdwXactResolutionLock, LW_EXCLUSIVE));
+
+    proc = (PGPROC *) SHMQueuePrev(&(FdwXactRslvCtl->fdwxact_queue),
+                                   &(FdwXactRslvCtl->fdwxact_queue),
+                                   offsetof(PGPROC, fdwXactLinks));
+
+    while (proc)
+    {
+        if (proc->fdwXactNextResolutionTs < waiter->fdwXactNextResolutionTs)
+            break;
+
+        proc = (PGPROC *) SHMQueuePrev(&(FdwXactRslvCtl->fdwxact_queue),
+                                       &(proc->fdwXactLinks),
+                                       offsetof(PGPROC, fdwXactLinks));
+    }
+
+    if (proc)
+        SHMQueueInsertAfter(&(proc->fdwXactLinks), &(waiter->fdwXactLinks));
+    else
+        SHMQueueInsertAfter(&(FdwXactRslvCtl->fdwxact_queue), &(waiter->fdwXactLinks));
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+FdwXactQueueIsOrderedByTimestamp(void)
+{
+    PGPROC *proc;
+    TimestampTz lastTs;
+
+    proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                   &(FdwXactRslvCtl->fdwxact_queue),
+                                   offsetof(PGPROC, fdwXactLinks));
+    lastTs = 0;
+
+    while (proc)
+    {
+
+        if (proc->fdwXactNextResolutionTs < lastTs)
+            return false;
+
+        lastTs = proc->fdwXactNextResolutionTs;
+
+        proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                       &(proc->fdwXactLinks),
+                                       offsetof(PGPROC, fdwXactLinks));
+    }
+
+    return true;
+}
+#endif
+
+/*
+ * Acquire FdwXactResolutionLock and cancel any wait currently in progress.
+ */
+static void
+FdwXactCancelWait(void)
+{
+    LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+    if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+        SHMQueueDelete(&(MyProc->fdwXactLinks));
+    MyProc->fdwXactState = FDWXACT_NOT_WAITING;
+    LWLockRelease(FdwXactResolutionLock);
+}
+
+/*
+ * AtEOXact_FdwXacts
+ */
+void
+AtEOXact_FdwXacts(bool is_commit)
+{
+    ListCell   *lcell;
+
+    if (!is_commit)
+    {
+        foreach (lcell, FdwXactParticipants)
+        {
+            FdwXactParticipant    *fdw_part = lfirst(lcell);
+
+            /*
+             * If the foreign transaction has an FdwXact entry we might have
+             * prepared it. Skip an already-prepared foreign transaction
+             * because it has closed its transaction. But for a foreign
+             * transaction with status == FDWXACT_STATUS_PREPARING we cannot
+             * know whether it was actually prepared, so we call the rollback
+             * API to close its transaction for safety. Any prepared foreign
+             * transaction we might have left will be resolved by the foreign
+             * transaction resolver.
+             */
+            if (fdw_part->fdwxact)
+            {
+                bool is_prepared;
+
+                LWLockAcquire(FdwXactLock, LW_SHARED);
+                is_prepared = fdw_part->fdwxact &&
+                    fdw_part->fdwxact->status == FDWXACT_STATUS_PREPARED;
+                LWLockRelease(FdwXactLock);
+
+                if (is_prepared)
+                    continue;
+            }
+
+            /* One-phase rollback foreign transaction */
+            FdwXactOnePhaseEndForeignTransaction(fdw_part, false);
+        }
+    }
+
+    /*
+     * In commit cases, we have already prepared foreign transactions during
+     * pre-commit phase. And these prepared transactions will be resolved by
+     * the resolver process.
+     */
+
+    ForgetAllFdwXactParticipants();
+    ForeignTwophaseCommitIsRequired = false;
+}
+
+/*
+ * Prepare foreign transactions.
+ *
+ * Note that it's possible that the transaction aborts after we prepared some
+ * of the participants. In that case we switch to rollback and roll back all
+ * foreign transactions.
+ */
+void
+AtPrepare_FdwXacts(void)
+{
+    if (FdwXactParticipants == NIL)
+        return;
+
+    /* Check for an invalid condition */
+    if (!IsForeignTwophaseCommitRequested())
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("cannot PREPARE a distributed transaction when foreign_twophase_commit is 'disabled'")));
+
+    /*
+     * We cannot prepare if any participating foreign server is not capable
+     * of two-phase commit.
+     */
+    if (is_foreign_twophase_commit_required() &&
+        (MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("cannot prepare the transaction because some foreign servers involved in the transaction cannot prepare the transaction")));
 
+
+    /* Prepare transactions on participating foreign servers. */
+    FdwXactPrepareForeignTransactions();
+
+    FdwXactParticipants = NIL;
+}
+
+/*
+ * Return one backend that is connected to my database and is waiting for
+ * resolution.
+ */
+PGPROC *
+FdwXactGetWaiter(TimestampTz *nextResolutionTs_p, TransactionId *waitXid_p)
+{
+    PGPROC *proc;
+
+    LWLockAcquire(FdwXactResolutionLock, LW_SHARED);
+    Assert(FdwXactQueueIsOrderedByTimestamp());
+
+    proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                   &(FdwXactRslvCtl->fdwxact_queue),
+                                   offsetof(PGPROC, fdwXactLinks));
+
+    while (proc)
+    {
+        if (proc->databaseId == MyDatabaseId)
+            break;
+
+        proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+                                       &(proc->fdwXactLinks),
+                                       offsetof(PGPROC, fdwXactLinks));
+    }
+
+    if (proc)
+    {
+        *nextResolutionTs_p = proc->fdwXactNextResolutionTs;
+        *waitXid_p = proc->fdwXactWaitXid;
+    }
+    else
+    {
+        *nextResolutionTs_p = -1;
+        *waitXid_p = InvalidTransactionId;
+    }
+
+    LWLockRelease(FdwXactResolutionLock);
+
+    return proc;
+}
+
+/*
+ * Get one FdwXact entry to resolve.  This function is intended to be used
+ * by a resolver process to get FdwXact entries to resolve, so the search
+ * excludes both in-doubt transactions and in-progress transactions.
+ */
+static FdwXact
+get_fdwxact_to_resolve(Oid dbid, TransactionId xid)
+{
+    List *fdwxacts = NIL;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+    /* Include neither in-doubt transactions nor in-progress transactions */
+    fdwxacts = get_fdwxacts(dbid, xid, InvalidOid, InvalidOid,
+                            false, false, false);
+
+    return fdwxacts == NIL ? NULL : (FdwXact) linitial(fdwxacts);
+}
+
+/*
+ * Resolve one distributed transaction on the given database.  The target
+ * distributed transaction is fetched from the waiting queue and its
+ * transaction participants are fetched from the global array.
+ *
+ * Release the waiter after we have resolved all of the foreign transaction
+ * participants.  On failure, we re-enqueue the waiting backend after
+ * incrementing the next resolution time.
+ */
+void
+FdwXactResolveTransactionAndReleaseWaiter(Oid dbid, TransactionId xid,
+                                          PGPROC *waiter)
+{
+    FdwXact    fdwxact;
+
+    Assert(TransactionIdIsValid(xid));
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+    while ((fdwxact = get_fdwxact_to_resolve(MyDatabaseId, xid)) != NULL)
+    {
+        FdwXactRslvState *state;
+        ForeignServer *server;
+        UserMapping    *usermapping;
+
+        CHECK_FOR_INTERRUPTS();
+
+        server = GetForeignServer(fdwxact->serverid);
+        usermapping = GetUserMapping(fdwxact->userid, fdwxact->serverid);
+
+        state = create_fdwxact_state();
+        SpinLockAcquire(&fdwxact->mutex);
+        state->server = server;
+        state->usermapping = usermapping;
+        state->fdwxact_id = pstrdup(fdwxact->fdwxact_id);
+        SpinLockRelease(&fdwxact->mutex);
+
+        FdwXactDetermineTransactionFate(fdwxact, false);
+
+        /* Do not hold the lock during foreign transaction resolution */
+        LWLockRelease(FdwXactLock);
+
+        PG_TRY();
+        {
+            /*
+             * Resolve the foreign transaction. When committing or aborting
+             * prepared foreign transactions the previous status is always
+             * FDWXACT_STATUS_PREPARED.
+             */
+            FdwXactResolveForeignTransaction(fdwxact, state,
+                                             FDWXACT_STATUS_PREPARED);
+        }
+        PG_CATCH();
+        {
+            /*
+             * Failed to resolve.  Re-insert the waiter at the tail of the
+             * retry queue if it is still waiting.
+             */
+            LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+            if (waiter->fdwXactState == FDWXACT_WAITING)
+            {
+                SHMQueueDelete(&(waiter->fdwXactLinks));
+                pg_write_barrier();
+                waiter->fdwXactNextResolutionTs =
+                    TimestampTzPlusMilliseconds(waiter->fdwXactNextResolutionTs,
+                                                foreign_xact_resolution_retry_interval);
+                FdwXactQueueInsert(waiter);
+            }
+            LWLockRelease(FdwXactResolutionLock);
+
+            PG_RE_THROW();
+        }
+        PG_END_TRY();
+
+        elog(DEBUG2, "resolved one foreign transaction xid %u, serverid %d, userid %d",
+             fdwxact->local_xid, fdwxact->serverid, fdwxact->userid);
+
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+    }
+
+    LWLockRelease(FdwXactLock);
+
+    LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+
+    /*
+     * Remove the waiter from the shmem queue, if it is not detached yet.
+     * The waiter could already be detached if the user cancelled the wait
+     * before resolution.
+     */
+    if (!SHMQueueIsDetached(&(waiter->fdwXactLinks)))
+    {
+        TransactionId    wait_xid = waiter->fdwXactWaitXid;
+
+        SHMQueueDelete(&(waiter->fdwXactLinks));
+        pg_write_barrier();
+
+        /* Set state to complete */
+        waiter->fdwXactState = FDWXACT_WAIT_COMPLETE;
+
+        /* Wake up the waiter only when we have set state and removed from queue */
+        SetLatch(&(waiter->procLatch));
+
+        elog(DEBUG2, "released the proc with xid %u", wait_xid);
+    }
+    else
+        elog(DEBUG2, "the waiter backend has already been detached");
+
+    LWLockRelease(FdwXactResolutionLock);
+}
+
+/*
+ * Determine whether the given foreign transaction should be committed or
+ * rolled back according to the result of the local transaction.  This
+ * function changes fdwxact->status, so the caller must hold FdwXactLock in
+ * exclusive mode or pass need_lock as true.
+ */
+static void
+FdwXactDetermineTransactionFate(FdwXact fdwxact, bool need_lock)
+{
+    if (need_lock)
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+    /*
+     * The transaction being resolved must either have been cancelled and
+     * marked as in-doubt, or have been prepared.
+     */
+    Assert(fdwxact->indoubt ||
+           fdwxact->status == FDWXACT_STATUS_PREPARED);
+
+    /*
+     * If the local transaction is already committed, commit prepared
+     * foreign transaction.
+     */
+    if (TransactionIdDidCommit(fdwxact->local_xid))
+        fdwxact->status = FDWXACT_STATUS_COMMITTING;
+
+    /*
+     * If the local transaction is already aborted, abort prepared
+     * foreign transactions.
+     */
+    else if (TransactionIdDidAbort(fdwxact->local_xid))
+        fdwxact->status = FDWXACT_STATUS_ABORTING;
+
+    /*
+     * The local transaction is no longer in progress but the foreign
+     * transaction has not been prepared on the foreign server.  This can
+     * happen when the transaction failed after registering this entry but
+     * before actually preparing on the foreign server.  So let's assume it
+     * aborted.
+     */
+    else if (!TransactionIdIsInProgress(fdwxact->local_xid))
+        fdwxact->status = FDWXACT_STATUS_ABORTING;
+
+    /*
+     * The local transaction is in progress and the foreign transaction is
+     * about to be committed or aborted.  This should not happen except for
+     * one case: the local transaction is prepared and this foreign
+     * transaction is being resolved manually using pg_resolve_foreign_xact().
+     * Raise an error anyway, since we cannot determine the fate of this
+     * foreign transaction from a local transaction whose fate is also not
+     * yet determined.
+     */
+    else
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("cannot resolve the foreign transaction associated with in-progress transaction %u on server %u",
+                        fdwxact->local_xid, fdwxact->serverid),
+                 errhint("The local transaction with xid %u might be prepared.",
+                         fdwxact->local_xid)));
+
+    if (need_lock)
+        LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Resolve the foreign transaction using the foreign data wrapper's
+ * transaction callback function.  The 'state' is passed to the callback
+ * function.  The fate of the foreign transaction must already be determined.
+ * If the foreign transaction is resolved successfully, remove the FdwXact
+ * entry from shared memory and also remove the corresponding on-disk file.
+ * On failure, the status of the FdwXact entry is set to 'fallback_status'
+ * before erroring out.
+ */
+static void
+FdwXactResolveForeignTransaction(FdwXact fdwxact, FdwXactRslvState *state,
+                                 FdwXactStatus fallback_status)
+{
+    ForeignServer        *server;
+    ForeignDataWrapper    *fdw;
+    FdwRoutine            *fdw_routine;
+    bool                is_commit;
+
+    Assert(state != NULL);
+    Assert(state->server && state->usermapping && state->fdwxact_id);
+    Assert(fdwxact != NULL);
+
+    LWLockAcquire(FdwXactLock, LW_SHARED);
+
+    if (fdwxact->status != FDWXACT_STATUS_COMMITTING &&
+        fdwxact->status != FDWXACT_STATUS_ABORTING)
+        elog(ERROR, "cannot resolve foreign transaction whose fate is not determined");
+
+    is_commit = fdwxact->status == FDWXACT_STATUS_COMMITTING;
+    LWLockRelease(FdwXactLock);
+
+    server = GetForeignServer(fdwxact->serverid);
+    fdw = GetForeignDataWrapper(server->fdwid);
+    fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+
+    PG_TRY();
+    {
+        if (is_commit)
+            fdw_routine->CommitForeignTransaction(state);
+        else
+            fdw_routine->RollbackForeignTransaction(state);
+    }
+    PG_CATCH();
+    {
+        /* Back to the fallback status */
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+        fdwxact->status = fallback_status;
+        LWLockRelease(FdwXactLock);
+
+        PG_RE_THROW();
+    }
+    PG_END_TRY();
+
+    /* Resolution was a success, remove the entry */
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+    elog(DEBUG1, "successfully %s the foreign transaction with xid %u db %u server %u user %u",
+         is_commit ? "committed" : "rolled back",
+         fdwxact->local_xid, fdwxact->dbid, fdwxact->serverid,
+         fdwxact->userid);
+
+    fdwxact->status = FDWXACT_STATUS_RESOLVED;
+    if (fdwxact->ondisk)
+        RemoveFdwXactFile(fdwxact->dbid, fdwxact->local_xid,
+                          fdwxact->serverid, fdwxact->userid,
+                          true);
+    remove_fdwxact(fdwxact);
+    LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Return palloc'd and initialized FdwXactRslvState.
+ */
+static FdwXactRslvState *
+create_fdwxact_state(void)
+{
+    FdwXactRslvState *state;
+
+    state = palloc(sizeof(FdwXactRslvState));
+    state->server = NULL;
+    state->usermapping = NULL;
+    state->fdwxact_id = NULL;
+    state->flags = 0;
+
+    return state;
+}
+
+/*
+ * Return the one FdwXact entry that matches the given arguments, or NULL if
+ * there is none.  All arguments must be valid values so that the search
+ * identifies exactly one (or no) entry.  Note that this function is intended
+ * to be used for modifying the returned FdwXact entry, so the caller must
+ * hold FdwXactLock in exclusive mode; the search does not include in-progress
+ * FdwXact entries.
+ */
+static FdwXact
+get_one_fdwxact(Oid dbid, TransactionId xid, Oid serverid, Oid userid)
+{
+    List    *fdwxact_list;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+    /* All search conditions must be valid values */
+    Assert(TransactionIdIsValid(xid));
+    Assert(OidIsValid(serverid));
+    Assert(OidIsValid(userid));
+    Assert(OidIsValid(dbid));
+
+    /* Include in-doubt transactions but don't include in-progress ones */
+    fdwxact_list = get_fdwxacts(dbid, xid, serverid, userid,
+                                true, false, false);
+
+    /* Must be one entry since we search it by the unique key */
+    Assert(list_length(fdwxact_list) <= 1);
+
+    /* Could not find entry */
+    if (fdwxact_list == NIL)
+        return NULL;
+
+    return (FdwXact) linitial(fdwxact_list);
+}
+
+/*
+ * Return true if there is at least one prepared foreign transaction
+ * which matches given arguments.
+ */
+bool
+fdwxact_exists(Oid dbid, Oid serverid, Oid userid)
+{
+    List    *fdwxact_list;
+
+    /* Find entries from all FdwXact entries */
+    fdwxact_list = get_fdwxacts(dbid, InvalidTransactionId, serverid,
+                                userid, true, true, true);
+
+    return fdwxact_list != NIL;
+}
+
+/*
+ * Return an array of all prepared foreign transactions for the user-level
+ * function pg_foreign_xacts, and store the number of entries in *num_p.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled in yet.  The caller should filter them out if they are
+ * not wanted.
+ *
+ * The returned array is palloc'd.
+ */
+static FdwXact
+get_all_fdwxacts(int *num_p)
+{
+    List        *all_fdwxacts;
+    ListCell    *lc;
+    FdwXact        fdwxacts;
+    int            num_fdwxacts = 0;
+
+    Assert(num_p != NULL);
+
+    /* Get all entries */
+    all_fdwxacts = get_fdwxacts(InvalidOid, InvalidTransactionId,
+                                InvalidOid, InvalidOid, true,
+                                true, true);
+
+    if (all_fdwxacts == NIL)
+    {
+        *num_p = 0;
+        return NULL;
+    }
+
+    fdwxacts = (FdwXact)
+        palloc(sizeof(FdwXactData) * list_length(all_fdwxacts));
+    *num_p = list_length(all_fdwxacts);
+
+    /* Convert list to array of FdwXact */
+    foreach(lc, all_fdwxacts)
+    {
+        FdwXact fx = (FdwXact) lfirst(lc);
+
+        memcpy(fdwxacts + num_fdwxacts, fx,
+               sizeof(FdwXactData));
+        num_fdwxacts++;
+    }
+
+    list_free(all_fdwxacts);
+
+    return fdwxacts;
+}
+
+/*
+ * Return a list of FdwXact entries that match the given arguments, or NIL if
+ * there is none.  The search condition is defined by the arguments that have
+ * valid values for their respective datatypes.  'include_indoubt' and
+ * 'include_in_progress' control whether the result includes in-doubt
+ * transactions and in-progress transactions, respectively.
+ */
+static List *
+get_fdwxacts(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+             bool include_indoubt, bool include_in_progress, bool need_lock)
+{
+    int i;
+    List    *fdwxact_list = NIL;
+
+    if (need_lock)
+        LWLockAcquire(FdwXactLock, LW_SHARED);
+
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        FdwXact    fdwxact = FdwXactCtl->fdwxacts[i];
+
+        /* dbid */
+        if (OidIsValid(dbid) && fdwxact->dbid != dbid)
+            continue;
+
+        /* xid */
+        if (TransactionIdIsValid(xid) && xid != fdwxact->local_xid)
+            continue;
+
+        /* serverid */
+        if (OidIsValid(serverid) && serverid != fdwxact->serverid)
+            continue;
+
+        /* userid */
+        if (OidIsValid(userid) && fdwxact->userid != userid)
+            continue;
+
+        /* include in-doubt transaction? */
+        if (!include_indoubt && fdwxact->indoubt)
+            continue;
+
+        /* include in-progress transaction? */
+        if (!include_in_progress && FdwXactIsBeingResolved(fdwxact))
+            continue;
+
+        /* Append it if matched */
+        fdwxact_list = lappend(fdwxact_list, fdwxact);
+    }
+
+    if (need_lock)
+        LWLockRelease(FdwXactLock);
+
+    return fdwxact_list;
+}
+
+/* Apply the redo log for a foreign transaction */
+void
+fdwxact_redo(XLogReaderState *record)
+{
+    char       *rec = XLogRecGetData(record);
+    uint8        info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+    if (info == XLOG_FDWXACT_INSERT)
+    {
+        /*
+         * Add fdwxact entry and set start/end lsn of the WAL record
+         * in FdwXact entry.
+         */
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+        FdwXactRedoAdd(XLogRecGetData(record),
+                       record->ReadRecPtr,
+                       record->EndRecPtr);
+        LWLockRelease(FdwXactLock);
+    }
+    else if (info == XLOG_FDWXACT_REMOVE)
+    {
+        xl_fdwxact_remove *xlrec = (xl_fdwxact_remove *) rec;
+
+        /* Delete the FdwXact entry and file if it exists */
+        LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+        FdwXactRedoRemove(xlrec->dbid, xlrec->xid, xlrec->serverid,
+                          xlrec->userid, false);
+        LWLockRelease(FdwXactLock);
+    }
+    else
+        elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+    return;
+}
+
+/*
+ * Return a null-terminated foreign transaction identifier.  If the given
+ * foreign server's FDW provides the getPrepareId callback, we return the
+ * identifier returned from it.  Otherwise we generate a unique identifier
+ * in the form of "fx_<random number>_<xid>_<serverid>_<userid>", whose
+ * length is less than FDWXACT_ID_MAX_LEN.
+ *
+ * The returned string value is used to identify the foreign transaction.
+ * The identifier must not be the same as any other concurrent prepared
+ * transaction identifier.
+ *
+ * To make the foreign transaction id unique, we should ideally use something
+ * like a UUID, which gives unique ids with high probability, but that may be
+ * expensive here, and the UUID extension which provides the function to
+ * generate a UUID is not part of the core code.
+ */
+static char *
+get_fdwxact_identifier(FdwXactParticipant *fdw_part, TransactionId xid)
+{
+    char    *id;
+    int        id_len = 0;
+
+    if (!fdw_part->get_prepareid_fn)
+    {
+        char buf[FDWXACT_ID_MAX_LEN] = {0};
+
+        /*
+         * The FDW doesn't provide the callback function; generate a unique
+         * identifier.
+         */
+        snprintf(buf, FDWXACT_ID_MAX_LEN, "fx_%ld_%u_%d_%d",
+             Abs(random()), xid, fdw_part->server->serverid,
+                 fdw_part->usermapping->userid);
+
+        return pstrdup(buf);
+    }
+
+    /* Get a unique identifier from the callback function */
+    id = fdw_part->get_prepareid_fn(xid, fdw_part->server->serverid,
+                                    fdw_part->usermapping->userid,
+                                    &id_len);
+
+    if (id == NULL)
+        ereport(ERROR,
+                (errcode(ERRCODE_UNDEFINED_OBJECT),
+                 (errmsg("foreign transaction identifier is not provided"))));
+
+    /* Check length of foreign transaction identifier */
+    if (id_len > FDWXACT_ID_MAX_LEN)
+    {
+        id[FDWXACT_ID_MAX_LEN] = '\0';
+        ereport(ERROR,
+                (errcode(ERRCODE_NAME_TOO_LONG),
+                 errmsg("foreign transaction identifier \"%s\" is too long",
+                        id),
+                 errdetail("Foreign transaction identifier must be less than %d characters.",
+                           FDWXACT_ID_MAX_LEN)));
+    }
+
+    id[id_len] = '\0';
+    return pstrdup(id);
+}
+
+/*
+ * We must fsync the foreign transaction state files that are valid or were
+ * generated during redo and have an insert LSN <= the checkpoint's redo
+ * horizon.  The foreign transaction entries, and hence the corresponding
+ * files, are expected to be very short-lived.  By executing this function
+ * at the end, we might have fewer files to fsync, thus reducing some I/O.
+ * This is similar to CheckPointTwoPhase().
+ *
+ * This is deliberately run as late as possible in the checkpoint sequence,
+ * because FdwXacts ordinarily have short lifespans, and so it is quite
+ * possible that FdwXacts that were valid at checkpoint start will no longer
+ * exist if we wait a little bit. With typical checkpoint settings this
+ * will be about 3 minutes for an online checkpoint, so as a result we
+ * expect that there will be no FdwXacts that need to be copied to disk.
+ *
+ * If a FdwXact remains valid across multiple checkpoints, it will already
+ * be on disk so we don't bother to repeat that write.
+ */
+void
+CheckPointFdwXacts(XLogRecPtr redo_horizon)
+{
+    int            cnt;
+    int            serialized_fdwxacts = 0;
+
+    if (max_prepared_foreign_xacts <= 0)
+        return;                        /* nothing to do */
+
+    TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+    /*
+     * We are expecting there to be zero FdwXacts that need to be copied to
+     * disk, so we perform all I/O while holding FdwXactLock for simplicity.
+     * This prevents any new foreign xacts from preparing while this occurs,
+     * which shouldn't be a problem since the presence of long-lived prepared
+     * foreign xacts indicates that the transaction manager isn't active.
+     *
+     * It's also possible to move the I/O out of the lock, but on every error
+     * we would have to check whether somebody committed our transaction in a
+     * different backend.  Let's leave this optimization for the future, if
+     * somebody spots that this place causes a bottleneck.
+     *
+     * Note that it isn't possible for there to be an FdwXact with an
+     * insert_end_lsn set prior to the last checkpoint yet marked invalid,
+     * because of the efforts with delayChkpt.
+     */
+    LWLockAcquire(FdwXactLock, LW_SHARED);
+    for (cnt = 0; cnt < FdwXactCtl->num_fdwxacts; cnt++)
+    {
+        FdwXact        fdwxact = FdwXactCtl->fdwxacts[cnt];
+
+        if ((fdwxact->valid || fdwxact->inredo) &&
+            !fdwxact->ondisk &&
+            fdwxact->insert_end_lsn <= redo_horizon)
+        {
+            char       *buf;
+            int            len;
+
+            XlogReadFdwXactData(fdwxact->insert_start_lsn, &buf, &len);
+            RecreateFdwXactFile(fdwxact->dbid, fdwxact->local_xid,
+                                fdwxact->serverid, fdwxact->userid,
+                                buf, len);
+            fdwxact->ondisk = true;
+            fdwxact->insert_start_lsn = InvalidXLogRecPtr;
+            fdwxact->insert_end_lsn = InvalidXLogRecPtr;
+            pfree(buf);
+            serialized_fdwxacts++;
+        }
+    }
+
+    LWLockRelease(FdwXactLock);
+
+    /*
+     * Unconditionally flush the parent directory to make any information
+     * durable on disk.  FdwXact files could have been removed, and those
+     * removals need to be made persistent, as well as any newly created
+     * files.
+     */
+    fsync_fname(FDWXACTS_DIR, true);
+
+    TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+    if (log_checkpoints && serialized_fdwxacts > 0)
+        ereport(LOG,
+              (errmsg_plural("%u foreign transaction state file was written "
+                             "for long-running prepared transactions",
+                             "%u foreign transaction state files were written "
+                             "for long-running prepared transactions",
+                             serialized_fdwxacts,
+                             serialized_fdwxacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFdwXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+    XLogRecord *record;
+    XLogReaderState *xlogreader;
+    char       *errormsg;
+
+    xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+                                    &read_local_xlog_page, NULL);
+    if (!xlogreader)
+        ereport(ERROR,
+                (errcode(ERRCODE_OUT_OF_MEMORY),
+                 errmsg("out of memory"),
+           errdetail("Failed while allocating an XLog reading processor.")));
+
+    record = XLogReadRecord(xlogreader, lsn, &errormsg);
+    if (record == NULL)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+        errmsg("could not read foreign transaction state from xlog at %X/%X",
+               (uint32) (lsn >> 32),
+               (uint32) lsn)));
+
+    if (XLogRecGetRmid(xlogreader) != RM_FDWXACT_ID ||
+        (XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDWXACT_INSERT)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+                 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+                        (uint32) (lsn >> 32),
+                        (uint32) lsn)));
+
+    if (len != NULL)
+        *len = XLogRecGetDataLen(xlogreader);
+
+    *buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+    memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+    XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay
+ * and during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFdwXactFile(Oid dbid, TransactionId xid, Oid serverid,
+                    Oid userid, void *content, int len)
+{
+    char        path[MAXPGPATH];
+    pg_crc32c    statefile_crc;
+    int            fd;
+
+    /* Recompute CRC */
+    INIT_CRC32C(statefile_crc);
+    COMP_CRC32C(statefile_crc, content, len);
+    FIN_CRC32C(statefile_crc);
+
+    FdwXactFilePath(path, dbid, xid, serverid, userid);
+
+    fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+
+    if (fd < 0)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+        errmsg("could not recreate foreign transaction state file \"%s\": %m",
+               path)));
+
+    /* Write content and CRC */
+    pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_WRITE);
+    if (write(fd, content, len) != len)
+    {
+        /* if write didn't set errno, assume problem is no disk space */
+        if (errno == 0)
+            errno = ENOSPC;
+        ereport(ERROR,
+                (errcode_for_file_access(),
+              errmsg("could not write foreign transaction state file: %m")));
+    }
+    if (write(fd, &statefile_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+    {
+        if (errno == 0)
+            errno = ENOSPC;
+        ereport(ERROR,
+                (errcode_for_file_access(),
+              errmsg("could not write foreign transaction state file: %m")));
+    }
+    pgstat_report_wait_end();
+
+    /*
+     * We must fsync the file because the end-of-replay checkpoint will not do
+     * so, there being no FDWXACT in shared memory yet to tell it to.
+     */
+    pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_SYNC);
+    if (pg_fsync(fd) != 0)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+              errmsg("could not fsync foreign transaction state file: %m")));
+    pgstat_report_wait_end();
+
+    if (CloseTransientFile(fd) != 0)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+                 errmsg("could not close foreign transaction file: %m")));
+}
+
+/*
+ * Given a database id, transaction id, server id and user id, read the
+ * foreign transaction state data either from disk or directly from WAL
+ * via the shmem xlog record pointer, using the provided "insert_start_lsn".
+ */
+static char *
+ProcessFdwXactBuffer(Oid dbid, TransactionId xid, Oid serverid,
+                     Oid userid, XLogRecPtr insert_start_lsn, bool fromdisk)
+{
+    TransactionId    origNextXid =
+        XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+    char    *buf;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+    if (!fromdisk)
+        Assert(!XLogRecPtrIsInvalid(insert_start_lsn));
+
+    /* Reject XID if too new */
+    if (TransactionIdFollowsOrEquals(xid, origNextXid))
+    {
+        if (fromdisk)
+        {
+            ereport(WARNING,
+                    (errmsg("removing future fdwxact state file for xid %u, server %u and user %u",
+                            xid, serverid, userid)));
+            RemoveFdwXactFile(dbid, xid, serverid, userid, true);
+        }
+        else
+        {
+            ereport(WARNING,
+                    (errmsg("removing future fdwxact state from memory for xid %u, server %u and user %u",
+                            xid, serverid, userid)));
+            FdwXactRedoRemove(dbid, xid, serverid, userid, true);
+        }
+        return NULL;
+    }
+
+    if (fromdisk)
+    {
+        /* Read and validate file */
+        buf = ReadFdwXactFile(dbid, xid, serverid, userid);
+    }
+    else
+    {
+        /* Read xlog data */
+        XlogReadFdwXactData(insert_start_lsn, &buf, NULL);
+    }
+
+    return buf;
+}
+
+/*
+ * Read and validate the foreign transaction state file.
+ *
+ * If it looks OK (has a valid CRC), return the palloc'd contents of the
+ * file, issuing an error when finding corrupted data.  Such corruption can
+ * be detected when doing recovery.
+ */
+static char *
+ReadFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, Oid userid)
+{
+    char        path[MAXPGPATH];
+    int            fd;
+    FdwXactOnDiskData *fdwxact_file_data;
+    struct stat stat;
+    uint32        crc_offset;
+    pg_crc32c    calc_crc;
+    pg_crc32c    file_crc;
+    char       *buf;
+    int            r;
+
+    FdwXactFilePath(path, dbid, xid, serverid, userid);
+
+    fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+    if (fd < 0)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+               errmsg("could not open FDW transaction state file \"%s\": %m",
+                      path)));
+
+    /*
+     * Check file length.  We can determine a lower bound pretty easily. We
+     * set an upper bound to avoid palloc() failure on a corrupt file, though
+     * we can't guarantee that we won't get an out of memory error anyway,
+     * even on a valid file.
+     */
+    if (fstat(fd, &stat))
+        ereport(ERROR,
+                (errcode_for_file_access(),
+                 errmsg("could not stat FDW transaction state file \"%s\": %m",
+                        path)));
+
+    if (stat.st_size < (offsetof(FdwXactOnDiskData, fdwxact_id) +
+                        sizeof(pg_crc32c)) ||
+        stat.st_size > MaxAllocSize)
+        ereport(ERROR,
+                (errcode(ERRCODE_DATA_CORRUPTED),
+                 errmsg("invalid size of FDW transaction state file \"%s\"",
+                        path)));
+
+    crc_offset = stat.st_size - sizeof(pg_crc32c);
+    if (crc_offset != MAXALIGN(crc_offset))
+        ereport(ERROR,
+                (errcode(ERRCODE_DATA_CORRUPTED),
+                 errmsg("incorrect alignment of CRC offset for file \"%s\"",
+                        path)));
+
+    /*
+     * Ok, slurp in the file.
+     */
+    buf = (char *) palloc(stat.st_size);
+    fdwxact_file_data = (FdwXactOnDiskData *) buf;
+
+    pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_READ);
+    r = read(fd, buf, stat.st_size);
+    if (r != stat.st_size)
+    {
+        if (r < 0)
+            ereport(ERROR,
+                    (errcode_for_file_access(),
+                     errmsg("could not read file \"%s\": %m", path)));
+        else
+            ereport(ERROR,
+                    (errmsg("could not read file \"%s\": read %d of %zu",
+                            path, r, (Size) stat.st_size)));
+    }
+    pgstat_report_wait_end();
+
+    if (CloseTransientFile(fd))
+        ereport(ERROR,
+                (errcode_for_file_access(),
+                 errmsg("could not close file \"%s\": %m", path)));
+
+    /*
+     * Check the CRC.
+     */
+    INIT_CRC32C(calc_crc);
+    COMP_CRC32C(calc_crc, buf, crc_offset);
+    FIN_CRC32C(calc_crc);
+
+    file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+    if (!EQ_CRC32C(calc_crc, file_crc))
+        ereport(ERROR,
+                (errcode(ERRCODE_DATA_CORRUPTED),
+                 errmsg("calculated CRC checksum does not match value stored in file \"%s\"",
+                        path)));
+
+    /* Check that the contents are what we expect */
+    fdwxact_file_data = (FdwXactOnDiskData *) buf;
+    if (fdwxact_file_data->dbid  != dbid ||
+        fdwxact_file_data->serverid != serverid ||
+        fdwxact_file_data->userid != userid ||
+        fdwxact_file_data->local_xid != xid)
+        ereport(ERROR,
+                (errcode(ERRCODE_DATA_CORRUPTED),
+                 errmsg("invalid foreign transaction state file \"%s\"",
+                        path)));
+
+    return buf;
+}
+
+/*
+ * Scan the shared memory entries of FdwXact and determine the range of valid
+ * XIDs present.  This is run during database startup, after we have completed
+ * reading WAL.  ShmemVariableCache->nextFullXid has been set to one more than
+ * the highest XID for which evidence exists in WAL.
+ *
+ * On corrupted two-phase files, fail immediately.  Keeping broken entries
+ * around and letting replay continue would harm the system, and a new
+ * backup should be rolled in.
+ *
+ * Our other responsibility is to update and return the oldest valid XID
+ * among the distributed transactions.  This is needed to synchronize
+ * pg_subtrans startup properly.
+ */
+TransactionId
+PrescanFdwXacts(TransactionId oldestActiveXid)
+{
+    FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid;
+    TransactionId origNextXid = XidFromFullTransactionId(nextFullXid);
+    TransactionId result = origNextXid;
+    int i;
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        FdwXact fdwxact = FdwXactCtl->fdwxacts[i];
+        char *buf;
+
+        buf = ProcessFdwXactBuffer(fdwxact->dbid, fdwxact->local_xid,
+                                   fdwxact->serverid, fdwxact->userid,
+                                   fdwxact->insert_start_lsn, fdwxact->ondisk);
+
+        if (buf == NULL)
+            continue;
+
+        if (TransactionIdPrecedes(fdwxact->local_xid, result))
+            result = fdwxact->local_xid;
+
+        pfree(buf);
+    }
+    LWLockRelease(FdwXactLock);
+
+    return result;
+}
+
+/*
+ * Scan pg_fdwxact and fill FdwXact depending on the on-disk data.
+ * This is called once at the beginning of recovery, saving any extra
+ * lookups in the future.  FdwXact files that are newer than the
+ * minimum XID horizon are discarded on the way.
+ */
+void
+restoreFdwXactData(void)
+{
+    DIR           *cldir;
+    struct dirent *clde;
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+    cldir = AllocateDir(FDWXACTS_DIR);
+    while ((clde = ReadDir(cldir, FDWXACTS_DIR)) != NULL)
+    {
+        if (strlen(clde->d_name) == FDWXACT_FILE_NAME_LEN &&
+            strspn(clde->d_name, "0123456789ABCDEF_") == FDWXACT_FILE_NAME_LEN)
+        {
+            TransactionId local_xid;
+            Oid            dbid;
+            Oid            serverid;
+            Oid            userid;
+            char        *buf;
+
+            sscanf(clde->d_name, "%08x_%08x_%08x_%08x",
+                   &dbid, &local_xid, &serverid, &userid);
+
+            /* Read fdwxact data from disk */
+            buf = ProcessFdwXactBuffer(dbid, local_xid, serverid, userid,
+                                       InvalidXLogRecPtr, true);
+
+            if (buf == NULL)
+                continue;
+
+            /* Add this entry into the table of foreign transactions */
+            FdwXactRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr);
+        }
+    }
+
+    LWLockRelease(FdwXactLock);
+    FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+                  bool giveWarning)
+{
+    char        path[MAXPGPATH];
+
+    FdwXactFilePath(path, dbid, xid, serverid, userid);
+    if (unlink(path) < 0 && (errno != ENOENT || giveWarning))
+        ereport(WARNING,
+                (errcode_for_file_access(),
+                 errmsg("could not remove foreign transaction state file \"%s\": %m",
+                        path)));
+}
+
+/*
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdwxact entry in shared memory FdwXactData structure.
+ */
+static void
+FdwXactRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
+{
+    FdwXactOnDiskData *fdwxact_data = (FdwXactOnDiskData *) buf;
+    FdwXact fdwxact;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+    Assert(RecoveryInProgress());
+
+    /* Add this entry into the table of foreign transactions */
+    fdwxact = insert_fdwxact(fdwxact_data->dbid, fdwxact_data->local_xid,
+                              fdwxact_data->serverid, fdwxact_data->userid,
+                              fdwxact_data->umid, fdwxact_data->fdwxact_id);
+
+    elog(DEBUG2, "added fdwxact entry in shared memory for foreign transaction, db %u xid %u server %u user %u id %s",
+         fdwxact_data->dbid, fdwxact_data->local_xid,
+         fdwxact_data->serverid, fdwxact_data->userid,
+         fdwxact_data->fdwxact_id);
+
+    /*
+     * Set status as PREPARED and as in-doubt, since we do not know
+     * the xact status right now. Resolver will set it later based on
+     * the status of local transaction that prepared this fdwxact entry.
+     */
+    fdwxact->status = FDWXACT_STATUS_PREPARED;
+    fdwxact->insert_start_lsn = start_lsn;
+    fdwxact->insert_end_lsn = end_lsn;
+    fdwxact->inredo = true;    /* added in redo */
+    fdwxact->indoubt = true;
+    fdwxact->valid = false;
+    fdwxact->ondisk = XLogRecPtrIsInvalid(start_lsn);
+}
+
+/*
+ * Remove the corresponding fdwxact entry from FdwXactCtl.  Also remove
+ * the FdwXact file if the foreign transaction was saved via an earlier
+ * checkpoint.  The FdwXact entry may not exist, e.g. if crash recovery
+ * starts from a point after the entry was added, so that its addition
+ * was never replayed; in that case there is nothing to do.
+ */
+void
+FdwXactRedoRemove(Oid dbid, TransactionId xid, Oid serverid,
+                  Oid userid, bool givewarning)
+{
+    FdwXact    fdwxact;
+
+    Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+    Assert(RecoveryInProgress());
+
+    fdwxact = get_one_fdwxact(dbid, xid, serverid, userid);
+
+    if (fdwxact == NULL)
+        return;
+
+    elog(DEBUG2, "removed fdwxact entry from shared memory for foreign transaction, db %u xid %u server %u user %u id
%s",
+         fdwxact->dbid, fdwxact->local_xid, fdwxact->serverid,
+         fdwxact->userid, fdwxact->fdwxact_id);
+
+    /* Clean up entry and any files we may have left */
+    if (fdwxact->ondisk)
+        RemoveFdwXactFile(fdwxact->dbid, fdwxact->local_xid,
+                          fdwxact->serverid, fdwxact->userid,
+                          givewarning);
+    remove_fdwxact(fdwxact);
+}
+
+/*
+ * Scan the shared memory entries of FdwXact and validate them.
+ *
+ * This is run at the end of recovery, but before we allow backends to write
+ * WAL.
+ */
+void
+RecoverFdwXacts(void)
+{
+    int i;
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        FdwXact fdwxact = FdwXactCtl->fdwxacts[i];
+        char    *buf;
+
+        buf = ProcessFdwXactBuffer(fdwxact->dbid, fdwxact->local_xid,
+                                   fdwxact->serverid, fdwxact->userid,
+                                   fdwxact->insert_start_lsn, fdwxact->ondisk);
+
+        if (buf == NULL)
+            continue;
+
+        ereport(LOG,
+                (errmsg("recovering foreign transaction %u for server %u and user %u from shared memory",
+                        fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));
+
+        /* recovered, so reset the flag for entries generated by redo */
+        fdwxact->inredo = false;
+        fdwxact->valid = true;
+
+        /*
+         * If the foreign transaction is part of a prepared local
+         * transaction, it's not in-doubt.  A future COMMIT/ROLLBACK
+         * PREPARED will determine the fate of this foreign transaction.
+         */
+        if (TwoPhaseExists(fdwxact->local_xid))
+        {
+            ereport(DEBUG2,
+                    (errmsg("clear in-doubt flag from foreign transaction %u, server %u, user %u as the corresponding local prepared transaction was found",
+                            fdwxact->local_xid, fdwxact->serverid,
+                            fdwxact->userid)));
+            fdwxact->indoubt = false;
+        }
+
+        pfree(buf);
+    }
+    LWLockRelease(FdwXactLock);
+}
+
+bool
+check_foreign_twophase_commit(int *newval, void **extra, GucSource source)
+{
+    ForeignTwophaseCommitLevel newForeignTwophaseCommitLevel = *newval;
+
+    /* Parameter check */
+    if (newForeignTwophaseCommitLevel > FOREIGN_TWOPHASE_COMMIT_DISABLED &&
+        (max_prepared_foreign_xacts == 0 || max_foreign_xact_resolvers == 0))
+    {
+        GUC_check_errdetail("Cannot enable \"foreign_twophase_commit\" when "
+                            "\"max_prepared_foreign_transactions\" or "
+                            "\"max_foreign_transaction_resolvers\" is zero.");
+        return false;
+    }
+
+    return true;
+}
+
+/* Built in functions */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+    FdwXact        fdwxacts;
+    int            num_xacts;
+    int            cur_xact;
+}    WorkingStatus;
+
+Datum
+pg_foreign_xacts(PG_FUNCTION_ARGS)
+{
+#define PG_PREPARED_FDWXACTS_COLS    7
+    FuncCallContext *funcctx;
+    WorkingStatus *status;
+    char       *xact_status;
+
+    if (SRF_IS_FIRSTCALL())
+    {
+        TupleDesc    tupdesc;
+        MemoryContext oldcontext;
+        int            num_fdwxacts = 0;
+
+        /* create a function context for cross-call persistence */
+        funcctx = SRF_FIRSTCALL_INIT();
+
+        /*
+         * Switch to memory context appropriate for multiple function calls
+         */
+        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+        /* build tupdesc for result tuples */
+        /* this had better match pg_fdwxacts view in system_views.sql */
+        tupdesc = CreateTemplateTupleDesc(PG_PREPARED_FDWXACTS_COLS);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+                           OIDOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+                           XIDOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+                           OIDOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+                           OIDOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+                           TEXTOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 6, "indoubt",
+                           BOOLOID, -1, 0);
+        TupleDescInitEntry(tupdesc, (AttrNumber) 7, "identifier",
+                           TEXTOID, -1, 0);
+
+        funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+        /*
+         * Collect status information that we will format and send out as a
+         * result set.
+         */
+        status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+        funcctx->user_fctx = (void *) status;
+
+        status->fdwxacts = get_all_fdwxacts(&num_fdwxacts);
+        status->num_xacts = num_fdwxacts;
+        status->cur_xact = 0;
+
+        MemoryContextSwitchTo(oldcontext);
+    }
+
+    funcctx = SRF_PERCALL_SETUP();
+    status = funcctx->user_fctx;
+
+    while (status->cur_xact < status->num_xacts)
+    {
+        FdwXact        fdwxact = &status->fdwxacts[status->cur_xact++];
+        Datum        values[PG_PREPARED_FDWXACTS_COLS];
+        bool        nulls[PG_PREPARED_FDWXACTS_COLS];
+        HeapTuple    tuple;
+        Datum        result;
+
+        if (!fdwxact->valid)
+            continue;
+
+        /*
+         * Form tuple with appropriate data.
+         */
+        MemSet(values, 0, sizeof(values));
+        MemSet(nulls, 0, sizeof(nulls));
+
+        values[0] = ObjectIdGetDatum(fdwxact->dbid);
+        values[1] = TransactionIdGetDatum(fdwxact->local_xid);
+        values[2] = ObjectIdGetDatum(fdwxact->serverid);
+        values[3] = ObjectIdGetDatum(fdwxact->userid);
+
+        switch (fdwxact->status)
+        {
+            case FDWXACT_STATUS_INITIAL:
+                xact_status = "initial";
+                break;
+            case FDWXACT_STATUS_PREPARING:
+                xact_status = "preparing";
+                break;
+            case FDWXACT_STATUS_PREPARED:
+                xact_status = "prepared";
+                break;
+            case FDWXACT_STATUS_COMMITTING:
+                xact_status = "committing";
+                break;
+            case FDWXACT_STATUS_ABORTING:
+                xact_status = "aborting";
+                break;
+            case FDWXACT_STATUS_RESOLVED:
+                xact_status = "resolved";
+                break;
+            default:
+                xact_status = "unknown";
+                break;
+        }
+        values[4] = CStringGetTextDatum(xact_status);
+        values[5] = BoolGetDatum(fdwxact->indoubt);
+        values[6] = PointerGetDatum(cstring_to_text_with_len(fdwxact->fdwxact_id,
+                                                             strlen(fdwxact->fdwxact_id)));
+
+        tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+        result = HeapTupleGetDatum(tuple);
+        SRF_RETURN_NEXT(funcctx, result);
+    }
+
+    SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to resolve a prepared foreign transaction manually.
+ */
+Datum
+pg_resolve_foreign_xact(PG_FUNCTION_ARGS)
+{
+    TransactionId    xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+    Oid                serverid = PG_GETARG_OID(1);
+    Oid                userid = PG_GETARG_OID(2);
+    ForeignServer    *server;
+    UserMapping        *usermapping;
+    FdwXact            fdwxact;
+    FdwXactRslvState    *state;
+    FdwXactStatus        prev_status;
+
+    if (!superuser())
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                 (errmsg("must be superuser to resolve foreign transactions"))));
+
+    server = GetForeignServer(serverid);
+    usermapping = GetUserMapping(userid, serverid);
+    state = create_fdwxact_state();
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+    fdwxact = get_one_fdwxact(MyDatabaseId, xid, serverid, userid);
+
+    if (fdwxact == NULL)
+    {
+        LWLockRelease(FdwXactLock);
+        PG_RETURN_BOOL(false);
+    }
+
+    state->server = server;
+    state->usermapping = usermapping;
+    state->fdwxact_id = pstrdup(fdwxact->fdwxact_id);
+
+    SpinLockAcquire(&fdwxact->mutex);
+    prev_status = fdwxact->status;
+    SpinLockRelease(&fdwxact->mutex);
+
+    FdwXactDetermineTransactionFate(fdwxact, false);
+
+    LWLockRelease(FdwXactLock);
+
+    FdwXactResolveForeignTransaction(fdwxact, state, prev_status);
+
+    PG_RETURN_BOOL(true);
+}
+
+/*
+ * Built-in function to remove a prepared foreign transaction entry without
+ * resolution.  This provides a way to forget about such a prepared
+ * transaction when, for example, the foreign server where it was prepared
+ * is no longer available or the user that prepared it needs to be dropped.
+ */
+Datum
+pg_remove_foreign_xact(PG_FUNCTION_ARGS)
+{
+    TransactionId    xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+    Oid                serverid = PG_GETARG_OID(1);
+    Oid                userid = PG_GETARG_OID(2);
+    FdwXact            fdwxact;
+
+    if (!superuser())
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                 (errmsg("must be superuser to remove foreign transactions"))));
+
+    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+    fdwxact = get_one_fdwxact(MyDatabaseId, xid, serverid, userid);
+
+    if (fdwxact == NULL)
+    {
+        LWLockRelease(FdwXactLock);
+        PG_RETURN_BOOL(false);
+    }
+
+    remove_fdwxact(fdwxact);
+
+    LWLockRelease(FdwXactLock);
+
+    PG_RETURN_BOOL(true);
+}
diff --git a/src/backend/access/fdwxact/launcher.c b/src/backend/access/fdwxact/launcher.c
new file mode 100644
index 0000000000..45fb530916
--- /dev/null
+++ b/src/backend/access/fdwxact/launcher.c
@@ -0,0 +1,644 @@
+/*-------------------------------------------------------------------------
+ *
+ * launcher.c
+ *
+ * The foreign transaction resolver launcher process starts foreign
+ * transaction resolver processes.  The launcher launches a resolver
+ * process when a request arrives from a backend process.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/backend/access/fdwxact/launcher.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "pgstat.h"
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_launcher.h"
+#include "access/fdwxact_resolver.h"
+#include "access/resolver_internal.h"
+#include "commands/dbcommands.h"
+#include "nodes/pg_list.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/proc.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+
+/* max sleep time between cycles (3min) */
+#define DEFAULT_NAPTIME_PER_CYCLE 180000L
+
+static void fdwxact_launcher_onexit(int code, Datum arg);
+static void fdwxact_launcher_sighup(SIGNAL_ARGS);
+static void fdwxact_launch_resolver(Oid dbid);
+static bool fdwxact_relaunch_resolvers(void);
+
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t got_SIGUSR2 = false;
+FdwXactResolver *MyFdwXactResolver = NULL;
+
+/*
+ * Wake up the launcher process to retry resolution.
+ */
+void
+FdwXactLauncherRequestToLaunchForRetry(void)
+{
+    if (FdwXactRslvCtl->launcher_pid != InvalidPid)
+        SetLatch(FdwXactRslvCtl->launcher_latch);
+}
+
+/*
+ * Wake up the launcher process to request launching new resolvers
+ * immediately.
+ */
+void
+FdwXactLauncherRequestToLaunch(void)
+{
+    if (FdwXactRslvCtl->launcher_pid != InvalidPid)
+        kill(FdwXactRslvCtl->launcher_pid, SIGUSR2);
+}
+
+/* Report shared memory space needed by FdwXactRslvShmemInit */
+Size
+FdwXactRslvShmemSize(void)
+{
+    Size        size = 0;
+
+    size = add_size(size, SizeOfFdwXactRslvCtlData);
+    size = add_size(size, mul_size(max_foreign_xact_resolvers,
+                                   sizeof(FdwXactResolver)));
+
+    return size;
+}
+
+/*
+ * Allocate and initialize foreign transaction resolver shared
+ * memory.
+ */
+void
+FdwXactRslvShmemInit(void)
+{
+    bool found;
+
+    FdwXactRslvCtl = ShmemInitStruct("Foreign transactions resolvers",
+                                     FdwXactRslvShmemSize(),
+                                     &found);
+
+    if (!IsUnderPostmaster)
+    {
+        int    slot;
+
+        /* First time through, so initialize */
+        MemSet(FdwXactRslvCtl, 0, FdwXactRslvShmemSize());
+
+        SHMQueueInit(&(FdwXactRslvCtl->fdwxact_queue));
+
+        for (slot = 0; slot < max_foreign_xact_resolvers; slot++)
+        {
+            FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[slot];
+
+            resolver->pid = InvalidPid;
+            resolver->dbid = InvalidOid;
+            resolver->in_use = false;
+            resolver->last_resolved_time = 0;
+            resolver->latch = NULL;
+            SpinLockInit(&(resolver->mutex));
+        }
+    }
+}
+
+/*
+ * Cleanup function for fdwxact launcher
+ *
+ * Called on fdwxact launcher exit.
+ */
+static void
+fdwxact_launcher_onexit(int code, Datum arg)
+{
+    FdwXactRslvCtl->launcher_pid = InvalidPid;
+}
+
+/* SIGHUP: set flag to reload configuration at next convenient time */
+static void
+fdwxact_launcher_sighup(SIGNAL_ARGS)
+{
+    int    save_errno = errno;
+
+    got_SIGHUP = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
+/* SIGUSR2: set flag to launch new resolver process immediately */
+static void
+fdwxact_launcher_sigusr2(SIGNAL_ARGS)
+{
+    int    save_errno = errno;
+
+    got_SIGUSR2 = true;
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
+/*
+ * Main loop for the fdwxact launcher process.
+ */
+void
+FdwXactLauncherMain(Datum main_arg)
+{
+    TimestampTz    last_start_time = 0;
+
+    ereport(DEBUG1,
+            (errmsg("fdwxact resolver launcher started")));
+
+    before_shmem_exit(fdwxact_launcher_onexit, (Datum) 0);
+
+    Assert(FdwXactRslvCtl->launcher_pid == 0);
+    FdwXactRslvCtl->launcher_pid = MyProcPid;
+    FdwXactRslvCtl->launcher_latch = &MyProc->procLatch;
+
+    pqsignal(SIGHUP, fdwxact_launcher_sighup);
+    pqsignal(SIGUSR2, fdwxact_launcher_sigusr2);
+    pqsignal(SIGTERM, die);
+    BackgroundWorkerUnblockSignals();
+
+    BackgroundWorkerInitializeConnection(NULL, NULL, 0);
+
+    /* Enter main loop */
+    for (;;)
+    {
+        TimestampTz    now;
+        long    wait_time = DEFAULT_NAPTIME_PER_CYCLE;
+        int        rc;
+
+        CHECK_FOR_INTERRUPTS();
+        ResetLatch(MyLatch);
+
+        now = GetCurrentTimestamp();
+
+        /*
+         * Limit restart attempts to once per
+         * foreign_xact_resolution_retry_interval, but always start
+         * immediately when a backend requests it.
+         */
+        if (got_SIGUSR2 ||
+            TimestampDifferenceExceeds(last_start_time, now,
+                                       foreign_xact_resolution_retry_interval))
+        {
+            MemoryContext oldctx;
+            MemoryContext subctx;
+            bool launched;
+
+            if (got_SIGUSR2)
+                got_SIGUSR2 = false;
+
+            subctx = AllocSetContextCreate(TopMemoryContext,
+                                           "Foreign Transaction Launcher",
+                                           ALLOCSET_DEFAULT_SIZES);
+            oldctx = MemoryContextSwitchTo(subctx);
+
+            /*
+             * Launch foreign transaction resolvers that are requested
+             * but not running.
+             */
+            launched = fdwxact_relaunch_resolvers();
+            if (launched)
+            {
+                last_start_time = now;
+                wait_time = foreign_xact_resolution_retry_interval;
+            }
+
+            /* Switch back to original memory context. */
+            MemoryContextSwitchTo(oldctx);
+            /* Clean the temporary memory. */
+            MemoryContextDelete(subctx);
+        }
+        else
+        {
+            /*
+             * The wait in the previous cycle was interrupted less than
+             * foreign_xact_resolution_retry_interval after the last
+             * resolver started; this usually means the resolver crashed,
+             * so wait the full interval before retrying.
+             */
+            wait_time = foreign_xact_resolution_retry_interval;
+        }
+
+        /* Wait for more work */
+        rc = WaitLatch(MyLatch,
+                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                       wait_time,
+                       WAIT_EVENT_FDWXACT_LAUNCHER_MAIN);
+
+        if (rc & WL_POSTMASTER_DEATH)
+            proc_exit(1);
+
+        if (rc & WL_LATCH_SET)
+        {
+            ResetLatch(MyLatch);
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        if (got_SIGHUP)
+        {
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
+    }
+
+    /* Not reachable */
+}
+
+/*
+ * Request launcher to launch a new foreign transaction resolver process
+ * or wake up the resolver if it's already running.
+ */
+void
+FdwXactLaunchOrWakeupResolver(void)
+{
+    volatile FdwXactResolver *resolver;
+    bool    found = false;
+    int        i;
+
+    /*
+     * Look for a resolver process that is already running on the same
+     * database.
+     */
+    LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+    for (i = 0; i < max_foreign_xact_resolvers; i++)
+    {
+        resolver = &FdwXactRslvCtl->resolvers[i];
+
+        if (resolver->in_use &&
+            resolver->dbid == MyDatabaseId)
+        {
+            found = true;
+            break;
+        }
+    }
+    LWLockRelease(FdwXactResolverLock);
+
+    if (found)
+    {
+        /* Found the running resolver */
+        elog(DEBUG1,
+             "found a running foreign transaction resolver process for database %u",
+             MyDatabaseId);
+
+        /*
+         * Wake up the resolver.  It's possible that the resolver is still
+         * starting up and has not attached to its slot yet.  In that case
+         * the resolver will soon find the FdwXact entry we inserted, so we
+         * don't need to do anything here.
+         */
+        if (resolver->latch)
+            SetLatch(resolver->latch);
+
+        return;
+    }
+
+    /* Otherwise wake up the launcher to launch new resolver */
+    FdwXactLauncherRequestToLaunch();
+}
+
+/*
+ * Launch a foreign transaction resolver process that will connect to given
+ * 'dbid'.
+ */
+static void
+fdwxact_launch_resolver(Oid dbid)
+{
+    BackgroundWorker bgw;
+    BackgroundWorkerHandle *bgw_handle;
+    FdwXactResolver *resolver;
+    int unused_slot = -1;
+    int i;
+
+    LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+    /* Find an unused resolver slot */
+    for (i = 0; i < max_foreign_xact_resolvers; i++)
+    {
+        if (!FdwXactRslvCtl->resolvers[i].in_use)
+        {
+            unused_slot = i;
+            break;
+        }
+    }
+
+    /* No unused slot found */
+    if (unused_slot < 0)
+        ereport(ERROR,
+                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+                 errmsg("out of foreign transaction resolver slots"),
+                 errhint("You might need to increase max_foreign_transaction_resolvers.")));
+
+    resolver = &FdwXactRslvCtl->resolvers[unused_slot];
+    resolver->in_use = true;
+    resolver->dbid = dbid;
+    LWLockRelease(FdwXactResolverLock);
+
+    /* Register the new dynamic worker */
+    memset(&bgw, 0, sizeof(bgw));
+    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+        BGWORKER_BACKEND_DATABASE_CONNECTION;
+    bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactResolverMain");
+    snprintf(bgw.bgw_name, BGW_MAXLEN,
+             "foreign transaction resolver for database %u", resolver->dbid);
+    snprintf(bgw.bgw_type, BGW_MAXLEN, "foreign transaction resolver");
+    bgw.bgw_restart_time = BGW_NEVER_RESTART;
+    bgw.bgw_notify_pid = MyProcPid;
+    bgw.bgw_main_arg = Int32GetDatum(unused_slot);
+
+    if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+    {
+        /* Failed to launch, clean up the worker slot */
+        SpinLockAcquire(&resolver->mutex);
+        resolver->in_use = false;
+        SpinLockRelease(&resolver->mutex);
+
+        ereport(WARNING,
+                (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+                 errmsg("out of background worker slots"),
+                 errhint("You might need to increase max_worker_processes.")));
+    }
+
+    /*
+     * We don't need to wait until it attaches here because we're going to wait
+     * until all foreign transactions are resolved.
+     */
+}
+
+/*
+ * Launch foreign transaction resolvers on databases that have at least
+ * one FdwXact entry but no resolver running on them.
+ */
+static bool
+fdwxact_relaunch_resolvers(void)
+{
+    HTAB    *resolver_dbs;    /* DBs that resolvers are running on */
+    HTAB    *fdwxact_dbs;    /* DBs having at least one FdwXact entry */
+    HASHCTL    ctl;
+    HASH_SEQ_STATUS status;
+    Oid        *entry;
+    bool    launched = false;
+    int        i;
+
+    memset(&ctl, 0, sizeof(ctl));
+    ctl.keysize = sizeof(Oid);
+    ctl.entrysize = sizeof(Oid);
+    resolver_dbs = hash_create("resolver dblist",
+                               32, &ctl, HASH_ELEM | HASH_BLOBS);
+    fdwxact_dbs = hash_create("fdwxact dblist",
+                              32, &ctl, HASH_ELEM | HASH_BLOBS);
+
+    /* Collect OIDs of databases that have at least one non-in-doubt FdwXact entry */
+    LWLockAcquire(FdwXactLock, LW_SHARED);
+    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+    {
+        FdwXact fdwxact = FdwXactCtl->fdwxacts[i];
+
+        if (fdwxact->indoubt)
+            continue;
+
+        hash_search(fdwxact_dbs, &(fdwxact->dbid), HASH_ENTER, NULL);
+    }
+    LWLockRelease(FdwXactLock);
+
+    /* There is no FdwXact entry, no need to launch new one */
+    if (hash_get_num_entries(fdwxact_dbs) == 0)
+        return false;
+
+    /* Collect database oids on which resolvers are running */
+    LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+    for (i = 0; i < max_foreign_xact_resolvers; i++)
+    {
+        FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[i];
+
+        if (!resolver->in_use)
+            continue;
+
+        hash_search(resolver_dbs, &(resolver->dbid), HASH_ENTER, NULL);
+    }
+    LWLockRelease(FdwXactResolverLock);
+
+    /* Find DBs on which no resolvers are running and launch new one on them */
+    hash_seq_init(&status, fdwxact_dbs);
+    while ((entry = (Oid *) hash_seq_search(&status)) != NULL)
+    {
+        bool found;
+
+        hash_search(resolver_dbs, entry, HASH_FIND, &found);
+
+        if (!found)
+        {
+            /* No resolver is running on this database, launch new one */
+            fdwxact_launch_resolver(*entry);
+            launched = true;
+        }
+    }
+
+    return launched;
+}
+
+/*
+ * FdwXactLauncherRegister
+ *        Register a background worker running the foreign transaction
+ *      launcher.
+ */
+void
+FdwXactLauncherRegister(void)
+{
+    BackgroundWorker bgw;
+
+    if (max_foreign_xact_resolvers == 0)
+        return;
+
+    memset(&bgw, 0, sizeof(bgw));
+    bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+        BGWORKER_BACKEND_DATABASE_CONNECTION;
+    bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+    snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+    snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactLauncherMain");
+    snprintf(bgw.bgw_name, BGW_MAXLEN,
+             "foreign transaction launcher");
+    snprintf(bgw.bgw_type, BGW_MAXLEN,
+             "foreign transaction launcher");
+    bgw.bgw_restart_time = 5;
+    bgw.bgw_notify_pid = 0;
+    bgw.bgw_main_arg = (Datum) 0;
+
+    RegisterBackgroundWorker(&bgw);
+}
+
+bool
+IsFdwXactLauncher(void)
+{
+    return FdwXactRslvCtl->launcher_pid == MyProcPid;
+}
+
+/*
+ * Stop the fdwxact resolver running on the given database.
+ */
+Datum
+pg_stop_foreign_xact_resolver(PG_FUNCTION_ARGS)
+{
+    Oid dbid = PG_GETARG_OID(0);
+    FdwXactResolver *resolver = NULL;
+    int i;
+
+    /* Must be super user */
+    if (!superuser())
+        ereport(ERROR,
+                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                 errmsg("permission denied to stop foreign transaction resolver")));
+
+    if (!OidIsValid(dbid))
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("invalid database id")));
+
+    LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+
+    /* Find the running resolver process on the given database */
+    for (i = 0; i < max_foreign_xact_resolvers; i++)
+    {
+        resolver = &FdwXactRslvCtl->resolvers[i];
+
+        /* found! */
+        if (resolver->in_use && resolver->dbid == dbid)
+            break;
+    }
+
+    if (i >= max_foreign_xact_resolvers)
+        ereport(ERROR,
+                (errmsg("there is no running foreign transaction resolver process on database %u",
+                        dbid)));
+
+    /* Found the resolver, terminate it ... */
+    kill(resolver->pid, SIGTERM);
+
+    /* ... and wait for it to die */
+    for (;;)
+    {
+        int rc;
+
+        /* is it gone? */
+        if (!resolver->in_use)
+            break;
+
+        LWLockRelease(FdwXactResolverLock);
+
+        /* Wait a bit --- we don't expect to have to wait long. */
+        rc = WaitLatch(MyLatch,
+                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                        10L, WAIT_EVENT_BGWORKER_SHUTDOWN);
+
+        if (rc & WL_LATCH_SET)
+        {
+            ResetLatch(MyLatch);
+            CHECK_FOR_INTERRUPTS();
+        }
+
+        LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+    }
+
+    LWLockRelease(FdwXactResolverLock);
+
+    PG_RETURN_BOOL(true);
+}
+
+/*
+ * Returns activity of all foreign transaction resolvers.
+ */
+Datum
+pg_stat_get_foreign_xact(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_FDWXACT_RESOLVERS_COLS 3
+    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+    TupleDesc    tupdesc;
+    Tuplestorestate *tupstore;
+    MemoryContext per_query_ctx;
+    MemoryContext oldcontext;
+    int i;
+
+    /* check to see if caller supports us returning a tuplestore */
+    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("set-valued function called in context that cannot accept a set")));
+    if (!(rsinfo->allowedModes & SFRM_Materialize))
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("materialize mode required, but it is not " \
+                        "allowed in this context")));
+
+    /* Build a tuple descriptor for our result type */
+    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+        elog(ERROR, "return type must be a row type");
+
+    per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+    oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+    tupstore = tuplestore_begin_heap(true, false, work_mem);
+    rsinfo->returnMode = SFRM_Materialize;
+    rsinfo->setResult = tupstore;
+    rsinfo->setDesc = tupdesc;
+
+    MemoryContextSwitchTo(oldcontext);
+
+    for (i = 0; i < max_foreign_xact_resolvers; i++)
+    {
+        FdwXactResolver    *resolver = &FdwXactRslvCtl->resolvers[i];
+        pid_t    pid;
+        Oid        dbid;
+        TimestampTz last_resolved_time;
+        Datum        values[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+        bool        nulls[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+
+
+        SpinLockAcquire(&(resolver->mutex));
+        if (resolver->pid == InvalidPid)
+        {
+            SpinLockRelease(&(resolver->mutex));
+            continue;
+        }
+
+        pid = resolver->pid;
+        dbid = resolver->dbid;
+        last_resolved_time = resolver->last_resolved_time;
+        SpinLockRelease(&(resolver->mutex));
+
+        memset(nulls, 0, sizeof(nulls));
+        /* pid */
+        values[0] = Int32GetDatum(pid);
+
+        /* dbid */
+        values[1] = ObjectIdGetDatum(dbid);
+
+        /* last_resolved_time */
+        if (last_resolved_time == 0)
+            nulls[2] = true;
+        else
+            values[2] = TimestampTzGetDatum(last_resolved_time);
+
+        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+    }
+
+    /* clean up and return the tuplestore */
+    tuplestore_donestoring(tupstore);
+
+    return (Datum) 0;
+}
diff --git a/src/backend/access/fdwxact/resolver.c b/src/backend/access/fdwxact/resolver.c
new file mode 100644
index 0000000000..9298877f10
--- /dev/null
+++ b/src/backend/access/fdwxact/resolver.c
@@ -0,0 +1,344 @@
+/*-------------------------------------------------------------------------
+ *
+ * resolver.c
+ *
+ * The foreign transaction resolver background worker resolves foreign
+ * transactions that participate in a distributed transaction. A resolver
+ * process is started by the foreign transaction launcher for each database.
+ *
+ * A resolver process continues to resolve foreign transactions on the
+ * database for which backend processes are waiting for resolution.
+ *
+ * Normal termination is by SIGTERM, which instructs the resolver process
+ * to exit(0) at the next convenient moment. Emergency termination is by
+ * SIGQUIT, as with any backend. The resolver process also terminates on
+ * timeout, but only if there are no pending foreign transactions on the
+ * database waiting to be resolved.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/backend/access/fdwxact/resolver.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/fdwxact_launcher.h"
+#include "access/resolver_internal.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+/* max sleep time between cycles (3min) */
+#define DEFAULT_NAPTIME_PER_CYCLE 180000L
+
+/* GUC parameters */
+int foreign_xact_resolution_retry_interval;
+int foreign_xact_resolver_timeout = 60 * 1000;
+bool foreign_xact_resolve_indoubt_xacts;
+
+FdwXactRslvCtlData *FdwXactRslvCtl;
+
+static void FXRslvLoop(void);
+static long FXRslvComputeSleepTime(TimestampTz now, TimestampTz targetTime);
+static void FXRslvCheckTimeout(TimestampTz now);
+
+static void fdwxact_resolver_sighup(SIGNAL_ARGS);
+static void fdwxact_resolver_onexit(int code, Datum arg);
+static void fdwxact_resolver_detach(void);
+static void fdwxact_resolver_attach(int slot);
+
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_SIGHUP = false;
+
+/* Set flag to reload configuration at next convenient time */
+static void
+fdwxact_resolver_sighup(SIGNAL_ARGS)
+{
+    int        save_errno = errno;
+
+    got_SIGHUP = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
+/*
+ * Detach the resolver and cleanup the resolver info.
+ */
+static void
+fdwxact_resolver_detach(void)
+{
+    /* Block concurrent access */
+    LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+    MyFdwXactResolver->pid = InvalidPid;
+    MyFdwXactResolver->in_use = false;
+    MyFdwXactResolver->dbid = InvalidOid;
+
+    LWLockRelease(FdwXactResolverLock);
+}
+
+/*
+ * Cleanup up foreign transaction resolver info.
+ */
+static void
+fdwxact_resolver_onexit(int code, Datum arg)
+{
+    fdwxact_resolver_detach();
+
+    FdwXactLauncherRequestToLaunch();
+}
+
+/*
+ * Attach to a slot.
+ */
+static void
+fdwxact_resolver_attach(int slot)
+{
+    /* Block concurrent access */
+    LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+    Assert(slot >= 0 && slot < max_foreign_xact_resolvers);
+    MyFdwXactResolver = &FdwXactRslvCtl->resolvers[slot];
+
+    if (!MyFdwXactResolver->in_use)
+    {
+        LWLockRelease(FdwXactResolverLock);
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("foreign transaction resolver slot %d is empty, cannot attach",
+                        slot)));
+    }
+
+    Assert(OidIsValid(MyFdwXactResolver->dbid));
+
+    MyFdwXactResolver->pid = MyProcPid;
+    MyFdwXactResolver->latch = &MyProc->procLatch;
+    MyFdwXactResolver->last_resolved_time = 0;
+
+    before_shmem_exit(fdwxact_resolver_onexit, (Datum) 0);
+
+    LWLockRelease(FdwXactResolverLock);
+}
+
+/* Foreign transaction resolver entry point */
+void
+FdwXactResolverMain(Datum main_arg)
+{
+    int slot = DatumGetInt32(main_arg);
+
+    /* Attach to a slot */
+    fdwxact_resolver_attach(slot);
+
+    /* Establish signal handlers */
+    pqsignal(SIGHUP, fdwxact_resolver_sighup);
+    pqsignal(SIGTERM, die);
+    BackgroundWorkerUnblockSignals();
+
+    /* Connect to our database */
+    BackgroundWorkerInitializeConnectionByOid(MyFdwXactResolver->dbid, InvalidOid, 0);
+
+    StartTransactionCommand();
+
+    ereport(LOG,
+            (errmsg("foreign transaction resolver for database \"%s\" has started",
+                    get_database_name(MyFdwXactResolver->dbid))));
+
+    CommitTransactionCommand();
+
+    /* Initialize stats to a sanish value */
+    MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp();
+
+    /* Run the main loop */
+    FXRslvLoop();
+
+    proc_exit(0);
+}
+
+/*
+ * Fdwxact resolver main loop
+ */
+static void
+FXRslvLoop(void)
+{
+    MemoryContext resolver_ctx;
+
+    resolver_ctx = AllocSetContextCreate(TopMemoryContext,
+                                         "Foreign Transaction Resolver",
+                                         ALLOCSET_DEFAULT_SIZES);
+
+    /* Enter main loop */
+    for (;;)
+    {
+        PGPROC            *waiter = NULL;
+        TransactionId    waitXid = InvalidTransactionId;
+        TimestampTz        resolutionTs = -1;
+        int            rc;
+        TimestampTz    now;
+        long        sleep_time = DEFAULT_NAPTIME_PER_CYCLE;
+
+        ResetLatch(MyLatch);
+
+        CHECK_FOR_INTERRUPTS();
+
+        MemoryContextSwitchTo(resolver_ctx);
+
+        if (got_SIGHUP)
+        {
+            got_SIGHUP = false;
+            ProcessConfigFile(PGC_SIGHUP);
+        }
+
+        now = GetCurrentTimestamp();
+
+        /*
+         * Process waiters until either the queue becomes empty or we find a
+         * waiter whose resolution time is in the future.
+         */
+        while ((waiter = FdwXactGetWaiter(&resolutionTs, &waitXid)) != NULL)
+        {
+            CHECK_FOR_INTERRUPTS();
+            Assert(TransactionIdIsValid(waitXid));
+
+            if (resolutionTs > now)
+                break;
+
+            elog(DEBUG2, "resolver got one waiter with xid %u", waitXid);
+
+            /* Resolve the waiting distributed transaction */
+            StartTransactionCommand();
+            FdwXactResolveTransactionAndReleaseWaiter(MyDatabaseId, waitXid,
+                                                      waiter);
+            CommitTransactionCommand();
+
+            /* Update my stats */
+            SpinLockAcquire(&(MyFdwXactResolver->mutex));
+            MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp();
+            SpinLockRelease(&(MyFdwXactResolver->mutex));
+        }
+
+        FXRslvCheckTimeout(now);
+
+        sleep_time = FXRslvComputeSleepTime(now, resolutionTs);
+
+        MemoryContextResetAndDeleteChildren(resolver_ctx);
+        MemoryContextSwitchTo(TopMemoryContext);
+
+        rc = WaitLatch(MyLatch,
+                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+                       sleep_time,
+                       WAIT_EVENT_FDWXACT_RESOLVER_MAIN);
+
+        if (rc & WL_POSTMASTER_DEATH)
+            proc_exit(1);
+    }
+}
+
+/*
+ * Check whether any foreign transaction has been resolved within
+ * foreign_xact_resolver_timeout, and shut down if not.
+ */
+static void
+FXRslvCheckTimeout(TimestampTz now)
+{
+    TimestampTz last_resolved_time;
+    TimestampTz timeout;
+
+    if (foreign_xact_resolver_timeout == 0)
+        return;
+
+    last_resolved_time = MyFdwXactResolver->last_resolved_time;
+    timeout = TimestampTzPlusMilliseconds(last_resolved_time,
+                                          foreign_xact_resolver_timeout);
+
+    if (now < timeout)
+        return;
+
+    LWLockAcquire(FdwXactResolutionLock, LW_SHARED);
+    if (!FdwXactWaiterExists(MyDatabaseId))
+    {
+        StartTransactionCommand();
+        ereport(LOG,
+                (errmsg("foreign transaction resolver for database \"%s\" will stop because of the timeout",
+                        get_database_name(MyDatabaseId))));
+        CommitTransactionCommand();
+
+        /*
+         * Keep holding FdwXactResolutionLock until we have detached the
+         * slot. This is necessary to prevent a race condition where a
+         * waiter enqueues itself after we have checked FdwXactWaiterExists.
+         */
+        fdwxact_resolver_detach();
+        LWLockRelease(FdwXactResolutionLock);
+        proc_exit(0);
+    }
+    else
+        elog(DEBUG2, "resolver reached the timeout but does not exit because the queue is not empty");
+
+    LWLockRelease(FdwXactResolutionLock);
+}
+
+/*
+ * Compute how long we should sleep until the next cycle. We can sleep until
+ * the timeout or until the next resolution time given by nextResolutionTs.
+ */
+static long
+FXRslvComputeSleepTime(TimestampTz now, TimestampTz nextResolutionTs)
+{
+    long    sleeptime = DEFAULT_NAPTIME_PER_CYCLE;
+
+    if (foreign_xact_resolver_timeout > 0)
+    {
+        TimestampTz timeout;
+        long    sec_to_timeout;
+        int        microsec_to_timeout;
+
+        /* Compute relative time until wakeup. */
+        timeout = TimestampTzPlusMilliseconds(MyFdwXactResolver->last_resolved_time,
+                                              foreign_xact_resolver_timeout);
+        TimestampDifference(now, timeout,
+                            &sec_to_timeout, &microsec_to_timeout);
+
+        sleeptime = Min(sleeptime,
+                        sec_to_timeout * 1000 + microsec_to_timeout / 1000);
+    }
+
+    if (nextResolutionTs > 0)
+    {
+        long    sec_to_timeout;
+        int        microsec_to_timeout;
+
+        TimestampDifference(now, nextResolutionTs,
+                            &sec_to_timeout, &microsec_to_timeout);
+
+        sleeptime = Min(sleeptime,
+                        sec_to_timeout * 1000 + microsec_to_timeout / 1000);
+    }
+
+    return sleeptime;
+}
+
+bool
+IsFdwXactResolver(void)
+{
+    return MyFdwXactResolver != NULL;
+}
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index f88d72fd86..982c1a36cc 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -13,6 +13,7 @@ OBJS = \
     clogdesc.o \
     committsdesc.o \
     dbasedesc.o \
+    fdwxactdesc.o \
     genericdesc.o \
     gindesc.o \
     gistdesc.o \
diff --git a/src/backend/access/rmgrdesc/fdwxactdesc.c b/src/backend/access/rmgrdesc/fdwxactdesc.c
new file mode 100644
index 0000000000..fe0cef9472
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdwxactdesc.c
@@ -0,0 +1,58 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxactdesc.c
+ *        rmgr descriptor routines for the foreign transaction manager
+ *
+ * This module describes the WAL records for the foreign transaction manager.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/backend/access/rmgrdesc/fdwxactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdwxact_xlog.h"
+
+void
+fdwxact_desc(StringInfo buf, XLogReaderState *record)
+{
+    char       *rec = XLogRecGetData(record);
+    uint8        info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+    if (info == XLOG_FDWXACT_INSERT)
+    {
+        FdwXactOnDiskData *fdwxact_insert = (FdwXactOnDiskData *) rec;
+
+        appendStringInfo(buf, "server: %u,", fdwxact_insert->serverid);
+        appendStringInfo(buf, " user: %u,", fdwxact_insert->userid);
+        appendStringInfo(buf, " database: %u,", fdwxact_insert->dbid);
+        appendStringInfo(buf, " local xid: %u,", fdwxact_insert->local_xid);
+        appendStringInfo(buf, " id: %s", fdwxact_insert->fdwxact_id);
+    }
+    else
+    {
+        xl_fdwxact_remove *fdwxact_remove = (xl_fdwxact_remove *) rec;
+
+        appendStringInfo(buf, "server: %u,", fdwxact_remove->serverid);
+        appendStringInfo(buf, " user: %u,", fdwxact_remove->userid);
+        appendStringInfo(buf, " database: %u,", fdwxact_remove->dbid);
+        appendStringInfo(buf, " local xid: %u", fdwxact_remove->xid);
+    }
+
+}
+
+const char *
+fdwxact_identify(uint8 info)
+{
+    switch (info & ~XLR_INFO_MASK)
+    {
+        case XLOG_FDWXACT_INSERT:
+            return "NEW FOREIGN TRANSACTION";
+        case XLOG_FDWXACT_REMOVE:
+            return "REMOVE FOREIGN TRANSACTION";
+    }
+    /* Keep compiler happy */
+    return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 33060f3042..1d4e1c82e1 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
         appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
                          "max_wal_senders=%d max_prepared_xacts=%d "
                          "max_locks_per_xact=%d wal_level=%s "
-                         "wal_log_hints=%s track_commit_timestamp=%s",
+                         "wal_log_hints=%s track_commit_timestamp=%s "
+                         "max_prepared_foreign_transactions=%d",
                          xlrec.MaxConnections,
                          xlrec.max_worker_processes,
                          xlrec.max_wal_senders,
@@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
                          xlrec.max_locks_per_xact,
                          wal_level_str,
                          xlrec.wal_log_hints ? "on" : "off",
-                         xlrec.track_commit_timestamp ? "on" : "off");
+                         xlrec.track_commit_timestamp ? "on" : "off",
+                         xlrec.max_prepared_foreign_xacts);
     }
     else if (info == XLOG_FPW_CHANGE)
     {
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 58091f6b52..200cf9d067 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -10,6 +10,7 @@
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 529976885f..2c9af36bbb 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -850,6 +851,35 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)
     return result;
 }
 
+/*
+ * TwoPhaseExists
+ *        Return true if there is a prepared transaction specified by XID
+ */
+bool
+TwoPhaseExists(TransactionId xid)
+{
+    int        i;
+    bool    found = false;
+
+    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+    {
+        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+        PGXACT    *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+        if (pgxact->xid == xid)
+        {
+            found = true;
+            break;
+        }
+    }
+
+    LWLockRelease(TwoPhaseStateLock);
+
+    return found;
+}
+
 /*
  * TwoPhaseGetDummyBackendId
  *        Get the dummy backend ID for prepared transaction specified by XID
@@ -2262,6 +2292,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
      * in the procarray and continue to hold locks.
      */
     SyncRepWaitForLSN(recptr, true);
+
+    /*
+     * Wait for foreign transactions prepared as part of this prepared
+     * transaction to be committed.
+     */
+    FdwXactWaitToBeResolved(xid, true);
 }
 
 /*
@@ -2321,6 +2357,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
      * in the procarray and continue to hold locks.
      */
     SyncRepWaitForLSN(recptr, false);
+
+    /*
+     * Wait for foreign transactions prepared as part of this prepared
+     * transaction to be aborted.
+     */
+    FdwXactWaitToBeResolved(xid, false);
 }
 
 /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5353b6ab0b..5b67056c65 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1218,6 +1219,7 @@ RecordTransactionCommit(void)
     SharedInvalidationMessage *invalMessages = NULL;
     bool        RelcacheInitFileInval = false;
     bool        wrote_xlog;
+    bool        need_commit_globally;
 
     /* Get data needed for commit record */
     nrels = smgrGetPendingDeletes(true, &rels);
@@ -1226,6 +1228,7 @@ RecordTransactionCommit(void)
         nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
                                                      &RelcacheInitFileInval);
     wrote_xlog = (XactLastRecEnd != 0);
+    need_commit_globally = FdwXactIsForeignTwophaseCommitRequired();
 
     /*
      * If we haven't been assigned an XID yet, we neither can, nor do we want
@@ -1264,12 +1267,13 @@ RecordTransactionCommit(void)
         }
 
         /*
-         * If we didn't create XLOG entries, we're done here; otherwise we
-         * should trigger flushing those entries the same as a commit record
+         * If we didn't create XLOG entries and the transaction does not need
+         * to be committed using two-phase commit, we're done here; otherwise
+         * we should trigger flushing those entries the same as a commit record
          * would.  This will primarily happen for HOT pruning and the like; we
          * want these to be flushed to disk in due time.
          */
-        if (!wrote_xlog)
+        if (!wrote_xlog && !need_commit_globally)
             goto cleanup;
     }
     else
@@ -1427,6 +1431,14 @@ RecordTransactionCommit(void)
     if (wrote_xlog && markXidCommitted)
         SyncRepWaitForLSN(XactLastRecEnd, true);
 
+    /*
+     * Wait for prepared foreign transactions to be resolved, if required.
+     * We only want to wait if we prepared foreign transactions in this
+     * transaction.
+     */
+    if (need_commit_globally && markXidCommitted)
+        FdwXactWaitToBeResolved(xid, true);
+
     /* remember end of last commit record */
     XactLastCommitEnd = XactLastRecEnd;
 
@@ -2086,6 +2098,10 @@ CommitTransaction(void)
             break;
     }
 
+
+    /* Pre-commit step for foreign transactions */
+    PreCommit_FdwXacts();
+
     CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
                       : XACT_EVENT_PRE_COMMIT);
 
@@ -2246,6 +2262,7 @@ CommitTransaction(void)
     AtEOXact_PgStat(true, is_parallel_worker);
     AtEOXact_Snapshot(true, false);
     AtEOXact_ApplyLauncher(true);
+    AtEOXact_FdwXacts(true);
     pgstat_report_xact_timestamp(0);
 
     CurrentResourceOwner = NULL;
@@ -2333,6 +2350,8 @@ PrepareTransaction(void)
      * the transaction-abort path.
      */
 
+    AtPrepare_FdwXacts();
+
     /* Shut down the deferred-trigger manager */
     AfterTriggerEndXact(true);
 
@@ -2527,6 +2546,7 @@ PrepareTransaction(void)
     AtEOXact_Files(true);
     AtEOXact_ComboCid();
     AtEOXact_HashTables(true);
+    AtEOXact_FdwXacts(true);
     /* don't call AtEOXact_PgStat here; we fixed pgstat state above */
     AtEOXact_Snapshot(true, true);
     pgstat_report_xact_timestamp(0);
@@ -2732,6 +2752,7 @@ AbortTransaction(void)
         AtEOXact_HashTables(false);
         AtEOXact_PgStat(false, is_parallel_worker);
         AtEOXact_ApplyLauncher(false);
+        AtEOXact_FdwXacts(false);
         pgstat_report_xact_timestamp(0);
     }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6bc1a6b46d..428a974c51 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/heaptoast.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
@@ -5246,6 +5247,7 @@ BootStrapXLOG(void)
     ControlFile->max_worker_processes = max_worker_processes;
     ControlFile->max_wal_senders = max_wal_senders;
     ControlFile->max_prepared_xacts = max_prepared_xacts;
+    ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
     ControlFile->max_locks_per_xact = max_locks_per_xact;
     ControlFile->wal_level = wal_level;
     ControlFile->wal_log_hints = wal_log_hints;
@@ -6189,6 +6191,9 @@ CheckRequiredParameterValues(void)
         RecoveryRequiresIntParameter("max_wal_senders",
                                      max_wal_senders,
                                      ControlFile->max_wal_senders);
+        RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+                                     max_prepared_foreign_xacts,
+                                     ControlFile->max_prepared_foreign_xacts);
         RecoveryRequiresIntParameter("max_prepared_transactions",
                                      max_prepared_xacts,
                                      ControlFile->max_prepared_xacts);
@@ -6729,14 +6734,15 @@ StartupXLOG(void)
     restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI);
 
     /*
-     * Before running in recovery, scan pg_twophase and fill in its status to
-     * be able to work on entries generated by redo.  Doing a scan before
-     * taking any recovery action has the merit to discard any 2PC files that
-     * are newer than the first record to replay, saving from any conflicts at
-     * replay.  This avoids as well any subsequent scans when doing recovery
-     * of the on-disk two-phase data.
+     * Before running in recovery, scan pg_twophase and pg_fdwxacts, and then
+     * fill in its status to be able to work on entries generated by redo.
+     * Doing a scan before taking any recovery action has the merit to discard
+     * any state files that are newer than the first record to replay, saving
+     * from any conflicts at replay.  This avoids as well any subsequent scans
+     * when doing recovery of the on-disk two-phase or fdwxact data.
      */
     restoreTwoPhaseData();
+    restoreFdwXactData();
 
     lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -6928,7 +6934,10 @@ StartupXLOG(void)
             InitRecoveryTransactionEnvironment();
 
             if (wasShutdown)
+            {
                 oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+                oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
+            }
             else
                 oldestActiveXID = checkPoint.oldestActiveXid;
             Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7424,6 +7433,7 @@ StartupXLOG(void)
      * as potential problems are detected before any on-disk change is done.
      */
     oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+    oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
     /*
      * Consider whether we need to assign a new timeline ID.
@@ -7754,6 +7764,9 @@ StartupXLOG(void)
     /* Reload shared-memory state for prepared transactions */
     RecoverPreparedTransactions();
 
+    /* Load all foreign transaction entries from disk to memory */
+    RecoverFdwXacts();
+
     /*
      * Shutdown the recovery environment. This must occur after
      * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -9029,6 +9042,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
     CheckPointReplicationOrigin();
     /* We deliberately delay 2PC checkpointing as long as possible */
     CheckPointTwoPhase(checkPointRedo);
+    CheckPointFdwXacts(checkPointRedo);
 }
 
 /*
@@ -9462,8 +9476,9 @@
         max_worker_processes != ControlFile->max_worker_processes ||
         max_wal_senders != ControlFile->max_wal_senders ||
         max_prepared_xacts != ControlFile->max_prepared_xacts ||
+        max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts ||
         max_locks_per_xact != ControlFile->max_locks_per_xact ||
         track_commit_timestamp != ControlFile->track_commit_timestamp)
     {
         /*
          * The change in number of backend slots doesn't need to be WAL-logged
@@ -9481,6 +9497,7 @@ XLogReportParameters(void)
             xlrec.max_worker_processes = max_worker_processes;
             xlrec.max_wal_senders = max_wal_senders;
             xlrec.max_prepared_xacts = max_prepared_xacts;
+            xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts;
             xlrec.max_locks_per_xact = max_locks_per_xact;
             xlrec.wal_level = wal_level;
             xlrec.wal_log_hints = wal_log_hints;
@@ -9497,6 +9514,7 @@ XLogReportParameters(void)
         ControlFile->max_worker_processes = max_worker_processes;
         ControlFile->max_wal_senders = max_wal_senders;
         ControlFile->max_prepared_xacts = max_prepared_xacts;
+        ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
         ControlFile->max_locks_per_xact = max_locks_per_xact;
         ControlFile->wal_level = wal_level;
         ControlFile->wal_log_hints = wal_log_hints;
@@ -9702,6 +9720,7 @@ xlog_redo(XLogReaderState *record)
             RunningTransactionsData running;
 
             oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+            oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
             /*
              * Construct a RunningTransactions snapshot representing a shut
@@ -9901,6 +9920,7 @@ xlog_redo(XLogReaderState *record)
         ControlFile->max_worker_processes = xlrec.max_worker_processes;
         ControlFile->max_wal_senders = xlrec.max_wal_senders;
         ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+        ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
         ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
         ControlFile->wal_level = xlrec.wal_level;
         ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f01a6..b4c1cce1f0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -332,6 +332,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_foreign_xacts AS
+       SELECT * FROM pg_foreign_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
     l.objoid, l.classoid, l.objsubid,
@@ -818,6 +821,14 @@ CREATE VIEW pg_stat_subscription AS
             LEFT JOIN pg_stat_get_subscription(NULL) st
                       ON (st.subid = su.oid);
 
+CREATE VIEW pg_stat_foreign_xact AS
+    SELECT
+            r.pid,
+            r.dbid,
+            r.last_resolved_time
+    FROM pg_stat_get_foreign_xact() r
+    WHERE r.pid IS NOT NULL;
+
 CREATE VIEW pg_stat_ssl AS
     SELECT
             S.pid,
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 42a147b67d..e3caef7ef9 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2857,8 +2857,14 @@ CopyFrom(CopyState cstate)
 
     if (resultRelInfo->ri_FdwRoutine != NULL &&
         resultRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
+    {
+        /* Remember that the transaction modifies data on a foreign server */
+        RegisterFdwXactByRelId(RelationGetRelid(resultRelInfo->ri_RelationDesc),
+                               true);
+
         resultRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate,
                                                          resultRelInfo);
+    }
 
     /* Prepare to catch AFTER triggers. */
     AfterTriggerBeginQuery();
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 766c9f95c8..43bbe8356d 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,8 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
+#include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/table.h"
@@ -1101,6 +1103,18 @@ RemoveForeignServerById(Oid srvId)
     if (!HeapTupleIsValid(tp))
         elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+    /*
+     * If there is a foreign prepared transaction with this foreign server,
+     * dropping it might result in a dangling prepared transaction.
+     */
+    if (fdwxact_exists(MyDatabaseId, srvId, InvalidOid))
+    {
+        Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+        ereport(WARNING,
+                (errmsg("server \"%s\" has unresolved prepared transactions on it",
+                        NameStr(srvForm->srvname))));
+    }
+
     CatalogTupleDelete(rel, &tp->t_self);
 
     ReleaseSysCache(tp);
@@ -1419,6 +1433,15 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 
     user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
+    /*
+     * If there is a foreign prepared transaction with this user mapping,
+     * dropping it might result in a dangling prepared transaction.
+     */
+    if (fdwxact_exists(MyDatabaseId, srv->serverid, useId))
+        ereport(WARNING,
+                (errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+                        srv->servername, MappingUserName(useId))));
+
     /*
      * Do the deletion
      */
@@ -1572,6 +1595,13 @@ ImportForeignSchema(ImportForeignSchemaStmt *stmt)
                  errmsg("foreign-data wrapper \"%s\" does not support IMPORT FOREIGN SCHEMA",
                         fdw->fdwname)));
 
+    /*
+     * Remember that the transaction accesses a foreign server. Normally
+     * ImportForeignSchema doesn't modify data on the foreign server, so
+     * register it as a non-modifying access.
+     */
+    RegisterFdwXactByServerId(server->serverid, false);
+
     /* Call FDW to get a list of commands */
     cmd_list = fdw_routine->ImportForeignSchema(stmt, server->serverid);
 
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d23f292cb0..690717c34e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/table.h"
 #include "access/tableam.h"
 #include "catalog/partition.h"
@@ -944,7 +945,14 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
      */
     if (partRelInfo->ri_FdwRoutine != NULL &&
         partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
+    {
+        Relation        child = partRelInfo->ri_RelationDesc;
+
+        /* Remember that the transaction modifies data on a foreign server */
+        RegisterFdwXactByRelId(RelationGetRelid(child), true);
+
         partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
+    }
 
     partRelInfo->ri_PartitionInfo = partrouteinfo;
     partRelInfo->ri_CopyMultiInsertBuffer = NULL;
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 52af1dac5c..3ac56d1678 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -22,6 +22,8 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
+#include "access/xact.h"
 #include "executor/executor.h"
 #include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
@@ -224,9 +226,31 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
      * Tell the FDW to initialize the scan.
      */
     if (node->operation != CMD_SELECT)
+    {
+        RangeTblEntry    *rte;
+
+        rte = exec_rt_fetch(estate->es_result_relation_info->ri_RangeTableIndex,
+                            estate);
+
+        /* Remember that the transaction modifies data on a foreign server */
+        RegisterFdwXactByRelId(rte->relid, true);
+
         fdwroutine->BeginDirectModify(scanstate, eflags);
+    }
     else
+    {
+        RangeTblEntry    *rte;
+        int rtindex = (scanrelid > 0) ?
+            scanrelid :
+            bms_next_member(node->fs_relids, -1);
+
+        rte = exec_rt_fetch(rtindex, estate);
+
+        /* Remember that the transaction accesses a foreign server */
+        RegisterFdwXactByRelId(rte->relid, false);
+
         fdwroutine->BeginForeignScan(scanstate, eflags);
+    }
 
     return scanstate;
 }
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index cd91f9c8a8..c1ab3d829a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -37,6 +37,7 @@
 
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/tableam.h"
@@ -47,6 +48,7 @@
 #include "executor/executor.h"
 #include "executor/nodeModifyTable.h"
 #include "foreign/fdwapi.h"
+#include "foreign/foreign.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "rewrite/rewriteHandler.h"
@@ -549,6 +551,10 @@ ExecInsert(ModifyTableState *mtstate,
                                            NULL,
                                            specToken);
 
+            /* Make note that we've written to a non-temporary relation */
+            if (RelationNeedsWAL(resultRelationDesc))
+                MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
             /* insert index entries for tuple */
             recheckIndexes = ExecInsertIndexTuples(slot, estate, true,
                                                    &specConflict,
@@ -777,6 +783,10 @@ ldelete:;
                                     &tmfd,
                                     changingPart);
 
+        /* Make note that we've written to a non-temporary relation */
+        if (RelationNeedsWAL(resultRelationDesc))
+            MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
         switch (result)
         {
             case TM_SelfModified:
@@ -1323,6 +1333,10 @@ lreplace:;
                                     true /* wait for commit */ ,
                                     &tmfd, &lockmode, &update_indexes);
 
+        /* Make note that we've written to a non-temporary relation */
+        if (RelationNeedsWAL(resultRelationDesc))
+            MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
         switch (result)
         {
             case TM_SelfModified:
@@ -2382,6 +2396,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
             resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL)
         {
             List       *fdw_private = (List *) list_nth(node->fdwPrivLists, i);
+            Oid            relid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+            /* Remember that the transaction modifies data on a foreign server */
+            RegisterFdwXactByRelId(relid, true);
 
             resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate,
                                                              resultRelInfo,
diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c
index c917ec40ff..0b17505aac 100644
--- a/src/backend/foreign/foreign.c
+++ b/src/backend/foreign/foreign.c
@@ -187,6 +187,49 @@ GetForeignServerByName(const char *srvname, bool missing_ok)
     return GetForeignServer(serverid);
 }
 
+/*
+ * GetUserMappingByOid - look up the user mapping by user mapping OID.
+ *
+ * If the userid of the mapping is invalid, we set it to the current userid.
+ */
+UserMapping *
+GetUserMappingByOid(Oid umid)
+{
+    Datum        datum;
+    HeapTuple   tp;
+    UserMapping    *um;
+    bool        isnull;
+    Form_pg_user_mapping tableform;
+
+    tp = SearchSysCache1(USERMAPPINGOID,
+                         ObjectIdGetDatum(umid));
+
+    if (!HeapTupleIsValid(tp))
+        ereport(ERROR,
+                (errcode(ERRCODE_UNDEFINED_OBJECT),
                 errmsg("user mapping not found for %u", umid)));
+
+    tableform = (Form_pg_user_mapping) GETSTRUCT(tp);
+    um = (UserMapping *) palloc(sizeof(UserMapping));
+    um->umid = umid;
+    um->userid = OidIsValid(tableform->umuser) ?
+        tableform->umuser : GetUserId();
+    um->serverid = tableform->umserver;
+
+    /* Extract the umoptions */
+    datum = SysCacheGetAttr(USERMAPPINGUSERSERVER,
+                            tp,
+                            Anum_pg_user_mapping_umoptions,
+                            &isnull);
+    if (isnull)
+        um->options = NIL;
+    else
+        um->options = untransformRelOptions(datum);
+
+    ReleaseSysCache(tp);
+
+    return um;
+}
 
 /*
  * GetUserMapping - look up the user mapping.
@@ -328,6 +371,20 @@ GetFdwRoutine(Oid fdwhandler)
         elog(ERROR, "foreign-data wrapper handler function %u did not return an FdwRoutine struct",
              fdwhandler);
 
+    /* Sanity check for transaction management callbacks */
+    if ((routine->CommitForeignTransaction &&
+         !routine->RollbackForeignTransaction) ||
+        (!routine->CommitForeignTransaction &&
+         routine->RollbackForeignTransaction))
+        elog(ERROR,
             "foreign-data wrapper must provide both commit and rollback routines, or neither");
+
+    if (routine->PrepareForeignTransaction &&
+        (!routine->CommitForeignTransaction ||
+         !routine->RollbackForeignTransaction))
+        elog(ERROR,
+             "foreign-data wrapper that supports prepare routine must support both commit and rollback routines");
+
     return routine;
 }
 
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 5f8a007e73..0a8890a984 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -14,6 +14,8 @@
 
 #include <unistd.h>
 
+#include "access/fdwxact_launcher.h"
+#include "access/fdwxact_resolver.h"
 #include "access/parallel.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -129,6 +131,12 @@ static const struct
     },
     {
         "ApplyWorkerMain", ApplyWorkerMain
+    },
+    {
+        "FdwXactResolverMain", FdwXactResolverMain
+    },
+    {
+        "FdwXactLauncherMain", FdwXactLauncherMain
     }
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index fabcf31de8..0d3932c2cf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3650,6 +3650,12 @@ pgstat_get_wait_activity(WaitEventActivity w)
         case WAIT_EVENT_CHECKPOINTER_MAIN:
             event_name = "CheckpointerMain";
             break;
+        case WAIT_EVENT_FDWXACT_RESOLVER_MAIN:
+            event_name = "FdwXactResolverMain";
+            break;
+        case WAIT_EVENT_FDWXACT_LAUNCHER_MAIN:
+            event_name = "FdwXactLauncherMain";
+            break;
         case WAIT_EVENT_LOGICAL_APPLY_MAIN:
             event_name = "LogicalApplyMain";
             break;
@@ -3853,6 +3859,11 @@ pgstat_get_wait_ipc(WaitEventIPC w)
         case WAIT_EVENT_SYNC_REP:
             event_name = "SyncRep";
             break;
+        case WAIT_EVENT_FDWXACT:
+            event_name = "FdwXact";
+            break;
+        case WAIT_EVENT_FDWXACT_RESOLUTION:
+            event_name = "FdwXactResolution";
+            break;
             /* no default case, so that compiler will warn */
     }
 
@@ -4068,6 +4079,15 @@ pgstat_get_wait_io(WaitEventIO w)
         case WAIT_EVENT_TWOPHASE_FILE_WRITE:
             event_name = "TwophaseFileWrite";
             break;
+        case WAIT_EVENT_FDWXACT_FILE_WRITE:
+            event_name = "FdwXactFileWrite";
+            break;
+        case WAIT_EVENT_FDWXACT_FILE_READ:
+            event_name = "FdwXactFileRead";
+            break;
+        case WAIT_EVENT_FDWXACT_FILE_SYNC:
+            event_name = "FdwXactFileSync";
+            break;
         case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
             event_name = "WALSenderTimelineHistoryRead";
             break;
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9ff2832c00..f92be8387d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -93,6 +93,8 @@
 #include <pthread.h>
 #endif
 
+#include "access/fdwxact_resolver.h"
+#include "access/fdwxact_launcher.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "bootstrap/bootstrap.h"
@@ -909,6 +911,10 @@ PostmasterMain(int argc, char *argv[])
         ereport(ERROR,
                 (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
 
+    if (max_prepared_foreign_xacts > 0 && max_foreign_xact_resolvers == 0)
+        ereport(ERROR,
+                (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires max_foreign_transaction_resolvers > 0")));
+
     /*
      * Other one-time internal sanity checks can go here, if they are fast.
      * (Put any slow processing further down, after postmaster.pid creation.)
@@ -984,12 +990,13 @@ PostmasterMain(int argc, char *argv[])
 #endif
 
     /*
-     * Register the apply launcher.  Since it registers a background worker,
-     * it needs to be called before InitializeMaxBackends(), and it's probably
-     * a good idea to call it before any modules had chance to take the
-     * background worker slots.
+     * Register the apply launcher and foreign transaction launcher.  Since
+     * it registers a background worker, it needs to be called before
+     * InitializeMaxBackends(), and it's probably a good idea to call it
+     * before any modules had chance to take the background worker slots.
      */
     ApplyLauncherRegister();
+    FdwXactLauncherRegister();
 
     /*
      * process any libraries that should be preloaded at postmaster start
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index bc532d027b..6269f384af 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -151,6 +151,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
         case RM_COMMIT_TS_ID:
         case RM_REPLORIGIN_ID:
         case RM_GENERIC_ID:
+        case RM_FDWXACT_ID:
             /* just deal with xid, and done */
             ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
                                     buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 4829953ee6..6bde7a735a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,8 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_launcher.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -147,6 +149,8 @@ CreateSharedMemoryAndSemaphores(void)
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
+        size = add_size(size, FdwXactShmemSize());
+        size = add_size(size, FdwXactRslvShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -263,6 +267,8 @@ CreateSharedMemoryAndSemaphores(void)
     BTreeShmemInit();
     SyncScanShmemInit();
     AsyncShmemInit();
+    FdwXactShmemInit();
+    FdwXactRslvShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 13bcbe77de..020eb76b6a 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -93,6 +93,8 @@ typedef struct ProcArrayStruct
     TransactionId replication_slot_xmin;
     /* oldest catalog xmin of any replication slot */
     TransactionId replication_slot_catalog_xmin;
+    /* local transaction id of oldest unresolved distributed transaction */
+    TransactionId fdwxact_unresolved_xmin;
 
     /* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
     int            pgprocnos[FLEXIBLE_ARRAY_MEMBER];
@@ -248,6 +250,7 @@ CreateSharedProcArray(void)
         procArray->lastOverflowedXid = InvalidTransactionId;
         procArray->replication_slot_xmin = InvalidTransactionId;
         procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+        procArray->fdwxact_unresolved_xmin = InvalidTransactionId;
     }
 
     allProcs = ProcGlobal->allProcs;
@@ -1312,6 +1315,7 @@ GetOldestXmin(Relation rel, int flags)
 
     TransactionId replication_slot_xmin = InvalidTransactionId;
     TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+    TransactionId fdwxact_unresolved_xmin = InvalidTransactionId;
 
     /*
      * If we're not computing a relation specific limit, or if a shared
@@ -1377,6 +1381,7 @@ GetOldestXmin(Relation rel, int flags)
      */
     replication_slot_xmin = procArray->replication_slot_xmin;
     replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+    fdwxact_unresolved_xmin = procArray->fdwxact_unresolved_xmin;
 
     if (RecoveryInProgress())
     {
@@ -1426,6 +1431,15 @@ GetOldestXmin(Relation rel, int flags)
         NormalTransactionIdPrecedes(replication_slot_xmin, result))
         result = replication_slot_xmin;
 
+    /*
+     * Check whether there are unresolved distributed transaction
+     * requiring an older xmin.
+     */
+    if (!(flags & PROCARRAY_FDWXACT_XMIN) &&
+        TransactionIdIsValid(fdwxact_unresolved_xmin) &&
+        NormalTransactionIdPrecedes(fdwxact_unresolved_xmin, result))
+        result = fdwxact_unresolved_xmin;
+
     /*
      * After locks have been released and vacuum_defer_cleanup_age has been
      * applied, check whether we need to back up further to make logical
@@ -3128,6 +3142,38 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
     LWLockRelease(ProcArrayLock);
 }
 
+/*
+ * ProcArraySetFdwXactUnresolvedXmin
+ *
+ * Install a limit on future computations of the xmin horizon, to prevent
+ * vacuum and clog truncation from removing transactions still needed to
+ * resolve distributed transactions.
+ */
+void
+ProcArraySetFdwXactUnresolvedXmin(TransactionId xmin)
+{
+    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+    procArray->fdwxact_unresolved_xmin = xmin;
+    LWLockRelease(ProcArrayLock);
+}
+
+/*
+ * ProcArrayGetFdwXactUnresolvedXmin
+ *
+ * Return the current unresolved xmin limits.
+ */
+TransactionId
+ProcArrayGetFdwXactUnresolvedXmin(void)
+{
+    TransactionId xmin;
+
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+    xmin = procArray->fdwxact_unresolved_xmin;
+    LWLockRelease(ProcArrayLock);
+
+    return xmin;
+}
 
 #define XidCacheRemove(i) \
     do { \
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843229..adb276370c 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,6 @@ MultiXactTruncationLock                41
 OldSnapshotTimeMapLock                42
 LogicalRepWorkerLock                43
 CLogTruncationLock                    44
+FdwXactLock                            45
+FdwXactResolverLock                    46
+FdwXactResolutionLock                47
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index fff0628e58..af5e418a03 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -35,6 +35,7 @@
 #include <unistd.h>
 #include <sys/time.h>
 
+#include "access/fdwxact.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
@@ -421,6 +422,10 @@ InitProcess(void)
     MyProc->syncRepState = SYNC_REP_NOT_WAITING;
     SHMQueueElemInit(&(MyProc->syncRepLinks));
 
+    /* Initialize fields for fdw xact */
+    MyProc->fdwXactState = FDWXACT_NOT_WAITING;
+    SHMQueueElemInit(&(MyProc->fdwXactLinks));
+
     /* Initialize fields for group XID clearing. */
     MyProc->procArrayGroupMember = false;
     MyProc->procArrayGroupMemberXid = InvalidTransactionId;
@@ -822,6 +827,9 @@ ProcKill(int code, Datum arg)
     /* Make sure we're out of the sync rep lists */
     SyncRepCleanupAtProcExit();
 
+    /* Make sure we're out of the fdwxact lists */
+    FdwXactCleanupAtProcExit();
+
 #ifdef USE_ASSERT_CHECKING
     {
         int            i;
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3b85e48333..a0f8498862 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -36,6 +36,8 @@
 #include "rusagestub.h"
 #endif
 
+#include "access/fdwxact_resolver.h"
+#include "access/fdwxact_launcher.h"
 #include "access/parallel.h"
 #include "access/printtup.h"
 #include "access/xact.h"
@@ -3029,6 +3031,18 @@ ProcessInterrupts(void)
              */
             proc_exit(1);
         }
+        else if (IsFdwXactResolver())
+            ereport(FATAL,
+                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                     errmsg("terminating foreign transaction resolver due to administrator command")));
+        else if (IsFdwXactLauncher())
+        {
+            /*
+             * The foreign transaction launcher can be stopped at any time.
+             * Use exit status 1 so the background worker is restarted.
+             */
+            proc_exit(1);
+        }
         else if (RecoveryConflictPending && RecoveryConflictRetryable)
         {
             pgstat_report_recovery_conflict(RecoveryConflictReason);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba74bf9f7d..d38c33b64c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/tableam.h"
@@ -399,6 +400,25 @@ static const struct config_enum_entry synchronous_commit_options[] = {
     {NULL, 0, false}
 };
 
+/*
+ * Although only "required", "prefer", and "disabled" are documented,
+ * we accept all the likely variants of "on" and "off".
+ */
+static const struct config_enum_entry foreign_twophase_commit_options[] = {
+    {"required", FOREIGN_TWOPHASE_COMMIT_REQUIRED, false},
+    {"prefer", FOREIGN_TWOPHASE_COMMIT_PREFER, false},
+    {"disabled", FOREIGN_TWOPHASE_COMMIT_DISABLED, false},
+    {"on", FOREIGN_TWOPHASE_COMMIT_REQUIRED, false},
+    {"off", FOREIGN_TWOPHASE_COMMIT_DISABLED, false},
+    {"true", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true},
+    {"false", FOREIGN_TWOPHASE_COMMIT_DISABLED, true},
+    {"yes", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true},
+    {"no", FOREIGN_TWOPHASE_COMMIT_DISABLED, true},
+    {"1", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true},
+    {"0", FOREIGN_TWOPHASE_COMMIT_DISABLED, true},
+    {NULL, 0, false}
+};
+
 /*
  * Although only "on", "off", "try" are documented, we accept all the likely
  * variants of "on" and "off".
@@ -725,6 +745,12 @@ const char *const config_group_names[] =
     gettext_noop("Client Connection Defaults / Other Defaults"),
     /* LOCK_MANAGEMENT */
     gettext_noop("Lock Management"),
+    /* FDWXACT */
+    gettext_noop("Foreign Transaction Management"),
+    /* FDWXACT_SETTINGS */
+    gettext_noop("Foreign Transaction Management / Settings"),
+    /* FDWXACT_RESOLVER */
+    gettext_noop("Foreign Transaction Management / Resolver"),
     /* COMPAT_OPTIONS */
     gettext_noop("Version and Platform Compatibility"),
     /* COMPAT_OPTIONS_PREVIOUS */
@@ -2370,6 +2396,52 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    /*
+     * See also CheckRequiredParameterValues() if this parameter changes
+     */
+    {
+        {"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+            gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+            NULL
+        },
+        &max_prepared_foreign_xacts,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"foreign_transaction_resolver_timeout", PGC_SIGHUP, FDWXACT_RESOLVER,
+            gettext_noop("Sets the maximum time to wait for foreign transaction resolution."),
+            NULL,
+            GUC_UNIT_MS
+        },
+        &foreign_xact_resolver_timeout,
+        60 * 1000, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"max_foreign_transaction_resolvers", PGC_POSTMASTER, RESOURCES_MEM,
+            gettext_noop("Maximum number of foreign transaction resolution processes."),
+            NULL
+        },
+        &max_foreign_xact_resolvers,
+        0, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
+    {
+        {"foreign_transaction_resolution_retry_interval", PGC_SIGHUP, FDWXACT_RESOLVER,
+         gettext_noop("Sets the time to wait before retrying to resolve foreign transaction "
+                      "after a failed attempt."),
+         NULL,
+         GUC_UNIT_MS
+        },
+        &foreign_xact_resolution_retry_interval,
+        5000, 1, INT_MAX,
+        NULL, NULL, NULL
+    },
+
 #ifdef LOCK_DEBUG
     {
         {"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
@@ -4413,6 +4485,16 @@ static struct config_enum ConfigureNamesEnum[] =
         NULL, assign_synchronous_commit, NULL
     },
 
+    {
+        {"foreign_twophase_commit", PGC_USERSET, FDWXACT_SETTINGS,
         gettext_noop("Sets the use of two-phase commit for transactions involving foreign servers."),
+            NULL
+        },
+        &foreign_twophase_commit,
+        FOREIGN_TWOPHASE_COMMIT_DISABLED, foreign_twophase_commit_options,
+        check_foreign_twophase_commit, NULL, NULL
+    },
+
     {
         {"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
             gettext_noop("Allows archiving of WAL files using archive_command."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9541879c1f..22e014aecd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -125,6 +125,8 @@
 #temp_buffers = 8MB            # min 800kB
 #max_prepared_transactions = 0        # zero disables the feature
                     # (change requires restart)
+#max_prepared_foreign_transactions = 0    # zero disables the feature
+                    # (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
 #work_mem = 4MB                # min 64kB
@@ -341,6 +343,20 @@
 #max_sync_workers_per_subscription = 2    # taken from max_logical_replication_workers
 
 
+#------------------------------------------------------------------------------
+# FOREIGN TRANSACTION
+#------------------------------------------------------------------------------
+
+#foreign_twophase_commit = off
+
+#max_foreign_transaction_resolvers = 0        # max number of resolver processes
+                        # (change requires restart)
+#foreign_transaction_resolver_timeout = 60s    # in milliseconds; 0 disables
+#foreign_transaction_resolution_retry_interval = 5s    # time to wait before
+                            # retrying to resolve
+                            # foreign transactions
+                            # after a failed attempt
+
 #------------------------------------------------------------------------------
 # QUERY TUNING
 #------------------------------------------------------------------------------
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index f08a49c9dd..dd8878025b 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
     probe multixact__checkpoint__done(bool);
     probe twophase__checkpoint__start();
     probe twophase__checkpoint__done();
+    probe fdwxact__checkpoint__start();
+    probe fdwxact__checkpoint__done();
 
     probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
     probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1f6d8939be..49dc5a519f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -210,6 +210,7 @@ static const char *const subdirs[] = {
     "pg_snapshots",
     "pg_subtrans",
     "pg_twophase",
+    "pg_fdwxact",
     "pg_multixact",
     "pg_multixact/members",
     "pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 19e21ab491..9ae3bfe4dd 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,6 +301,8 @@ main(int argc, char *argv[])
            ControlFile->max_wal_senders);
     printf(_("max_prepared_xacts setting:           %d\n"),
            ControlFile->max_prepared_xacts);
+    printf(_("max_prepared_foreign_transactions setting:   %d\n"),
+           ControlFile->max_prepared_foreign_xacts);
     printf(_("max_locks_per_xact setting:           %d\n"),
            ControlFile->max_locks_per_xact);
     printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 2e286f6339..c5ee22132e 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -710,6 +710,7 @@ GuessControlValues(void)
     ControlFile.max_wal_senders = 10;
     ControlFile.max_worker_processes = 8;
     ControlFile.max_prepared_xacts = 0;
+    ControlFile.max_prepared_foreign_xacts = 0;
     ControlFile.max_locks_per_xact = 64;
 
     ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -914,6 +915,7 @@ RewriteControlFile(void)
     ControlFile.max_wal_senders = 10;
     ControlFile.max_worker_processes = 8;
     ControlFile.max_prepared_xacts = 0;
+    ControlFile.max_prepared_foreign_xacts = 0;
     ControlFile.max_locks_per_xact = 64;
 
     /* The control file gets flushed here. */
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
new file mode 120000
index 0000000000..ce8c21880c
--- /dev/null
+++ b/src/bin/pg_waldump/fdwxactdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/fdwxactdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1..b616cea347 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -11,6 +11,7 @@
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact_xlog.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/include/access/fdwxact.h b/src/include/access/fdwxact.h
new file mode 100644
index 0000000000..147d41c708
--- /dev/null
+++ b/src/include/access/fdwxact.h
@@ -0,0 +1,165 @@
+/*
+ * fdwxact.h
+ *
+ * PostgreSQL global transaction manager
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact.h
+ */
+#ifndef FDWXACT_H
+#define FDWXACT_H
+
+#include "access/fdwxact_xlog.h"
+#include "access/xlogreader.h"
+#include "foreign/foreign.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "nodes/execnodes.h"
+#include "storage/backendid.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+/* fdwXactState */
+#define    FDWXACT_NOT_WAITING        0
+#define    FDWXACT_WAITING            1
+#define    FDWXACT_WAIT_COMPLETE    2
+
+/* Flag passed to FDW transaction management APIs */
+#define FDWXACT_FLAG_ONEPHASE        0x01    /* transaction can commit/rollback
+                                               without preparation */
+
+/* Enum for foreign_twophase_commit parameter */
+typedef enum
+{
+    FOREIGN_TWOPHASE_COMMIT_DISABLED,    /* disable foreign twophase commit */
+    FOREIGN_TWOPHASE_COMMIT_PREFER,        /* use twophase commit where available */
+    FOREIGN_TWOPHASE_COMMIT_REQUIRED    /* all foreign servers have to support
+                                           twophase commit */
+} ForeignTwophaseCommitLevel;
+
+/* Enum to track the status of foreign transaction */
+typedef enum
+{
+    FDWXACT_STATUS_INVALID,
+    FDWXACT_STATUS_INITIAL,
+    FDWXACT_STATUS_PREPARING,        /* foreign transaction is being prepared */
+    FDWXACT_STATUS_PREPARED,        /* foreign transaction is prepared */
+    FDWXACT_STATUS_COMMITTING,        /* foreign prepared transaction is to
+                                     * be committed */
+    FDWXACT_STATUS_ABORTING,        /* foreign prepared transaction is to be
+                                     * aborted */
+    FDWXACT_STATUS_RESOLVED
+} FdwXactStatus;
+
+typedef struct FdwXactData *FdwXact;
+
+/*
+ * Shared memory state of a single foreign transaction.
+ */
+typedef struct FdwXactData
+{
+    FdwXact            fdwxact_free_next;    /* Next free FdwXact entry */
+
+    Oid                dbid;            /* database oid where to find foreign server
+                                     * and user mapping */
+    TransactionId    local_xid;        /* XID of local transaction */
+    Oid                serverid;        /* foreign server where transaction takes
+                                     * place */
+    Oid                userid;            /* user who initiated the foreign
+                                     * transaction */
+    Oid                umid;
+    bool            indoubt;        /* Is an in-doubt transaction? */
+    slock_t            mutex;            /* Protect the above fields */
+
+    /* The status of the foreign transaction, protected by FdwXactLock */
+    FdwXactStatus     status;
+    /*
+     * Note that we need to keep track of two LSNs for each FdwXact. We keep
+     * track of the start LSN because this is the address we must use to read
+     * state data back from WAL when committing a FdwXact. We keep track of
+     * the end LSN because that is the LSN we need to wait for prior to
+     * commit.
+     */
+    XLogRecPtr    insert_start_lsn;        /* XLOG offset of inserting this entry start */
+    XLogRecPtr    insert_end_lsn;        /* XLOG offset of inserting this entry end */
+
+    bool        valid;            /* has the entry been completed and written to file? */
+    BackendId    held_by;        /* backend currently holding this entry */
+    bool        ondisk;            /* true if prepare state file is on disk */
+    bool        inredo;            /* true if entry was added via xlog_redo */
+
+    char        fdwxact_id[FDWXACT_ID_MAX_LEN];        /* prepared transaction identifier */
+} FdwXactData;
+
+/*
+ * Shared memory layout for maintaining foreign prepared transaction entries.
+ * Adding or removing FdwXact entry needs to hold FdwXactLock in exclusive mode,
+ * and iterating fdwXacts needs that in shared mode.
+ */
+typedef struct
+{
+    /* Head of linked list of free FdwXactData structs */
+    FdwXact        free_fdwxacts;
+
+    /* Number of valid foreign transaction entries */
+    int            num_fdwxacts;
+
+    /* Up to max_prepared_foreign_xacts entries in the array */
+    FdwXact        fdwxacts[FLEXIBLE_ARRAY_MEMBER];        /* Variable length array */
+} FdwXactCtlData;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+extern FdwXactCtlData *FdwXactCtl;
+
+/* State data for foreign transaction resolution, passed to FDW callbacks */
+typedef struct FdwXactRslvState
+{
+    /* Foreign transaction information */
+    char    *fdwxact_id;
+
+    ForeignServer    *server;
+    UserMapping        *usermapping;
+
+    int        flags;            /* OR of FDWXACT_FLAG_xx flags */
+} FdwXactRslvState;
+
+/* GUC parameters */
+extern int    max_prepared_foreign_xacts;
+extern int    max_foreign_xact_resolvers;
+extern int    foreign_xact_resolution_retry_interval;
+extern int    foreign_xact_resolver_timeout;
+extern int    foreign_twophase_commit;
+
+/* Function declarations */
+extern Size FdwXactShmemSize(void);
+extern void FdwXactShmemInit(void);
+extern void restoreFdwXactData(void);
+extern TransactionId PrescanFdwXacts(TransactionId oldestActiveXid);
+extern void RecoverFdwXacts(void);
+extern void AtEOXact_FdwXacts(bool is_commit);
+extern void AtPrepare_FdwXacts(void);
+extern bool fdwxact_exists(Oid dboid, Oid serverid, Oid userid);
+extern void CheckPointFdwXacts(XLogRecPtr redo_horizon);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FdwXacts(void);
+extern void KnownFdwXactRecreateFiles(XLogRecPtr redo_horizon);
+extern void FdwXactWaitToBeResolved(TransactionId wait_xid, bool commit);
+extern bool FdwXactIsForeignTwophaseCommitRequired(void);
+extern void FdwXactResolveTransactionAndReleaseWaiter(Oid dbid, TransactionId xid,
+                                                      PGPROC *waiter);
+extern bool FdwXactResolveInDoubtTransactions(Oid dbid);
+extern PGPROC *FdwXactGetWaiter(TimestampTz *nextResolutionTs_p, TransactionId *waitXid_p);
+extern void FdwXactCleanupAtProcExit(void);
+extern void RegisterFdwXactByRelId(Oid relid, bool modified);
+extern void RegisterFdwXactByServerId(Oid serverid, bool modified);
+extern void FdwXactMarkForeignServerAccessed(Oid relid, bool modified);
+extern bool check_foreign_twophase_commit(int *newval, void **extra,
+                                          GucSource source);
+extern bool FdwXactWaiterExists(Oid dbid);
+
+#endif   /* FDWXACT_H */
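As an aside for reviewers: the FdwXactStatus enum above implies a lifecycle (initial → preparing → prepared → committing/aborting → resolved). The following sketch is not part of the patch; it is an illustrative, self-contained C copy of the enum plus a hypothetical transition-check helper, just to make the intended state machine explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the FdwXactStatus enum from fdwxact.h (illustrative copy). */
typedef enum
{
    FDWXACT_STATUS_INVALID,
    FDWXACT_STATUS_INITIAL,
    FDWXACT_STATUS_PREPARING,
    FDWXACT_STATUS_PREPARED,
    FDWXACT_STATUS_COMMITTING,
    FDWXACT_STATUS_ABORTING,
    FDWXACT_STATUS_RESOLVED
} FdwXactStatus;

/*
 * Hypothetical helper (not in the patch): encodes which status changes the
 * lifecycle described by the enum comments would consider legal.
 */
static bool
fdwxact_status_transition_ok(FdwXactStatus from, FdwXactStatus to)
{
    switch (from)
    {
        case FDWXACT_STATUS_INITIAL:
            /* an entry is first prepared on the foreign server */
            return to == FDWXACT_STATUS_PREPARING;
        case FDWXACT_STATUS_PREPARING:
            /* PREPARE either succeeds or the transaction is aborted */
            return to == FDWXACT_STATUS_PREPARED ||
                   to == FDWXACT_STATUS_ABORTING;
        case FDWXACT_STATUS_PREPARED:
            /* a prepared foreign transaction is then committed or aborted */
            return to == FDWXACT_STATUS_COMMITTING ||
                   to == FDWXACT_STATUS_ABORTING;
        case FDWXACT_STATUS_COMMITTING:
        case FDWXACT_STATUS_ABORTING:
            /* resolution completes the lifecycle */
            return to == FDWXACT_STATUS_RESOLVED;
        default:
            return false;
    }
}
```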
diff --git a/src/include/access/fdwxact_launcher.h b/src/include/access/fdwxact_launcher.h
new file mode 100644
index 0000000000..dd0f5d16ff
--- /dev/null
+++ b/src/include/access/fdwxact_launcher.h
@@ -0,0 +1,29 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_launcher.h
+ *      PostgreSQL foreign transaction launcher definitions
+ *
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_launcher.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef FDWXACT_LAUNCHER_H
+#define FDWXACT_LAUNCHER_H
+
+#include "access/fdwxact.h"
+
+extern void FdwXactLauncherRegister(void);
+extern void FdwXactLauncherMain(Datum main_arg);
+extern void FdwXactLauncherRequestToLaunch(void);
+extern void FdwXactLauncherRequestToLaunchForRetry(void);
+extern void FdwXactLaunchOrWakeupResolver(void);
+extern Size FdwXactRslvShmemSize(void);
+extern void FdwXactRslvShmemInit(void);
+extern bool IsFdwXactLauncher(void);
+
+
+#endif    /* FDWXACT_LAUNCHER_H */
diff --git a/src/include/access/fdwxact_resolver.h b/src/include/access/fdwxact_resolver.h
new file mode 100644
index 0000000000..2607654024
--- /dev/null
+++ b/src/include/access/fdwxact_resolver.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.h
+ *      PostgreSQL foreign transaction resolver definitions
+ *
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_resolver.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_RESOLVER_H
+#define FDWXACT_RESOLVER_H
+
+#include "access/fdwxact.h"
+
+extern void FdwXactResolverMain(Datum main_arg);
+extern bool IsFdwXactResolver(void);
+
+extern int foreign_xact_resolver_timeout;
+
+#endif        /* FDWXACT_RESOLVER_H */
diff --git a/src/include/access/fdwxact_xlog.h b/src/include/access/fdwxact_xlog.h
new file mode 100644
index 0000000000..39ca66beef
--- /dev/null
+++ b/src/include/access/fdwxact_xlog.h
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_xlog.h
+ *      Foreign transaction XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_XLOG_H
+#define FDWXACT_XLOG_H
+
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDWXACT_INSERT    0x00
+#define XLOG_FDWXACT_REMOVE    0x10
+
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define FDWXACT_ID_MAX_LEN 200
+
+/*
+ * On-disk file structure, also used for the WAL record
+ */
+typedef struct
+{
+    TransactionId local_xid;
+    Oid            dbid;            /* database oid where to find foreign server
+                                 * and user mapping */
+    Oid            serverid;        /* foreign server where transaction takes
+                                 * place */
+    Oid            userid;            /* user who initiated the foreign transaction */
+    Oid            umid;
+    char        fdwxact_id[FDWXACT_ID_MAX_LEN]; /* foreign txn prepare id */
+} FdwXactOnDiskData;
+
+typedef struct xl_fdwxact_remove
+{
+    TransactionId xid;
+    Oid            serverid;
+    Oid            userid;
+    Oid            dbid;
+    bool        force;
+} xl_fdwxact_remove;
+
+extern void fdwxact_redo(XLogReaderState *record);
+extern void fdwxact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdwxact_identify(uint8 info);
+
+#endif    /* FDWXACT_XLOG_H */
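For reviewers unfamiliar with the twophase.c precedent this header borrows from: the state is logged and stored on disk as the raw struct and read back byte for byte. The sketch below is illustrative only (the typedef stand-ins and the read/write helpers are hypothetical, not part of the patch); it shows the round trip the checkpoint/recovery code would perform on FdwXactOnDiskData.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal stand-ins for PostgreSQL typedefs (illustrative only). */
typedef uint32_t TransactionId;
typedef uint32_t Oid;

#define FDWXACT_ID_MAX_LEN 200

/* Mirrors FdwXactOnDiskData from fdwxact_xlog.h. */
typedef struct
{
    TransactionId local_xid;
    Oid         dbid;
    Oid         serverid;
    Oid         userid;
    Oid         umid;
    char        fdwxact_id[FDWXACT_ID_MAX_LEN];
} FdwXactOnDiskData;

/* Hypothetical helpers: write the record as raw bytes and read it back. */
static void
fdwxact_record_write(char *buf, const FdwXactOnDiskData *rec)
{
    memcpy(buf, rec, sizeof(FdwXactOnDiskData));
}

static void
fdwxact_record_read(FdwXactOnDiskData *rec, const char *buf)
{
    memcpy(rec, buf, sizeof(FdwXactOnDiskData));
}
```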
diff --git a/src/include/access/resolver_internal.h b/src/include/access/resolver_internal.h
new file mode 100644
index 0000000000..55fc970b69
--- /dev/null
+++ b/src/include/access/resolver_internal.h
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * resolver_internal.h
+ *      Internal headers shared by fdwxact resolvers.
+ *
+ * Portions Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * src/include/access/resolver_internal.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RESOLVER_INTERNAL_H
+#define RESOLVER_INTERNAL_H
+
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/timestamp.h"
+
+/*
+ * Each foreign transaction resolver has a FdwXactResolver struct in
+ * shared memory.  This struct is protected by FdwXactResolverLaunchLock.
+ */
+typedef struct FdwXactResolver
+{
+    pid_t    pid;    /* this resolver's PID, or 0 if not active */
+    Oid        dbid;    /* database oid */
+
+    /* Indicates if this slot is used or free */
+    bool    in_use;
+
+    /* Stats */
+    TimestampTz    last_resolved_time;
+
+    /* Protect shared variables shown above */
+    slock_t    mutex;
+
+    /*
+     * Pointer to the resolver's latch. Used by backends to wake up this
+     * resolver when it has work to do. NULL if the resolver isn't active.
+     */
+    Latch    *latch;
+} FdwXactResolver;
+
+/* There is one FdwXactRslvCtlData struct for the whole database cluster */
+typedef struct FdwXactRslvCtlData
+{
+    /* Foreign transaction resolution queue. Protected by FdwXactLock */
+    SHM_QUEUE    fdwxact_queue;
+
+    /* Supervisor process and latch */
+    pid_t        launcher_pid;
+    Latch        *launcher_latch;
+
+    FdwXactResolver resolvers[FLEXIBLE_ARRAY_MEMBER];
+} FdwXactRslvCtlData;
+#define SizeOfFdwXactRslvCtlData \
+    (offsetof(FdwXactRslvCtlData, resolvers) + sizeof(FdwXactResolver))
+
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+
+extern FdwXactResolver *MyFdwXactResolver;
+
+#endif    /* RESOLVER_INTERNAL_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2ccf5..5798b4cd99 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDWXACT_ID, "Foreign Transactions", fdwxact_redo, fdwxact_desc, fdwxact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 02b5315c43..e8c094d708 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -36,6 +36,7 @@ extern void PostPrepare_Twophase(void);
 
 extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held);
 extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid, bool lock_held);
+extern bool    TwoPhaseExists(TransactionId xid);
 
 extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
                                          TimestampTz prepared_at,
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index cb5c4935d2..a75e6998f0 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -108,6 +108,13 @@ extern int    MyXactFlags;
  */
 #define XACT_FLAGS_WROTENONTEMPREL                (1U << 2)
 
+/*
+ * XACT_FLAGS_FDWNOPREPARE - set when we wrote data on a foreign table whose
+ * server is not capable of two-phase commit.
+ */
+#define XACT_FLAGS_FDWNOPREPARE                    (1U << 3)
+
 /*
  *    start- and end-of-transaction callbacks for dynamically loaded modules
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e295dc65fb..d1ce20242f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -232,6 +232,7 @@ typedef struct xl_parameter_change
     int            max_worker_processes;
     int            max_wal_senders;
     int            max_prepared_xacts;
+    int            max_prepared_foreign_xacts;
     int            max_locks_per_xact;
     int            wal_level;
     bool        wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index cf7d4485e9..f2174a0208 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -179,6 +179,7 @@ typedef struct ControlFileData
     int            max_worker_processes;
     int            max_wal_senders;
     int            max_prepared_xacts;
+    int            max_prepared_foreign_xacts;
     int            max_locks_per_xact;
     bool        track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b219..1072c38aa6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5184,6 +5184,13 @@
   proargmodes => '{i,o,o,o,o,o,o,o,o}',
   proargnames => '{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}',
   prosrc => 'pg_stat_get_subscription' },
+{ oid => '9705', descr => 'statistics: information about foreign transaction resolver',
+  proname => 'pg_stat_get_foreign_xact', proisstrict => 'f', provolatile => 's',
+  proparallel => 'r', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,oid,timestamptz}',
+  proargmodes => '{o,o,o}',
+  proargnames => '{pid,dbid,last_resolved_time}',
+  prosrc => 'pg_stat_get_foreign_xact' },
 { oid => '2026', descr => 'statistics: current backend PID',
   proname => 'pg_backend_pid', provolatile => 's', proparallel => 'r',
   prorettype => 'int4', proargtypes => '', prosrc => 'pg_backend_pid' },
@@ -5897,6 +5904,24 @@
   proargnames => '{type,object_names,object_args,classid,objid,objsubid}',
   prosrc => 'pg_get_object_address' },
 
+{ oid => '9706', descr => 'view foreign transactions',
+  proname => 'pg_foreign_xacts', prorows => '1000', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,xid,oid,oid,text,bool,text}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{dbid,xid,serverid,userid,status,in_doubt,identifier}',
+  prosrc => 'pg_foreign_xacts' },
+{ oid => '9707', descr => 'remove foreign transaction without resolution',
+  proname => 'pg_remove_foreign_xact', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'xid oid oid',
+  proargnames => '{xid,serverid,userid}',
+  prosrc => 'pg_remove_foreign_xact' },
+{ oid => '9708', descr => 'resolve one foreign transaction',
+  proname => 'pg_resolve_foreign_xact', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'xid oid oid',
+  proargnames => '{xid,serverid,userid}',
+  prosrc => 'pg_resolve_foreign_xact' },
+
 { oid => '2079', descr => 'is table visible in search path?',
   proname => 'pg_table_is_visible', procost => '10', provolatile => 's',
   prorettype => 'bool', proargtypes => 'oid', prosrc => 'pg_table_is_visible' },
@@ -6015,6 +6040,10 @@
 { oid => '2851', descr => 'wal filename, given a wal location',
   proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn',
   prosrc => 'pg_walfile_name' },
+{ oid => '9709',
+  descr => 'stop a foreign transaction resolver process running on the given database',
+  proname => 'pg_stop_foreign_xact_resolver', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'oid', prosrc => 'pg_stop_foreign_xact_resolver'},
 
 { oid => '3165', descr => 'difference in bytes, given two wal locations',
   proname => 'pg_wal_lsn_diff', prorettype => 'numeric',
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 822686033e..c7b33d72ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdwxact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/pathnodes.h"
@@ -169,6 +170,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
                                                             List *fdw_private,
                                                             RelOptInfo *child_rel);
+typedef void (*PrepareForeignTransaction_function) (FdwXactRslvState *frstate);
+typedef void (*CommitForeignTransaction_function) (FdwXactRslvState *frstate);
+typedef void (*RollbackForeignTransaction_function) (FdwXactRslvState *frstate);
+typedef char *(*GetPrepareId_function) (TransactionId xid, Oid serverid,
+                                        Oid userid, int *prep_id_len);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -236,6 +242,12 @@ typedef struct FdwRoutine
     /* Support functions for IMPORT FOREIGN SCHEMA */
     ImportForeignSchema_function ImportForeignSchema;
 
+    /* Support functions for transaction management */
+    PrepareForeignTransaction_function PrepareForeignTransaction;
+    CommitForeignTransaction_function CommitForeignTransaction;
+    RollbackForeignTransaction_function RollbackForeignTransaction;
+    GetPrepareId_function GetPrepareId;
+
     /* Support functions for parallelism under Gather node */
     IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
     EstimateDSMForeignScan_function EstimateDSMForeignScan;
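To make the new FdwRoutine callbacks concrete: a postgres_fdw-style implementation would presumably issue PREPARE TRANSACTION / COMMIT PREPARED on the remote, falling back to a plain COMMIT when FDWXACT_FLAG_ONEPHASE is set. The sketch below is not from the patch — build_commit_command() is a hypothetical helper — it only illustrates the SQL such a CommitForeignTransaction callback might send.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the flag from fdwxact.h. */
#define FDWXACT_FLAG_ONEPHASE 0x01

/*
 * Hypothetical helper: build the SQL a CommitForeignTransaction callback
 * might send to the remote server.  With FDWXACT_FLAG_ONEPHASE the
 * transaction was never prepared, so a plain COMMIT suffices; otherwise
 * the previously prepared transaction is committed by its identifier.
 */
static void
build_commit_command(char *sql, size_t len, const char *fdwxact_id, int flags)
{
    if (flags & FDWXACT_FLAG_ONEPHASE)
        snprintf(sql, len, "COMMIT TRANSACTION");
    else
        snprintf(sql, len, "COMMIT PREPARED '%s'", fdwxact_id);
}
```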
diff --git a/src/include/foreign/foreign.h b/src/include/foreign/foreign.h
index 4de157c19c..91c2276915 100644
--- a/src/include/foreign/foreign.h
+++ b/src/include/foreign/foreign.h
@@ -69,6 +69,7 @@ extern ForeignServer *GetForeignServerExtended(Oid serverid,
                                                bits16 flags);
 extern ForeignServer *GetForeignServerByName(const char *name, bool missing_ok);
 extern UserMapping *GetUserMapping(Oid userid, Oid serverid);
+extern UserMapping *GetUserMappingByOid(Oid umid);
 extern ForeignDataWrapper *GetForeignDataWrapper(Oid fdwid);
 extern ForeignDataWrapper *GetForeignDataWrapperExtended(Oid fdwid,
                                                          bits16 flags);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fe076d823d..d82d8f7abc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -776,6 +776,8 @@ typedef enum
     WAIT_EVENT_BGWRITER_HIBERNATE,
     WAIT_EVENT_BGWRITER_MAIN,
     WAIT_EVENT_CHECKPOINTER_MAIN,
+    WAIT_EVENT_FDWXACT_RESOLVER_MAIN,
+    WAIT_EVENT_FDWXACT_LAUNCHER_MAIN,
     WAIT_EVENT_LOGICAL_APPLY_MAIN,
     WAIT_EVENT_LOGICAL_LAUNCHER_MAIN,
     WAIT_EVENT_PGSTAT_MAIN,
@@ -853,7 +855,9 @@ typedef enum
     WAIT_EVENT_REPLICATION_ORIGIN_DROP,
     WAIT_EVENT_REPLICATION_SLOT_DROP,
     WAIT_EVENT_SAFE_SNAPSHOT,
-    WAIT_EVENT_SYNC_REP
+    WAIT_EVENT_SYNC_REP,
+    WAIT_EVENT_FDWXACT,
+    WAIT_EVENT_FDWXACT_RESOLUTION
 } WaitEventIPC;
 
 /* ----------
@@ -933,6 +937,9 @@ typedef enum
     WAIT_EVENT_TWOPHASE_FILE_READ,
     WAIT_EVENT_TWOPHASE_FILE_SYNC,
     WAIT_EVENT_TWOPHASE_FILE_WRITE,
+    WAIT_EVENT_FDWXACT_FILE_READ,
+    WAIT_EVENT_FDWXACT_FILE_WRITE,
+    WAIT_EVENT_FDWXACT_FILE_SYNC,
     WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
     WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
     WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 281e1db725..c802201193 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/xlogdefs.h"
+#include "datatype/timestamp.h"
 #include "lib/ilist.h"
 #include "storage/latch.h"
 #include "storage/lock.h"
@@ -152,6 +153,16 @@ struct PGPROC
     int            syncRepState;    /* wait state for sync rep */
     SHM_QUEUE    syncRepLinks;    /* list link if process is in syncrep queue */
 
+    /*
+     * Info to allow us to wait for foreign transaction to be resolved, if
+     * needed.
+     */
+    TransactionId    fdwXactWaitXid;    /* waiting for foreign transaction involved with
+                                     * this transaction id to be resolved */
+    int            fdwXactState;    /* wait state for foreign transaction resolution */
+    SHM_QUEUE    fdwXactLinks;    /* list link if process is in queue */
+    TimestampTz fdwXactNextResolutionTs;
+
     /*
      * All PROCLOCK objects for locks held or awaited by this backend are
      * linked into one of these lists, according to the partition number of
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 8f67b860e7..deb293c1a9 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -36,6 +36,8 @@
 
 #define        PROCARRAY_SLOTS_XMIN            0x20    /* replication slot xmin,
                                                      * catalog_xmin */
+#define        PROCARRAY_FDWXACT_XMIN            0x40    /* unresolved distributed
+                                                       transaction xmin */
 /*
  * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
  * PGXACT->vacuumFlags. Other flags are used for different purposes and
@@ -125,4 +127,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
                                             TransactionId *catalog_xmin);
 
+
+extern void ProcArraySetFdwXactUnresolvedXmin(TransactionId xmin);
+extern TransactionId ProcArrayGetFdwXactUnresolvedXmin(void);
 #endif                            /* PROCARRAY_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index d68976fafa..d5fec50969 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -96,6 +96,9 @@ enum config_group
     CLIENT_CONN_PRELOAD,
     CLIENT_CONN_OTHER,
     LOCK_MANAGEMENT,
+    FDWXACT,
+    FDWXACT_SETTINGS,
+    FDWXACT_RESOLVER,
     COMPAT_OPTIONS,
     COMPAT_OPTIONS_PREVIOUS,
     COMPAT_OPTIONS_CLIENT,
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c9cc569404..ed229d5a67 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1341,6 +1341,14 @@ pg_file_settings| SELECT a.sourcefile,
     a.applied,
     a.error
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error);
+pg_foreign_xacts| SELECT f.dbid,
+    f.xid,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.in_doubt,
+    f.identifier
+   FROM pg_foreign_xacts() f(dbid, xid, serverid, userid, status, in_doubt, identifier);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
     ARRAY( SELECT pg_auth_members.member
@@ -1841,6 +1849,11 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
     pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
     pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
    FROM pg_database d;
+pg_stat_foreign_xact| SELECT r.pid,
+    r.dbid,
+    r.last_resolved_time
+   FROM pg_stat_get_foreign_xact() r(pid, dbid, last_resolved_time)
+  WHERE (r.pid IS NOT NULL);
 pg_stat_gssapi| SELECT s.pid,
     s.gss_auth AS gss_authenticated,
     s.gss_princ AS principal,
-- 
2.23.0

From 3363abd531595233fb59e0ab6078a011ab8060e9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 17:01:08 +0900
Subject: [PATCH v26 3/5] Documentation update.

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 doc/src/sgml/catalogs.sgml                | 145 +++++++++++++
 doc/src/sgml/config.sgml                  | 146 ++++++++++++-
 doc/src/sgml/distributed-transaction.sgml | 158 +++++++++++++++
 doc/src/sgml/fdwhandler.sgml              | 236 ++++++++++++++++++++++
 doc/src/sgml/filelist.sgml                |   1 +
 doc/src/sgml/func.sgml                    |  89 ++++++++
 doc/src/sgml/monitoring.sgml              |  60 ++++++
 doc/src/sgml/postgres.sgml                |   1 +
 doc/src/sgml/storage.sgml                 |   6 +
 9 files changed, 841 insertions(+), 1 deletion(-)
 create mode 100644 doc/src/sgml/distributed-transaction.sgml

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 55694c4368..1b720da03d 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8267,6 +8267,11 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
       <entry>open cursors</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-foreign-xacts"><structname>pg_foreign_xacts</structname></link></entry>
+      <entry>foreign transactions</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-file-settings"><structname>pg_file_settings</structname></link></entry>
       <entry>summary of configuration file contents</entry>
@@ -9712,6 +9717,146 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 
  </sect1>
 
+ <sect1 id="view-pg-foreign-xacts">
+  <title><structname>pg_foreign_xacts</structname></title>
+
+  <indexterm zone="view-pg-foreign-xacts">
+   <primary>pg_foreign_xacts</primary>
+  </indexterm>
+
+  <para>
+   The view <structname>pg_foreign_xacts</structname> displays
+   information about foreign transactions that are opened on
+   foreign servers for atomic distributed transaction commit (see
+   <xref linkend="atomic-commit"/> for details).
+  </para>
+
+  <para>
+   <structname>pg_foreign_xacts</structname> contains one row per foreign
+   transaction.  An entry is removed when the foreign transaction is
+   committed or rolled back.
+  </para>
+
+  <table>
+   <title><structname>pg_foreign_xacts</structname> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry><structfield>dbid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-database"><structname>pg_database</structname></link>.oid</literal></entry>
+      <entry>
+       OID of the database which the foreign transaction resides in
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>xid</structfield></entry>
+      <entry><type>xid</type></entry>
+      <entry></entry>
+      <entry>
+       Numeric identifier of the local transaction with which this foreign
+       transaction is associated
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>serverid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-foreign-server"><structname>pg_foreign_server</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the foreign server on which the foreign transaction is prepared
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>userid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="view-pg-user"><structname>pg_user</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the user that prepared this foreign transaction.
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>status</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       Status of foreign transaction. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>initial</literal> : Initial status.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>preparing</literal> : This foreign transaction is being prepared.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>prepared</literal> : This foreign transaction has been prepared.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>committing</literal> : This foreign transaction is being committed.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>aborting</literal> : This foreign transaction is being aborted.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>resolved</literal> : This foreign transaction has been resolved.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>in_doubt</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       If <literal>true</literal>, this foreign transaction is in-doubt and
+       needs to be resolved by calling the
+       <function>pg_resolve_foreign_xact</function> function.
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>identifier</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       The identifier of the prepared foreign transaction.
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   When the <structname>pg_foreign_xacts</structname> view is accessed, the
+   internal transaction manager data structures are momentarily locked, and
+   a copy is made for the view to display.  This ensures that the
+   view produces a consistent set of results, while not blocking
+   normal operations longer than necessary.  Nonetheless
+   there could be some impact on database performance if this view is
+   frequently accessed.
+  </para>
+
+ </sect1>
+
  <sect1 id="view-pg-publication-tables">
   <title><structname>pg_publication_tables</structname></title>
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 53ac14490a..69778750f3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4378,7 +4378,6 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
 
      </variablelist>
     </sect2>
-
    </sect1>
 
    <sect1 id="runtime-config-query">
@@ -8818,6 +8817,151 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
      </variablelist>
    </sect1>
 
+   <sect1 id="runtime-config-distributed-transaction">
+    <title>Distributed Transaction Management</title>
+
+    <sect2 id="runtime-config-distributed-transaction-settings">
+     <title>Settings</title>
+     <variablelist>
+
+      <varlistentry id="guc-foreign-twophase-commit" xreflabel="foreign_twophase_commit">
+       <term><varname>foreign_twophase_commit</varname> (<type>enum</type>)
+        <indexterm>
+         <primary><varname>foreign_twophase_commit</varname> configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies whether transaction commit will wait for all involved foreign
+         transactions to be resolved before the command returns a "success"
+         indication to the client. Valid values are <literal>required</literal>,
+         <literal>prefer</literal> and <literal>disabled</literal>. The default
+         setting is <literal>disabled</literal>. When set to
+         <literal>disabled</literal>, the two-phase commit protocol is not used
+         to commit or roll back distributed transactions. When set to
+         <literal>required</literal>, the distributed transaction strictly
+         requires that all written servers can use the two-phase commit protocol.
+         That is, the distributed transaction cannot commit if even one server
+         does not support the transaction management callback routines
+         (described in <xref linkend="fdw-callbacks-transaction-managements"/>).
+         When set to <literal>prefer</literal>, the distributed transaction uses
+         the two-phase commit protocol only on servers where it is available, and
+         commits in a single phase on the others. Note that with
+         <literal>disabled</literal> or <literal>prefer</literal>, there is a
+         risk of inconsistency among the servers involved in the distributed
+         transaction if a foreign server crashes while the distributed
+         transaction is being committed.
+        </para>
+
+        <para>
+         Both <varname>max_prepared_foreign_transactions</varname> and
+         <varname>max_foreign_transaction_resolvers</varname> must be set to
+         nonzero values in order to set this parameter to either
+         <literal>required</literal> or <literal>prefer</literal>.
+        </para>
+
+        <para>
+         This parameter can be changed at any time; the behavior for any one
+         transaction is determined by the setting in effect when it commits.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+       <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+        <indexterm>
+         <primary><varname>max_prepared_foreign_transactions</varname> configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Sets the maximum number of foreign transactions that can be prepared
+         simultaneously. A single local transaction can give rise to multiple
+         foreign transactions. If each of <literal>N</literal> local transactions
+         spans <literal>K</literal> foreign servers, this value needs to be set
+         to <literal>N * K</literal>, not just <literal>N</literal>.
+         This parameter can only be set at server start.
+        </para>
+        <para>
+         When running a standby server, you must set this parameter to the
+         same value as on the master server, or higher. Otherwise, queries
+         will not be allowed on the standby server.
+        </para>
+       </listitem>
+      </varlistentry>
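To make the <literal>N * K</literal> sizing above concrete, here is an illustrative fragment (the numbers are made up, not recommendations): with up to 10 concurrent local transactions each writing to 4 foreign servers, the parameter would be sized as follows.

```
# postgresql.conf (illustrative sizing, assuming N = 10 and K = 4)
# 10 concurrent local transactions x 4 foreign servers each = 40
max_prepared_foreign_transactions = 40
```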
+
+     </variablelist>
+    </sect2>
+
+    <sect2 id="runtime-config-foreign-transaction-resolver">
+     <title>Foreign Transaction Resolvers</title>
+
+     <para>
+      These settings control the behavior of a foreign transaction resolver.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-max-foreign-transaction-resolvers" xreflabel="max_foreign_transaction_resolvers">
+       <term><varname>max_foreign_transaction_resolvers</varname> (<type>int</type>)
+        <indexterm>
+         <primary><varname>max_foreign_transaction_resolvers</varname> configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the maximum number of foreign transaction resolver processes.
+         A foreign transaction resolver is responsible for foreign transaction
+         resolution on one database.
+        </para>
+        <para>
+         Foreign transaction resolution workers are taken from the pool defined by
+         <varname>max_worker_processes</varname>.
+        </para>
+        <para>
+         The default value is 0.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-foreign-transaction-resolution-rety-interval" xreflabel="foreign_transaction_resolution_retry_interval">
+       <term><varname>foreign_transaction_resolution_retry_interval</varname> (<type>integer</type>)
+        <indexterm>
+         <primary><varname>foreign_transaction_resolution_retry_interval</varname> configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies how long the foreign transaction resolver should wait after a
+         failed resolution attempt before retrying to resolve the foreign
+         transaction. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command line.
+        </para>
+        <para>
+         The default value is 10 seconds.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout">
+       <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>)
+        <indexterm>
+         <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Terminates foreign transaction resolver processes that have had no
+         foreign transactions to resolve for longer than the specified number of
+         milliseconds. A value of zero disables the timeout mechanism, meaning
+         the resolver stays connected to one database until it is stopped
+         manually. This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command line.
+        </para>
+        <para>
+         The default value is 60 seconds.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+    </sect2>
+   </sect1>
+
    <sect1 id="runtime-config-compatible">
     <title>Version and Platform Compatibility</title>
 
diff --git a/doc/src/sgml/distributed-transaction.sgml b/doc/src/sgml/distributed-transaction.sgml
new file mode 100644
index 0000000000..350b1afe68
--- /dev/null
+++ b/doc/src/sgml/distributed-transaction.sgml
@@ -0,0 +1,158 @@
+<!-- doc/src/sgml/distributed-transaction.sgml -->
+
+<chapter id="distributed-transaction">
+ <title>Distributed Transaction</title>
+
+ <para>
+  A distributed transaction is a transaction in which two or more network hosts
+  are involved. <productname>PostgreSQL</productname>'s global transaction
+  manager supports distributed transactions that access foreign servers using
+  Foreign Data Wrappers. The global transaction manager is responsible for
+  managing transactions on foreign servers.
+ </para>
+
+ <sect1 id="atomic-commit">
+  <title>Atomic Commit</title>
+
+  <para>
+   Atomic commit of a distributed transaction is an operation that applies a set
+   of changes as a single operation globally. This guarantees all-or-nothing
+   results for the changes on all remote hosts involved in the transaction.
+   <productname>PostgreSQL</productname> provides a way to perform read-write
+   transactions with foreign resources using foreign data wrappers.
+   Using <productname>PostgreSQL</productname>'s atomic commit ensures that
+   all changes on foreign servers end in either commit or rollback using the
+   transaction callback routines
+  </para>
+
+  <sect2>
+   <title>Atomic Commit Using Two-phase Commit Protocol</title>
+
+   <para>
+    To achieve atomic commit among all foreign servers,
+    <productname>PostgreSQL</productname> employs the two-phase commit protocol,
+    which is a type of atomic commitment protocol (ACP).
+    The <productname>PostgreSQL</productname> server that receives the SQL
+    statements is called the <firstterm>coordinator node</firstterm> and is
+    responsible for coordinating all the participating transactions. Using the
+    two-phase commit protocol, the commit sequence of a distributed transaction
+    proceeds in the following steps.
+    <orderedlist>
+     <listitem>
+      <para>
+       Prepare all transactions on foreign servers.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Commit locally.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Resolve all prepared transactions on foreign servers.
+      </para>
+     </listitem>
+    </orderedlist>
+
+   </para>
+
+   <para>
+    In the first step, the <productname>PostgreSQL</productname> distributed
+    transaction manager prepares all transactions on the foreign servers if
+    two-phase commit is required. Two-phase commit is required when the
+    transaction modifies data on two or more servers, including the local
+    server itself, and <xref linkend="guc-foreign-twophase-commit"/> is
+    <literal>required</literal> or <literal>prefer</literal>. If all
+    preparations on the foreign servers succeed, processing moves to the next
+    step. If any failure happens in this step,
+    <productname>PostgreSQL</productname> switches to rollback and rolls back
+    all transactions on both the local and foreign servers.
+   </para>
+
+   <para>
+    In the local commit step, <productname>PostgreSQL</productname> commits the
+    transaction locally. If any failure happens in this step,
+    <productname>PostgreSQL</productname> switches to rollback and rolls back
+    all transactions on both the local and foreign servers.
+   </para>
+
+   <para>
+    In the final step, the prepared transactions are resolved by a foreign
+    transaction resolver process.
+   </para>
+  </sect2>
+
+  <sect2 id="atomic-commit-transaction-resolution">
+   <title>Foreign Transaction Resolver Processes</title>
+
+   <para>
+    Foreign transaction resolver processes are auxiliary processes that are
+    responsible for foreign transaction resolution. They commit or roll back all
+    prepared transactions on foreign servers if the coordinator received
+    agreement messages from all foreign servers during the first step.
+   </para>
+
+   <para>
+    One foreign transaction resolver is responsible for transaction resolution
+    on one database on the coordinator side. On failure during resolution, a
+    resolver retries at intervals of
+    <varname>foreign_transaction_resolution_retry_interval</varname>.
+   </para>
+
+   <note>
+    <para>
+     While a foreign transaction resolver process is connected to a database,
+     that database cannot be dropped. To drop the database, call the
+     <function>pg_stop_fdwxact_resolver</function> function first to stop the
+     resolver process.
+    </para>
+   </note>
+  </sect2>
+
+  <sect2 id="atomic-commit-in-doubt-transaction">
+   <title>Manual Resolution of In-Doubt Transactions</title>
+
+   <para>
+    The atomic commit mechanism ensures that all foreign servers either commit
+    or roll back using the two-phase commit protocol. However, a distributed
+    transaction can become <firstterm>in-doubt</firstterm> in three cases: when
+    a foreign server crashed or connectivity to it was lost while its foreign
+    transaction was being prepared, when the coordinator node crashed while
+    preparing or resolving the distributed transaction, and when the user
+    canceled the query. You can check for in-doubt transactions in the
+    <xref linkend="view-pg-foreign-xacts"/> view. These foreign transactions
+    need to be resolved by using the
+    <function>pg_resolve_foreign_xact</function> function.
+    <productname>PostgreSQL</productname> does not have a facility to
+    automatically resolve in-doubt transactions. This behavior might change in
+    a future release.
+   </para>
+  </sect2>
+
+  <sect2 id="atomic-commit-monitoring">
+   <title>Monitoring</title>
+   <para>
+    The monitoring information about foreign transaction resolvers is visible in
+    <link linkend="pg-stat-foreign-xact-view"><literal>pg_stat_foreign_xact</literal></link>
+    view. This view contains one row for every foreign transaction resolver worker.
+   </para>
+  </sect2>
+
+  <sect2>
+   <title>Configuration Settings</title>
+
+   <para>
+    Atomic commit requires several configuration options to be set.
+   </para>
+
+   <para>
+    On the coordinator side, <xref linkend="guc-max-prepared-foreign-transactions"/> and
+    <xref linkend="guc-max-foreign-transaction-resolvers"/> must be set to
+    nonzero values. Additionally, <varname>max_worker_processes</varname> may
+    need to be adjusted to accommodate the foreign transaction resolver workers,
+    to at least
+    (<varname>max_foreign_transaction_resolvers</varname> + <literal>1</literal>).
+    Note that some extensions and parallel queries also take worker slots from
+    <varname>max_worker_processes</varname>.
+   </para>
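Putting the requirements above together, a coordinator-side configuration might look like the following sketch (all values are illustrative assumptions, not recommendations):

```
# postgresql.conf on the coordinator (illustrative values)
foreign_twophase_commit = 'required'
max_prepared_foreign_transactions = 40  # N transactions x K servers
max_foreign_transaction_resolvers = 4   # one per database needing resolution
max_worker_processes = 16               # resolvers + launcher + other workers
```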
+
+  </sect2>
+ </sect1>
+</chapter>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 6587678af2..dd0358ef22 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1415,6 +1415,127 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
     </para>
    </sect2>
 
+   <sect2 id="fdw-callbacks-transaction-managements">
+    <title>FDW Routines for Transaction Management</title>
+
+    <para>
+     Transaction management callbacks are used to commit, roll back, and
+     prepare foreign transactions. If an FDW wishes its foreign
+     transactions to be managed by <productname>PostgreSQL</productname>'s
+     global transaction manager, it must provide both
+     <function>CommitForeignTransaction</function> and
+     <function>RollbackForeignTransaction</function>. In addition, if an FDW
+     wishes to support <firstterm>atomic commit</firstterm> (as described in
+     <xref linkend="fdw-transaction-managements"/>), it must also provide
+     <function>PrepareForeignTransaction</function>, and can optionally provide
+     the <function>GetPrepareId</function> callback.
+    </para>
+
+    <para>
+<programlisting>
+void
+PrepareForeignTransaction(FdwXactRslvState *frstate);
+</programlisting>
+    Prepares the transaction on the foreign server. This function is called in
+    the pre-commit phase of the local transaction if foreign two-phase commit is
+    required. This function is used only for distributed transaction management
+    (see <xref linkend="distributed-transaction"/>).
+    </para>
+
+    <para>
+     Note that this callback function is always executed by backend processes.
+    </para>
+    <para>
+<programlisting>
+bool
+CommitForeignTransaction(FdwXactRslvState *frstate);
+</programlisting>
+    Commits the foreign transaction. This function is called either in
+    the pre-commit phase of the local transaction, if the transaction
+    can be committed in one phase, or in the post-commit phase, if
+    two-phase commit is required. If <literal>frstate->flags</literal> includes
+    the flag <literal>FDW_XACT_FLAG_ONEPHASE</literal>, the transaction can be
+    committed in one phase; otherwise this function must commit the prepared
+    transaction identified by <literal>frstate->fdwxact_id</literal>.
+    </para>
+
+    <para>
+     The foreign transaction identified by <literal>frstate->fdwxact_id</literal>
+     might not exist on the foreign servers. This can happen when, for instance,
+     <productname>PostgreSQL</productname> server crashed during preparing or
+     committing the foreign transaction. Therefore, this function needs to
+     tolerate the undefined object error
+     (<literal>ERRCODE_UNDEFINED_OBJECT</literal>) rather than raising an error.
+    </para>
+
+    <para>
+     Note that, except when called via the <function>pg_resolve_foreign_xact</function>
+     SQL function, this callback function is executed by foreign transaction
+     resolver processes.
+    </para>
+    <para>
+<programlisting>
+bool
+RollbackForeignTransaction(FdwXactRslvState *frstate);
+</programlisting>
+    Rolls back the foreign transaction. This function is called at the end of
+    the local transaction, after the transaction has been rolled back locally.
+    The foreign transactions are rolled back when the user requests a rollback
+    or when any error occurs during the transaction. This function must
+    tolerate being called recursively if an error occurs while rolling back the
+    foreign transaction, so you need to track recursion and prevent infinite
+    recursion. If <literal>frstate->flags</literal> includes the flag
+    <literal>FDW_XACT_FLAG_ONEPHASE</literal>, the transaction can be rolled
+    back in one phase; otherwise this function must roll back the prepared
+    transaction identified by <literal>frstate->fdwxact_id</literal>.
+    </para>
+
+    <para>
+     The foreign transaction identified by <literal>frstate->fdwxact_id</literal>
+     might not exist on the foreign servers. This can happen when, for instance,
+     <productname>PostgreSQL</productname> server crashed during preparing or
+     committing the foreign transaction. Therefore, this function needs to
+     tolerate the undefined object error
+     (<literal>ERRCODE_UNDEFINED_OBJECT</literal>) rather than raising an error.
+    </para>
+
+    <para>
+     Note that, except when called via the <function>pg_resolve_foreign_xact</function>
+     SQL function, this callback function is executed by foreign transaction
+     resolver processes.
+    </para>
+    <para>
+<programlisting>
+char *
+GetPrepareId(TransactionId xid, Oid serverid, Oid userid, int *prep_id_len);
+</programlisting>
+    Returns a null-terminated string that represents the prepared transaction
+    identifier, setting its length in <varname>*prep_id_len</varname>.
+    This optional function is called during executor startup, once per
+    foreign server. Note that the transaction identifier must be a string
+    literal, less than <symbol>NAMEDATALEN</symbol> bytes long, and must not be
+    the same as any other concurrent prepared transaction identifier. If this
+    callback routine is not provided, <productname>PostgreSQL</productname>'s
+    distributed transaction manager generates a unique identifier of the form
+    <literal>fx_&lt;random value up to 2<superscript>31</superscript>&gt;_&lt;server oid&gt;_&lt;user oid&gt;</literal>.
+    </para>
+
+    <para>
+     Note that this callback function is always executed by backend processes.
+    </para>
+
+    <note>
+     <para>
+      The functions <function>PrepareForeignTransaction</function>,
+      <function>CommitForeignTransaction</function> and
+      <function>RollbackForeignTransaction</function> are called
+      outside of a valid transaction state. Note that you therefore
+      cannot use functions that rely on the system catalog cache,
+      such as the Foreign Data Wrapper helper functions described in
+      <xref linkend="fdw-helpers"/>.
+     </para>
+    </note>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
@@ -1894,4 +2015,119 @@ GetForeignServerByName(const char *name, bool missing_ok);
 
   </sect1>
 
+  <sect1 id="fdw-transaction-managements">
+   <title>Transaction Management for Foreign Data Wrappers</title>
+   <para>
+    If an FDW's remote server supports transactions, it is usually worthwhile
+    for the FDW to manage the transactions opened on the foreign server. The FDW
+    callback functions <literal>CommitForeignTransaction</literal>,
+    <literal>RollbackForeignTransaction</literal> and
+    <literal>PrepareForeignTransaction</literal> are used for transaction
+    management and must fit into the workings of
+    <productname>PostgreSQL</productname> transaction processing.
+   </para>
+
+   <para>
+    The information in <literal>FdwXactRslvState</literal> can be used to get
+    information about the foreign server being processed, such as the server
+    name and the OIDs of the server, user and user mapping. The
+    <literal>flags</literal> field contains flag bits describing the state of
+    the foreign transaction for transaction management.
+   </para>
+
+   <sect2 id="fdw-transaction-commit-rollback">
+    <title>Commit And Rollback Single Foreign Transaction</title>
+    <para>
+     The FDW callback functions <literal>CommitForeignTransaction</literal>
+     and <literal>RollbackForeignTransaction</literal> can be used to commit
+     and roll back the foreign transaction. The core transaction manager calls
+     the <literal>CommitForeignTransaction</literal> function in the
+     pre-commit phase and the <literal>RollbackForeignTransaction</literal>
+     function in the post-rollback phase.
+    </para>
+   </sect2>
+
+   <sect2 id="fdw-transaction-distributed-transaction-commit">
+    <title>Atomic Commit And Rollback Distributed Transaction</title>
+    <para>
+     In addition to simply committing and rolling back foreign transactions as
+     described in <xref linkend="fdw-transaction-commit-rollback"/>, the
+     <productname>PostgreSQL</productname> global transaction manager enables
+     distributed transactions to commit or roll back atomically across all
+     foreign servers, which is known in the literature as atomic commit. To
+     achieve atomic commit, <productname>PostgreSQL</productname> employs the
+     two-phase commit protocol, which is a type of atomic commitment protocol.
+     Every FDW that wishes to support the two-phase commit protocol is required
+     to provide the FDW callback function
+     <function>PrepareForeignTransaction</function>, and optionally
+     <function>GetPrepareId</function>, in addition to
+     <function>CommitForeignTransaction</function> and
+     <function>RollbackForeignTransaction</function>
+     (see <xref linkend="fdw-callbacks-transaction-managements"/> for details).
+    </para>
+
+    <para>
+     An example of a distributed transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    <structname>ft1</structname> and <structname>ft2</structname> are foreign
+    tables on different foreign servers, possibly using different
+    Foreign Data Wrappers.
+    </para>
+
+    <para>
+     When the core executor accesses a foreign server, a foreign server whose
+     FDW supports the transaction management callback routines is registered as
+     a participant. During registration, <function>GetPrepareId</function>, if
+     provided, is called to generate a unique transaction identifier.
+    </para>
+
+    <para>
+     During the pre-commit phase of the local transaction, the foreign
+     transaction manager persists the foreign transaction information to disk
+     and WAL, and then prepares all foreign transactions by calling
+     <function>PrepareForeignTransaction</function> if the two-phase commit
+     protocol is required. Two-phase commit is required when the transaction has
+     modified data on more than one server, including the local server itself,
+     and the user has requested foreign two-phase commit
+     (see <xref linkend="guc-foreign-twophase-commit"/>).
+    </para>
+
+    <para>
+     <productname>PostgreSQL</productname> can commit locally and go to the next
+     step if and only if all foreign transactions are prepared successfully.
+     If any failure happens, or the user requests cancellation during
+     preparation, the distributed transaction manager switches over to rollback
+     and calls <function>RollbackForeignTransaction</function>.
+    </para>
+
+    <para>
+     Note that when <literal>(frstate->flags &amp; FDWXACT_FLAG_ONEPHASE)</literal>
+     is true, both the <literal>CommitForeignTransaction</literal> function and
+     the <literal>RollbackForeignTransaction</literal> function should commit
+     and roll back directly, rather than processing prepared transactions. This
+     can happen when two-phase commit is not required or the foreign server was
+     not modified within the transaction.
+    </para>
+
+    <para>
+     Once all foreign transactions are prepared, the core transaction manager
+     commits locally. After that, the transaction commit waits for all prepared
+     foreign transactions to be resolved before completing.
+    </para>
+
+    <para>
+     One foreign transaction resolver process is responsible for foreign
+     transaction resolution on a database. The foreign transaction resolver
+     process calls either <function>CommitForeignTransaction</function> or
+     <function>RollbackForeignTransaction</function> to resolve the foreign
+     transaction identified by <literal>frstate->fdwxact_id</literal>. If it
+     fails to resolve the transaction, the resolver process exits with an error
+     message. The foreign transaction launcher then launches the resolver
+     process again at the
+     <xref linkend="guc-foreign-transaction-resolution-rety-interval"/> interval.
+    </para>
+   </sect2>
+  </sect1>
  </chapter>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 3da2365ea9..80a87fa5d1 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -48,6 +48,7 @@
 <!ENTITY wal           SYSTEM "wal.sgml">
 <!ENTITY logical-replication    SYSTEM "logical-replication.sgml">
 <!ENTITY jit    SYSTEM "jit.sgml">
+<!ENTITY distributed-transaction    SYSTEM "distributed-transaction.sgml">
 
 <!-- programmer's guide -->
 <!ENTITY bgworker   SYSTEM "bgworker.sgml">
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 57a1539506..b9a918b9ee 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -22355,6 +22355,95 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-foreign-transaction">
+   <title>Foreign Transaction Management Functions</title>
+
+   <indexterm>
+    <primary>pg_resolve_foreign_xact</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_remove_foreign_xact</primary>
+   </indexterm>
+
+   <para>
+    <xref linkend="functions-fdw-transaction-control-table"/> shows the functions
+    available for foreign transaction management.
+    These functions cannot be executed during recovery. Use of these functions
+    is restricted to superusers.
+   </para>
+
+   <table id="functions-fdw-transaction-control-table">
+    <title>Foreign Transaction Management Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_resolve_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>
+        Resolves a foreign transaction. This function searches for the foreign
+        transaction matching the arguments and resolves it. Once the foreign
+        transaction is resolved successfully, this function removes the
+        corresponding entry from <xref linkend="view-pg-foreign-xacts"/>.
+        This function won't resolve a foreign transaction that is currently
+        being processed.
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>void</type></entry>
+       <entry>
+        This function works the same as <function>pg_resolve_foreign_xact</function>
+        except that it removes the foreign transaction entry without resolution.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
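As a usage sketch (the OIDs below are made up, and the column names of the foreign transaction view are assumed from this patch), an in-doubt transaction would be located and resolved like this:

```sql
-- List in-doubt foreign transactions (superuser only).
SELECT transaction, serverid, userid, identifier
  FROM pg_foreign_xacts
 WHERE in_doubt;

-- Resolve one entry by its keys; returns true on success.
SELECT pg_resolve_foreign_xact('12345'::xid, 16394, 10);
```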
+
+   <para>
+   The function shown in <xref linkend="functions-fdwxact-resolver-control-table"/>
+   controls the foreign transaction resolvers.
+   </para>
+
+   <table id="functions-fdwxact-resolver-control-table">
+    <title>Foreign Transaction Resolver Control Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row>
+       <entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_stop_fdwxact_resolver(<parameter>dbid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>
+        Stop the foreign transaction resolver running on the given database.
+        This function is useful for stopping a resolver process on the database
+        that you want to drop.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <para>
+    <function>pg_stop_fdwxact_resolver</function> is useful for stopping the
+    resolver before dropping the database to which it is connected.
+   </para>
+
+  </sect2>
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a3c5f86b7e..65938e81ca 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -368,6 +368,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_foreign_xact</structname><indexterm><primary>pg_stat_foreign_xact</primary></indexterm></entry>
+      <entry>One row per foreign transaction resolver process, showing statistics about
+       foreign transaction resolution. See <xref linkend="pg-stat-foreign-xact-view"/> for
+       details.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -1236,6 +1244,18 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>CheckpointerMain</literal></entry>
          <entry>Waiting in main loop of checkpointer process.</entry>
         </row>
+        <row>
+         <entry><literal>FdwXactLauncherMain</literal></entry>
+         <entry>Waiting in main loop of foreign transaction resolution launcher process.</entry>
+        </row>
+        <row>
+         <entry><literal>FdwXactResolverMain</literal></entry>
+         <entry>Waiting in main loop of foreign transaction resolution worker process.</entry>
+        </row>
+        <row>
+         <entry><literal>LogicalLauncherMain</literal></entry>
+         <entry>Waiting in main loop of logical launcher process.</entry>
+        </row>
         <row>
          <entry><literal>LogicalApplyMain</literal></entry>
          <entry>Waiting in main loop of logical apply process.</entry>
@@ -1459,6 +1479,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>SyncRep</literal></entry>
          <entry>Waiting for confirmation from remote server during synchronous replication.</entry>
         </row>
+        <row>
+         <entry><literal>FdwXactResolution</literal></entry>
+         <entry>Waiting for all foreign transaction participants to be resolved during atomic commit among foreign servers.</entry>
+        </row>
         <row>
          <entry morerows="2"><literal>Timeout</literal></entry>
          <entry><literal>BaseBackupThrottle</literal></entry>
@@ -2359,6 +2383,42 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connection.
   </para>
 
+  <table id="pg-stat-foreign-xact-view" xreflabel="pg_stat_foreign_xact">
+   <title><structname>pg_stat_foreign_xact</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>pid</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Process ID of a foreign transaction resolver process</entry>
+    </row>
+    <row>
+     <entry><structfield>dbid</structfield></entry>
+     <entry><type>oid</type></entry>
+     <entry>OID of the database to which the foreign transaction resolver is connected</entry>
+    </row>
+    <row>
+     <entry><structfield>last_resolved_time</structfield></entry>
+     <entry><type>timestamp with time zone</type></entry>
+     <entry>Time at which the process last resolved a foreign transaction</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_foreign_xact</structname> view will contain one
+   row per foreign transaction resolver process, showing the state of
+   foreign transaction resolution.
 
   <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver">
    <title><structname>pg_stat_archiver</structname> View</title>
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index e59cba7997..dee3f72f7e 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -163,6 +163,7 @@
   &wal;
   &logical-replication;
   &jit;
+  &distributed-transaction;
   ®ress;
 
  </part>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 1c19e863d2..3f4c806ed1 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -83,6 +83,12 @@ Item
   subsystem</entry>
 </row>
 
+<row>
+ <entry><filename>pg_fdwxact</filename></entry>
+ <entry>Subdirectory containing files used by the distributed transaction
+  manager subsystem</entry>
+</row>
+
 <row>
  <entry><filename>pg_logical</filename></entry>
  <entry>Subdirectory containing status data for logical decoding</entry>
-- 
2.23.0

From 84f81fdcb2bd823e34edba79c81c29871d7906fb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 17:01:15 +0900
Subject: [PATCH v26 4/5] postgres_fdw supports atomic commit APIs.

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 contrib/postgres_fdw/Makefile                 |   7 +-
 contrib/postgres_fdw/connection.c             | 604 +++++++++++-------
 .../postgres_fdw/expected/postgres_fdw.out    | 265 +++++++-
 contrib/postgres_fdw/fdwxact.conf             |   3 +
 contrib/postgres_fdw/postgres_fdw.c           |  21 +-
 contrib/postgres_fdw/postgres_fdw.h           |   7 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql     | 120 +++-
 doc/src/sgml/postgres-fdw.sgml                |  45 ++
 8 files changed, 822 insertions(+), 250 deletions(-)
 create mode 100644 contrib/postgres_fdw/fdwxact.conf

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index ee8a80a392..91fa6e39fc 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -16,7 +16,7 @@ SHLIB_LINK_INTERNAL = $(libpq)
 EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
-REGRESS = postgres_fdw
+REGRESSCHECK = postgres_fdw
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -29,3 +29,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+check:
+    $(pg_regress_check) \
+        --temp-config $(top_srcdir)/contrib/postgres_fdw/fdwxact.conf \
+        $(REGRESSCHECK)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 27b86a03f8..0b07e6c5cc 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * connection.c
- *          Connection management functions for postgres_fdw
+ *          Connection and transaction management functions for postgres_fdw
  *
  * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group
  *
@@ -12,6 +12,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "catalog/pg_user_mapping.h"
@@ -54,6 +55,7 @@ typedef struct ConnCacheEntry
     bool        have_error;        /* have any subxacts aborted in this xact? */
     bool        changing_xact_state;    /* xact state change in process */
     bool        invalidated;    /* true if reconnect is pending */
+    bool        xact_got_connection;
     uint32        server_hashvalue;    /* hash value of foreign server OID */
     uint32        mapping_hashvalue;    /* hash value of user mapping OID */
 } ConnCacheEntry;
@@ -67,17 +69,13 @@ static HTAB *ConnectionHash = NULL;
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
-/* tracks whether any work is needed in callback functions */
-static bool xact_got_connection = false;
-
 /* prototypes of private functions */
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values, UserMapping *user);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
-static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_subxact_callback(SubXactEvent event,
                                    SubTransactionId mySubid,
                                    SubTransactionId parentSubid,
@@ -89,24 +87,26 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
                                      bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
                                      PGresult **result);
-
+static void pgfdw_end_prepared_xact(ConnCacheEntry *entry, char *fdwxact_id,
+                                    bool is_commit);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
+static ConnCacheEntry *GetConnectionState(Oid umid, bool will_prep_stmt,
+                                          bool start_transaction);
+static ConnCacheEntry *GetConnectionCacheEntry(Oid umid);
 
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
+ * Get the connection cache entry.  Unlike GetConnectionState, this function
+ * doesn't establish a new connection if there isn't one yet.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
 {
-    bool        found;
     ConnCacheEntry *entry;
-    ConnCacheKey key;
+    ConnCacheKey    key;
+    bool            found;
+
+    /* Create hash key for the entry.  Assume no pad bytes in key struct */
+    key = umid;
 
     /* First time through, initialize connection cache hashtable */
     if (ConnectionHash == NULL)
@@ -126,7 +126,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
          * Register some callback functions that manage connection cleanup.
          * This should be done just once in each backend.
          */
-        RegisterXactCallback(pgfdw_xact_callback, NULL);
         RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
         CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
                                       pgfdw_inval_callback, (Datum) 0);
@@ -134,12 +133,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
                                       pgfdw_inval_callback, (Datum) 0);
     }
 
-    /* Set flag that we did GetConnection during the current transaction */
-    xact_got_connection = true;
-
-    /* Create hash key for the entry.  Assume no pad bytes in key struct */
-    key = user->umid;
-
     /*
      * Find or create cached entry for requested connection.
      */
@@ -153,6 +146,21 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
         entry->conn = NULL;
     }
 
+    return entry;
+}
+
+/*
+ * This function gets the connection cache entry and establishes connection
+ * to the foreign server if there is no connection and starts a new transaction
+ * if 'start_transaction' is true.
+ */
+static ConnCacheEntry *
+GetConnectionState(Oid umid, bool will_prep_stmt, bool start_transaction)
+{
+    ConnCacheEntry *entry;
+
+    entry = GetConnectionCacheEntry(umid);
+
     /* Reject further use of connections which failed abort cleanup. */
     pgfdw_reject_incomplete_xact_state_change(entry);
 
@@ -180,6 +188,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
      */
     if (entry->conn == NULL)
     {
+        UserMapping    *user = GetUserMappingByOid(umid);
         ForeignServer *server = GetForeignServer(user->serverid);
 
         /* Reset all transient state fields, to be sure all are clean */
@@ -188,6 +197,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
         entry->have_error = false;
         entry->changing_xact_state = false;
         entry->invalidated = false;
+        entry->xact_got_connection = false;
         entry->server_hashvalue =
             GetSysCacheHashValue1(FOREIGNSERVEROID,
                                   ObjectIdGetDatum(server->serverid));
@@ -198,6 +208,15 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
         /* Now try to make the connection */
         entry->conn = connect_pg_server(server, user);
 
+        if (!entry->conn)
+        {
+            elog(DEBUG3, "connection attempt to server \"%s\" by postgres_fdw failed",
+                 server->servername);
+            return NULL;
+        }
+
         elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
              entry->conn, server->servername, user->umid, user->userid);
     }
@@ -205,11 +224,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
     /*
      * Start a new transaction or subtransaction if needed.
      */
-    begin_remote_xact(entry);
+    if (start_transaction)
+    {
+        UserMapping        *user = GetUserMappingByOid(umid);
+
+        begin_remote_xact(entry, user->serverid, user->userid);
+
+        /* Set flag that we did GetConnection during the current transaction */
+        entry->xact_got_connection = true;
+    }
 
     /* Remember if caller will prepare statements */
     entry->have_prep_stmt |= will_prep_stmt;
 
+    return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(Oid umid, bool will_prep_stmt, bool start_transaction)
+{
+    ConnCacheEntry *entry;
+
+    entry = GetConnectionState(umid, will_prep_stmt, start_transaction);
+
     return entry->conn;
 }
 
@@ -412,7 +459,7 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
     int            curlevel = GetCurrentTransactionNestLevel();
 
@@ -639,193 +686,6 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
     PG_END_TRY();
 }
 
-/*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
- */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
-{
-    HASH_SEQ_STATUS scan;
-    ConnCacheEntry *entry;
-
-    /* Quick exit if no connections were touched in this transaction. */
-    if (!xact_got_connection)
-        return;
-
-    /*
-     * Scan all connection cache entries to find open remote transactions, and
-     * close them.
-     */
-    hash_seq_init(&scan, ConnectionHash);
-    while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
-    {
-        PGresult   *res;
-
-        /* Ignore cache entry if no open connection right now */
-        if (entry->conn == NULL)
-            continue;
-
-        /* If it has an open remote transaction, try to close it */
-        if (entry->xact_depth > 0)
-        {
-            bool        abort_cleanup_failure = false;
-
-            elog(DEBUG3, "closing remote transaction on connection %p",
-                 entry->conn);
-
-            switch (event)
-            {
-                case XACT_EVENT_PARALLEL_PRE_COMMIT:
-                case XACT_EVENT_PRE_COMMIT:
-
-                    /*
-                     * If abort cleanup previously failed for this connection,
-                     * we can't issue any more commands against it.
-                     */
-                    pgfdw_reject_incomplete_xact_state_change(entry);
-
-                    /* Commit all remote transactions during pre-commit */
-                    entry->changing_xact_state = true;
-                    do_sql_command(entry->conn, "COMMIT TRANSACTION");
-                    entry->changing_xact_state = false;
-
-                    /*
-                     * If there were any errors in subtransactions, and we
-                     * made prepared statements, do a DEALLOCATE ALL to make
-                     * sure we get rid of all prepared statements. This is
-                     * annoying and not terribly bulletproof, but it's
-                     * probably not worth trying harder.
-                     *
-                     * DEALLOCATE ALL only exists in 8.3 and later, so this
-                     * constrains how old a server postgres_fdw can
-                     * communicate with.  We intentionally ignore errors in
-                     * the DEALLOCATE, so that we can hobble along to some
-                     * extent with older servers (leaking prepared statements
-                     * as we go; but we don't really support update operations
-                     * pre-8.3 anyway).
-                     */
-                    if (entry->have_prep_stmt && entry->have_error)
-                    {
-                        res = PQexec(entry->conn, "DEALLOCATE ALL");
-                        PQclear(res);
-                    }
-                    entry->have_prep_stmt = false;
-                    entry->have_error = false;
-                    break;
-                case XACT_EVENT_PRE_PREPARE:
-
-                    /*
-                     * We disallow any remote transactions, since it's not
-                     * very reasonable to hold them open until the prepared
-                     * transaction is committed.  For the moment, throw error
-                     * unconditionally; later we might allow read-only cases.
-                     * Note that the error will cause us to come right back
-                     * here with event == XACT_EVENT_ABORT, so we'll clean up
-                     * the connection state at that point.
-                     */
-                    ereport(ERROR,
-                            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                             errmsg("cannot PREPARE a transaction that has operated on postgres_fdw foreign
tables")));
-                    break;
-                case XACT_EVENT_PARALLEL_COMMIT:
-                case XACT_EVENT_COMMIT:
-                case XACT_EVENT_PREPARE:
-                    /* Pre-commit should have closed the open transaction */
-                    elog(ERROR, "missed cleaning up connection during pre-commit");
-                    break;
-                case XACT_EVENT_PARALLEL_ABORT:
-                case XACT_EVENT_ABORT:
-
-                    /*
-                     * Don't try to clean up the connection if we're already
-                     * in error recursion trouble.
-                     */
-                    if (in_error_recursion_trouble())
-                        entry->changing_xact_state = true;
-
-                    /*
-                     * If connection is already unsalvageable, don't touch it
-                     * further.
-                     */
-                    if (entry->changing_xact_state)
-                        break;
-
-                    /*
-                     * Mark this connection as in the process of changing
-                     * transaction state.
-                     */
-                    entry->changing_xact_state = true;
-
-                    /* Assume we might have lost track of prepared statements */
-                    entry->have_error = true;
-
-                    /*
-                     * If a command has been submitted to the remote server by
-                     * using an asynchronous execution function, the command
-                     * might not have yet completed.  Check to see if a
-                     * command is still being processed by the remote server,
-                     * and if so, request cancellation of the command.
-                     */
-                    if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
-                        !pgfdw_cancel_query(entry->conn))
-                    {
-                        /* Unable to cancel running query. */
-                        abort_cleanup_failure = true;
-                    }
-                    else if (!pgfdw_exec_cleanup_query(entry->conn,
-                                                       "ABORT TRANSACTION",
-                                                       false))
-                    {
-                        /* Unable to abort remote transaction. */
-                        abort_cleanup_failure = true;
-                    }
-                    else if (entry->have_prep_stmt && entry->have_error &&
-                             !pgfdw_exec_cleanup_query(entry->conn,
-                                                       "DEALLOCATE ALL",
-                                                       true))
-                    {
-                        /* Trouble clearing prepared statements. */
-                        abort_cleanup_failure = true;
-                    }
-                    else
-                    {
-                        entry->have_prep_stmt = false;
-                        entry->have_error = false;
-                    }
-
-                    /* Disarm changing_xact_state if it all worked. */
-                    entry->changing_xact_state = abort_cleanup_failure;
-                    break;
-            }
-        }
-
-        /* Reset state to show we're out of a transaction */
-        entry->xact_depth = 0;
-
-        /*
-         * If the connection isn't in a good idle state, discard it to
-         * recover. Next GetConnection will open a new connection.
-         */
-        if (PQstatus(entry->conn) != CONNECTION_OK ||
-            PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-            entry->changing_xact_state)
-        {
-            elog(DEBUG3, "discarding connection %p", entry->conn);
-            disconnect_pg_server(entry);
-        }
-    }
-
-    /*
-     * Regardless of the event type, we can now mark ourselves as out of the
-     * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-     * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-     */
-    xact_got_connection = false;
-
-    /* Also reset cursor numbering for next transaction */
-    cursor_number = 0;
-}
-
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
@@ -842,10 +702,6 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
           event == SUBXACT_EVENT_ABORT_SUB))
         return;
 
-    /* Quick exit if no connections were touched in this transaction. */
-    if (!xact_got_connection)
-        return;
-
     /*
      * Scan all connection cache entries to find open remote subtransactions
      * of the current level, and close them.
@@ -856,6 +712,10 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
     {
         char        sql[100];
 
+        /* Quick exit if no connections were touched in this transaction. */
+        if (!entry->xact_got_connection)
+            continue;
+
         /*
          * We only care about connections with open remote subtransactions of
          * the current level.
@@ -1190,3 +1050,309 @@ exit:    ;
         *result = last_res;
     return timed_out;
 }
+
+/*
+ * Prepare a transaction on foreign server.
+ */
+void
+postgresPrepareForeignTransaction(FdwXactRslvState *state)
+{
+    ConnCacheEntry *entry = NULL;
+    PGresult    *res;
+    StringInfo    command;
+
+    /* Get the connection cache entry; the transaction should have started already */
+    entry = GetConnectionCacheEntry(state->usermapping->umid);
+
+    /* The transaction should have been started */
+    Assert(entry->xact_got_connection && entry->conn);
+
+    pgfdw_reject_incomplete_xact_state_change(entry);
+
+    command = makeStringInfo();
+    appendStringInfo(command, "PREPARE TRANSACTION '%s'", state->fdwxact_id);
+
+    /* Prepare the foreign transaction */
+    entry->changing_xact_state = true;
+    res = pgfdw_exec_query(entry->conn, command->data);
+    entry->changing_xact_state = false;
+
+    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+        ereport(ERROR, (errmsg("could not prepare transaction on server %s with ID %s",
+                               state->server->servername, state->fdwxact_id)));
+
+    elog(DEBUG1, "prepared foreign transaction on server %s with ID %s",
+         state->server->servername, state->fdwxact_id);
+
+    if (entry->have_prep_stmt && entry->have_error)
+    {
+        res = PQexec(entry->conn, "DEALLOCATE ALL");
+        PQclear(res);
+    }
+
+    pgfdw_cleanup_after_transaction(entry);
+}
+
+/*
+ * Commit a transaction or a prepared transaction on the foreign server.  If
+ * state->flags contains FDWXACT_FLAG_ONEPHASE this function commits the
+ * foreign transaction without preparation; otherwise it commits the prepared
+ * transaction.
+ */
+void
+postgresCommitForeignTransaction(FdwXactRslvState *state)
+{
+    ConnCacheEntry *entry = NULL;
+    bool            is_onephase = (state->flags & FDWXACT_FLAG_ONEPHASE) != 0;
+    PGresult        *res;
+
+    if (!is_onephase)
+    {
+        /*
+         * In the two-phase commit case, the foreign transaction has been
+         * prepared and closed, so we might not have a connection to it.
+         * We get a connection but don't start a new transaction.
+         */
+        entry = GetConnectionState(state->usermapping->umid, false, false);
+
+        /* COMMIT PREPARED the transaction */
+        pgfdw_end_prepared_xact(entry, state->fdwxact_id, true);
+        return;
+    }
+
+    /*
+     * In the one-phase commit case, we must have a connection to the foreign
+     * server because the foreign transaction is not closed yet.  We get the
+     * connection entry from the cache.
+     */
+    entry = GetConnectionCacheEntry(state->usermapping->umid);
+    Assert(entry);
+
+    if (!entry->conn || !entry->xact_got_connection)
+        return;
+
+    /*
+     * If abort cleanup previously failed for this connection, we can't issue
+     * any more commands against it.
+     */
+    pgfdw_reject_incomplete_xact_state_change(entry);
+
+    entry->changing_xact_state = true;
+    res = pgfdw_exec_query(entry->conn, "COMMIT TRANSACTION");
+    entry->changing_xact_state = false;
+
+    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+        ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                               state->server->servername)));
+
+    /*
+     * If there were any errors in subtransactions, and we made prepared
+     * statements, do a DEALLOCATE ALL to make sure we get rid of all
+     * prepared statements.  This is annoying and not terribly bulletproof,
+     * but it's probably not worth trying harder.
+     *
+     * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+     * old a server postgres_fdw can communicate with.  We intentionally
+     * ignore errors in the DEALLOCATE, so that we can hobble along to some
+     * extent with older servers (leaking prepared statements as we go; but
+     * we don't really support update operations pre-8.3 anyway).
+     */
+    if (entry->have_prep_stmt && entry->have_error)
+    {
+        res = PQexec(entry->conn, "DEALLOCATE ALL");
+        PQclear(res);
+    }
+
+    /* Cleanup transaction status */
+    pgfdw_cleanup_after_transaction(entry);
+}
+
+/*
+ * Rollback a transaction on the foreign server.  As with the commit case, if
+ * state->flags contains FDWXACT_FLAG_ONEPHASE this function can rollback the
+ * foreign transaction without preparation; otherwise it rolls back the
+ * prepared transaction.  This function must tolerate being called
+ * recursively, as an error can happen during aborting.
+ */
+void
+postgresRollbackForeignTransaction(FdwXactRslvState *state)
+{
+    bool            is_onephase = (state->flags & FDWXACT_FLAG_ONEPHASE) != 0;
+    ConnCacheEntry *entry = NULL;
+    bool abort_cleanup_failure = false;
+
+    if (!is_onephase)
+    {
+        /*
+         * In the two-phase commit case, the foreign transaction has been
+         * prepared and closed, so we might not have a connection to it.
+         * We get a connection but don't start a new transaction.
+         */
+        entry = GetConnectionState(state->usermapping->umid, false, false);
+
+        /* ROLLBACK PREPARED the transaction */
+        pgfdw_end_prepared_xact(entry, state->fdwxact_id, false);
+        return;
+    }
+
+    /*
+     * In the one-phase rollback case, we must have a connection to the
+     * foreign server because the foreign transaction is not closed yet.
+     * We get the connection entry from the cache.
+     */
+    entry = GetConnectionCacheEntry(state->usermapping->umid);
+    Assert(entry);
+
+    /*
+     * Clean up the connection entry if the transaction failed before
+     * establishing a connection or starting a transaction.
+     */
+    if (!entry->conn || !entry->xact_got_connection)
+    {
+        pgfdw_cleanup_after_transaction(entry);
+        return;
+    }
+
+    /*
+     * Don't try to clean up the connection if we're already
+     * in error recursion trouble.
+     */
+    if (in_error_recursion_trouble())
+        entry->changing_xact_state = true;
+
+    /*
+     * If the connection hasn't started a transaction yet or is already
+     * unsalvageable, do only the cleanup and don't touch it further.
+     */
+    if (entry->changing_xact_state || !entry->xact_got_connection)
+    {
+        pgfdw_cleanup_after_transaction(entry);
+        return;
+    }
+
+    /*
+     * Mark this connection as in the process of changing
+     * transaction state.
+     */
+    entry->changing_xact_state = true;
+
+    /* Assume we might have lost track of prepared statements */
+    entry->have_error = true;
+
+    /*
+     * If a command has been submitted to the remote server by
+     * using an asynchronous execution function, the command
+     * might not have yet completed.  Check to see if a
+     * command is still being processed by the remote server,
+     * and if so, request cancellation of the command.
+     */
+    if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
+        !pgfdw_cancel_query(entry->conn))
+    {
+        /* Unable to cancel running query. */
+        abort_cleanup_failure = true;
+    }
+    else if (!pgfdw_exec_cleanup_query(entry->conn,
+                                       "ABORT TRANSACTION",
+                                       false))
+    {
+        /* Unable to abort remote transaction. */
+        abort_cleanup_failure = true;
+    }
+    else if (entry->have_prep_stmt && entry->have_error &&
+             !pgfdw_exec_cleanup_query(entry->conn,
+                                       "DEALLOCATE ALL",
+                                       true))
+    {
+        /* Trouble clearing prepared statements. */
+        abort_cleanup_failure = true;
+    }
+
+    /* Disarm changing_xact_state if it all worked. */
+    entry->changing_xact_state = abort_cleanup_failure;
+
+    /* Cleanup transaction status */
+    pgfdw_cleanup_after_transaction(entry);
+
+    return;
+}
+
+/*
+ * Commit or rollback prepared transaction on the foreign server.
+ */
+static void
+pgfdw_end_prepared_xact(ConnCacheEntry *entry, char *fdwxact_id, bool is_commit)
+{
+    StringInfo    command;
+    PGresult    *res;
+
+    command = makeStringInfo();
+    appendStringInfo(command, "%s PREPARED '%s'",
+                     is_commit ? "COMMIT" : "ROLLBACK",
+                     fdwxact_id);
+
+    res = pgfdw_exec_query(entry->conn, command->data);
+
+    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+    {
+        int        sqlstate;
+        char    *diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+
+        if (diag_sqlstate)
+        {
+            sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+                                     diag_sqlstate[1],
+                                     diag_sqlstate[2],
+                                     diag_sqlstate[3],
+                                     diag_sqlstate[4]);
+        }
+        else
+            sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+        /*
+         * As the core global transaction manager notes, it's possible that
+         * the given foreign transaction doesn't exist on the foreign server,
+         * so we should accept an UNDEFINED_OBJECT error.
+         */
+        if (sqlstate != ERRCODE_UNDEFINED_OBJECT)
+            pgfdw_report_error(ERROR, res, entry->conn, false, command->data);
+    }
+
+    elog(DEBUG1, "%s prepared foreign transaction with ID %s",
+         is_commit ? "commit" : "rollback",
+         fdwxact_id);
+
+    /* Cleanup transaction status */
+    pgfdw_cleanup_after_transaction(entry);
+}
+
+/* Cleanup at main-transaction end */
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+    /* Reset state to show we're out of a transaction */
+    entry->xact_depth = 0;
+    entry->have_prep_stmt = false;
+    entry->have_error = false;
+    entry->xact_got_connection = false;
+
+    /*
+     * If the connection isn't in a good idle state, discard it to
+     * recover. Next GetConnection will open a new connection.
+     */
+    if (PQstatus(entry->conn) != CONNECTION_OK ||
+        PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+        entry->changing_xact_state)
+    {
+        elog(DEBUG3, "discarding connection %p", entry->conn);
+        disconnect_pg_server(entry);
+    }
+
+    entry->changing_xact_state = false;
+
+    /* Also reset cursor numbering for next transaction */
+    cursor_number = 0;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 48282ab151..0ee91a49ac 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
     OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,13 @@ CREATE TABLE "S 1"."T 4" (
     c3 text,
     CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 -- Disable autovacuum for these tables to avoid unexpected effects of that
 ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false');
 ALTER TABLE "S 1"."T 2" SET (autovacuum_enabled = 'false');
@@ -87,6 +99,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -129,6 +142,12 @@ CREATE FOREIGN TABLE ft6 (
     c2 int NOT NULL,
     c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7_2pc (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8_2pc (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- ===================================================================
 -- tests for validator
 -- ===================================================================
@@ -179,15 +198,17 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 \det+
-                              List of foreign tables
- Schema | Table |  Server   |              FDW options              | Description 
---------+-------+-----------+---------------------------------------+-------------
- public | ft1   | loopback  | (schema_name 'S 1', table_name 'T 1') | 
- public | ft2   | loopback  | (schema_name 'S 1', table_name 'T 1') | 
- public | ft4   | loopback  | (schema_name 'S 1', table_name 'T 3') | 
- public | ft5   | loopback  | (schema_name 'S 1', table_name 'T 4') | 
- public | ft6   | loopback2 | (schema_name 'S 1', table_name 'T 4') | 
-(5 rows)
+                               List of foreign tables
+ Schema |  Table  |  Server   |              FDW options              | Description 
+--------+---------+-----------+---------------------------------------+-------------
+ public | ft1     | loopback  | (schema_name 'S 1', table_name 'T 1') | 
+ public | ft2     | loopback  | (schema_name 'S 1', table_name 'T 1') | 
+ public | ft4     | loopback  | (schema_name 'S 1', table_name 'T 3') | 
+ public | ft5     | loopback  | (schema_name 'S 1', table_name 'T 4') | 
+ public | ft6     | loopback2 | (schema_name 'S 1', table_name 'T 4') | 
+ public | ft7_2pc | loopback  | (schema_name 'S 1', table_name 'T 5') | 
+ public | ft8_2pc | loopback2 | (schema_name 'S 1', table_name 'T 5') | 
+(7 rows)
 
 -- Test that alteration of server options causes reconnection
 -- Remote's errors might be non-English, so hide them to ensure stable results
@@ -8781,16 +8802,226 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
 
 -- Clean-up
 RESET enable_partitionwise_aggregate;
--- Two-phase transactions are not supported.
+
+-- ===================================================================
+-- test distributed atomic commit across foreign servers
+-- ===================================================================
+-- Enable atomic commit
+SET foreign_twophase_commit TO 'required';
+-- Modify single foreign server and then commit and rollback.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(1);
+COMMIT;
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+(1 row)
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(1);
+ROLLBACK;
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+(1 row)
+
+-- Modify two servers then commit and rollback. This requires using 2PC.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(2);
+INSERT INTO ft8_2pc VALUES(2);
+COMMIT;
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(2);
+INSERT INTO ft8_2pc VALUES(2);
+ROLLBACK;
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- Modify both local data and 2PC-capable server then commit and rollback.
+-- This also requires using 2PC.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(3);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  3
+(1 row)
+
 BEGIN;
-SELECT count(*) FROM ft1;
+INSERT INTO ft7_2pc VALUES(3);
+INSERT INTO "S 1"."T 6" VALUES (3);
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+ROLLBACK;
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  3
+(1 row)
+
+-- Modify foreign server and raise an error. No data changed.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(4);
+INSERT INTO ft8_2pc VALUES(NULL); -- violation
+ERROR:  null value in column "c1" violates not-null constraint
+DETAIL:  Failing row contains (null).
+CONTEXT:  remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+ROLLBACK;
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+(4 rows)
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES (5);
+INSERT INTO ft8_2pc VALUES (5);
+SAVEPOINT S1;
+INSERT INTO ft7_2pc VALUES (6);
+INSERT INTO ft8_2pc VALUES (6);
+ROLLBACK TO S1;
+COMMIT;
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+(6 rows)
+
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+(6 rows)
+
+RELEASE SAVEPOINT S1;
+ERROR:  RELEASE SAVEPOINT can only be used in transaction blocks
+-- When set to 'disabled', the transaction can be committed without using 2PC
+SET foreign_twophase_commit TO 'disabled';
+BEGIN;
+INSERT INTO ft7_2pc VALUES(8);
+INSERT INTO ft8_2pc VALUES(8);
+COMMIT; -- success
+SELECT * FROM ft7_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+  8
+  8
+(8 rows)
+
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+  8
+  8
+(8 rows)
+
+SET foreign_twophase_commit TO 'required';
+-- Commit and rollback foreign transactions that are part of
+-- a prepared transaction.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+  8
+  8
+  9
+  9
+(10 rows)
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+ c1 
+----
+  1
+  2
+  2
+  3
+  5
+  5
+  8
+  8
+  9
+  9
+(10 rows)
+
+-- No entries remain
+SELECT count(*) FROM pg_foreign_xacts;
  count 
 -------
-   822
+     0
 (1 row)
 
--- error here
-PREPARE TRANSACTION 'fdw_tpc';
-ERROR:  cannot PREPARE a transaction that has operated on postgres_fdw foreign tables
-ROLLBACK;
-WARNING:  there is no transaction in progress
diff --git a/contrib/postgres_fdw/fdwxact.conf b/contrib/postgres_fdw/fdwxact.conf
new file mode 100644
index 0000000000..3fdbf93cdb
--- /dev/null
+++ b/contrib/postgres_fdw/fdwxact.conf
@@ -0,0 +1,3 @@
+max_prepared_transactions = 3
+max_prepared_foreign_transactions = 3
+max_foreign_transaction_resolvers = 2
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index bdc21b36d1..9c63f0aa3b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,7 @@
 
 #include <limits.h>
 
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "access/table.h"
@@ -504,7 +505,6 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
                               const PgFdwRelationInfo *fpinfo_o,
                               const PgFdwRelationInfo *fpinfo_i);
 
-
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
  * to my callback routines.
@@ -558,6 +558,11 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
     /* Support functions for upper relation push-down */
     routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+    /* Support functions for foreign transactions */
+    routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+    routine->CommitForeignTransaction = postgresCommitForeignTransaction;
+    routine->RollbackForeignTransaction = postgresRollbackForeignTransaction;
+
     PG_RETURN_POINTER(routine);
 }
 
@@ -1434,7 +1439,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    fsstate->conn = GetConnection(user, false);
+    fsstate->conn = GetConnection(user->umid, false, true);
 
     /* Assign a unique ID for my cursor */
     fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -2372,7 +2377,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
      * Get connection to the foreign server.  Connection manager will
      * establish new connection if necessary.
      */
-    dmstate->conn = GetConnection(user, false);
+    dmstate->conn = GetConnection(user->umid, false, true);
 
     /* Update the foreign-join-related fields. */
     if (fsplan->scan.scanrelid == 0)
@@ -2746,7 +2751,7 @@ estimate_path_cost_size(PlannerInfo *root,
                                 false, &retrieved_attrs, NULL);
 
         /* Get the remote estimate */
-        conn = GetConnection(fpinfo->user, false);
+        conn = GetConnection(fpinfo->user->umid, false, true);
         get_remote_estimate(sql.data, conn, &rows, &width,
                             &startup_cost, &total_cost);
         ReleaseConnection(conn);
@@ -3566,7 +3571,7 @@ create_foreign_modify(EState *estate,
     user = GetUserMapping(userid, table->serverid);
 
     /* Open connection; report that we'll create a prepared statement. */
-    fmstate->conn = GetConnection(user, true);
+    fmstate->conn = GetConnection(user->umid, true, true);
     fmstate->p_name = NULL;        /* prepared statement not made yet */
 
     /* Set up remote query information. */
@@ -4441,7 +4446,7 @@ postgresAnalyzeForeignTable(Relation relation,
      */
     table = GetForeignTable(RelationGetRelid(relation));
     user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-    conn = GetConnection(user, false);
+    conn = GetConnection(user->umid, false, true);
 
     /*
      * Construct command to get page count for relation.
@@ -4527,7 +4532,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
     table = GetForeignTable(RelationGetRelid(relation));
     server = GetForeignServer(table->serverid);
     user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-    conn = GetConnection(user, false);
+    conn = GetConnection(user->umid, false, true);
 
     /*
      * Construct cursor that retrieves whole rows from remote.
@@ -4755,7 +4760,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
      */
     server = GetForeignServer(serverOid);
     mapping = GetUserMapping(GetUserId(), server->serverid);
-    conn = GetConnection(mapping, false);
+    conn = GetConnection(mapping->umid, false, true);
 
     /* Don't attempt to import collation if remote server hasn't got it */
     if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index ea052872c3..d7ba45c8d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdwxact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "libpq-fe.h"
@@ -129,7 +130,7 @@ extern int    set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(Oid umid, bool will_prep_stmt, bool start_transaction);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -137,6 +138,9 @@ extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
 extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
                                bool clear, const char *sql);
+extern void postgresPrepareForeignTransaction(FdwXactRslvState *state);
+extern void postgresCommitForeignTransaction(FdwXactRslvState *state);
+extern void postgresRollbackForeignTransaction(FdwXactRslvState *state);
 
 /* in option.c */
 extern int    ExtractConnectionOptions(List *defelems,
@@ -203,6 +207,7 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
                                     bool is_subquery,
                                     List **retrieved_attrs, List **params_list);
 extern const char *get_jointype_name(JoinType jointype);
+extern bool server_uses_twophase_commit(ForeignServer *server);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1c5c37b783..572077c57c 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
     OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,14 @@ CREATE TABLE "S 1"."T 4" (
     c3 text,
     CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 -- Disable autovacuum for these tables to avoid unexpected effects of that
 ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false');
@@ -94,6 +107,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -142,6 +156,15 @@ CREATE FOREIGN TABLE ft6 (
     c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7_2pc (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8_2pc (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- ===================================================================
 -- tests for validator
 -- ===================================================================
@@ -2480,9 +2503,98 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700
 -- Clean-up
 RESET enable_partitionwise_aggregate;
 
--- Two-phase transactions are not supported.
+-- ===================================================================
+-- test distributed atomic commit across foreign servers
+-- ===================================================================
+
+-- Enable atomic commit
+SET foreign_twophase_commit TO 'required';
+
+-- Modify single foreign server and then commit and rollback.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(1);
+COMMIT;
+SELECT * FROM ft7_2pc;
+
 BEGIN;
-SELECT count(*) FROM ft1;
--- error here
-PREPARE TRANSACTION 'fdw_tpc';
+INSERT INTO ft7_2pc VALUES(1);
 ROLLBACK;
+SELECT * FROM ft7_2pc;
+
+-- Modify two servers then commit and rollback. This requires using 2PC.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(2);
+INSERT INTO ft8_2pc VALUES(2);
+COMMIT;
+SELECT * FROM ft8_2pc;
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(2);
+INSERT INTO ft8_2pc VALUES(2);
+ROLLBACK;
+SELECT * FROM ft8_2pc;
+
+-- Modify both local data and 2PC-capable server then commit and rollback.
+-- This also requires using 2PC.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(3);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT * FROM ft7_2pc;
+SELECT * FROM "S 1"."T 6";
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(3);
+INSERT INTO "S 1"."T 6" VALUES (3);
+ROLLBACK;
+SELECT * FROM ft7_2pc;
+SELECT * FROM "S 1"."T 6";
+
+-- Modify foreign server and raise an error. No data changed.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(4);
+INSERT INTO ft8_2pc VALUES(NULL); -- violation
+ROLLBACK;
+SELECT * FROM ft8_2pc;
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES (5);
+INSERT INTO ft8_2pc VALUES (5);
+SAVEPOINT S1;
+INSERT INTO ft7_2pc VALUES (6);
+INSERT INTO ft8_2pc VALUES (6);
+ROLLBACK TO S1;
+COMMIT;
+SELECT * FROM ft7_2pc;
+SELECT * FROM ft8_2pc;
+RELEASE SAVEPOINT S1;
+
+-- When set to 'disabled', the transaction can be committed without using 2PC
+SET foreign_twophase_commit TO 'disabled';
+BEGIN;
+INSERT INTO ft7_2pc VALUES(8);
+INSERT INTO ft8_2pc VALUES(8);
+COMMIT; -- success
+SELECT * FROM ft7_2pc;
+SELECT * FROM ft8_2pc;
+
+SET foreign_twophase_commit TO 'required';
+
+-- Commit and rollback foreign transactions that are part of
+-- a prepared transaction.
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+
+-- No entries remain
+SELECT count(*) FROM pg_foreign_xacts;
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 1d4bafd9f0..362f7be9e3 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -441,6 +441,43 @@
    </para>
 
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers, the
+    transaction on each remote server is committed or aborted
+    independently: some remote transactions may fail to commit while
+    others commit successfully. This behavior may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when the transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and
+       <productname>PostgreSQL</productname> keeps track of the distributed
+       transaction.  <xref linkend="guc-max-prepared-foreign-transactions"/>
+       must be set to at least 1 on the local server, and
+       <xref linkend="guc-max-prepared-transactions"/> must be set to at
+       least 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
  </sect2>
 
  <sect2>
@@ -468,6 +505,14 @@
    managed by creating corresponding remote savepoints.
   </para>
 
+  <para>
+   <filename>postgres_fdw</filename> uses the two-phase commit protocol at
+   transaction commit or abort when atomic commit of a distributed
+   transaction (see <xref linkend="atomic-commit"/>) is required.  The
+   remote server must therefore set
+   <xref linkend="guc-max-prepared-transactions"/> to at least one so that
+   the remote transaction can be prepared.
+  </para>
+
   <para>
    The remote transaction uses <literal>SERIALIZABLE</literal>
    isolation level when the local transaction has <literal>SERIALIZABLE</literal>
-- 
2.23.0
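The error handling in pgfdw_end_prepared_xact above encodes the resolver's idempotency rule: a retried COMMIT PREPARED or ROLLBACK PREPARED must treat "no such prepared transaction" (SQLSTATE 42704, UNDEFINED_OBJECT) as success, because an earlier attempt may already have resolved the foreign transaction. A minimal Python sketch of that rule, where `execute` and `ResolutionError` are hypothetical stand-ins for a libpq connection rather than the patch's actual API:

```python
# SQLSTATE for "object does not exist" (ERRCODE_UNDEFINED_OBJECT)
ERRCODE_UNDEFINED_OBJECT = "42704"

class ResolutionError(Exception):
    """Hypothetical stand-in for an error reported by the remote server."""
    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate

def end_prepared_xact(execute, fdwxact_id, is_commit):
    """Resolve one prepared foreign transaction, tolerating its absence."""
    verb = "COMMIT" if is_commit else "ROLLBACK"
    try:
        execute(f"{verb} PREPARED '{fdwxact_id}'")
    except ResolutionError as e:
        # The transaction may already have been resolved by an earlier
        # attempt; only an unexpected error is fatal.
        if e.sqlstate != ERRCODE_UNDEFINED_OBJECT:
            raise
    return True

# Example: the first call resolves the transaction; the retry sees
# UNDEFINED_OBJECT and still reports success.
prepared = {"fx_1_100_200"}

def execute(sql):
    gid = sql.split("'")[1]
    if gid not in prepared:
        raise ResolutionError(ERRCODE_UNDEFINED_OBJECT)
    prepared.discard(gid)

assert end_prepared_xact(execute, "fx_1_100_200", True)
assert end_prepared_xact(execute, "fx_1_100_200", True)  # idempotent retry
```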

From 639d9156323594430ec4b2217a95bfcf08195e9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 17:01:26 +0900
Subject: [PATCH v26 5/5] Add regression tests for atomic commit.

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 src/test/recovery/Makefile         |   2 +-
 src/test/recovery/t/016_fdwxact.pl | 175 +++++++++++++++++++++++++++++
 src/test/regress/pg_regress.c      |  13 ++-
 3 files changed, 185 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/016_fdwxact.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index e66e69521f..b17429f501 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pageinspect contrib/postgres_fdw
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/016_fdwxact.pl b/src/test/recovery/t/016_fdwxact.pl
new file mode 100644
index 0000000000..9af9bb81dc
--- /dev/null
+++ b/src/test/recovery/t/016_fdwxact.pl
@@ -0,0 +1,175 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
+max_foreign_transaction_resolvers = 2
+foreign_transaction_resolver_timeout = 0
+foreign_transaction_resolution_retry_interval = 5s
+foreign_twophase_commit = on
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+                               has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs2->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign servers on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE EXTENSION postgres_fdw
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port');
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port');
+));
+
+# Create user mapping on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+));
+
+# Create tables on foreign nodes and import them to the master node
+$node_fs1->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 (c int);
+));
+$node_fs2->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 (c int);
+));
+$node_master->safe_psql('postgres', qq(
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE l_table (c int);
+));
+
+# Switch to synchronous replication
+$node_master->safe_psql('postgres', qq(
+ALTER SYSTEM SET synchronous_standby_names ='*';
+));
+$node_master->reload;
+
+my $result;
+
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node. Check if we can commit and rollback the foreign transactions
+# after the normal recovery.
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->stop;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after recovery');
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node immediately. Check if we can commit and rollback the foreign
+# transactions after the crash recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (3);
+INSERT INTO t2 VALUES (3);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (4);
+INSERT INTO t2 VALUES (4);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the crash recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after crash recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after crash recovery');
+
+#
+# Commit transaction involving foreign servers and shutdown the master node
+# immediately before checkpoint. Check that WAL replay cleans up
+# its shared memory state and releases locks while replaying transaction commit.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (5);
+INSERT INTO t2 VALUES (5);
+COMMIT;
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', qq(
+SELECT count(*) FROM pg_foreign_xacts;
+));
+is($result, 0, "Cleanup of shared memory state for foreign transactions");
+
+#
+# Check if the standby node can process prepared foreign transaction
+# after promotion.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (6);
+INSERT INTO t2 VALUES (6);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (7);
+INSERT INTO t2 VALUES (7);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', qq(COMMIT PREPARED 'gxid1';));
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', qq(ROLLBACK PREPARED 'gxid2';));
+is($result, 0, 'Rollback foreign transaction after promotion');
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..82a1e7d541 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2336,9 +2336,12 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
          * Adjust the default postgresql.conf for regression testing. The user
          * can specify a file to be appended; in any case we expand logging
          * and set max_prepared_transactions to enable testing of prepared
-         * xacts.  (Note: to reduce the probability of unexpected shmmax
-         * failures, don't set max_prepared_transactions any higher than
-         * actually needed by the prepared_xacts regression test.)
+         * xacts.  We also set max_prepared_foreign_transactions and
+         * max_foreign_transaction_resolvers to enable testing of transaction
+         * involving multiple foreign servers. (Note: to reduce the probability
+         * of unexpected shmmax failures, don't set max_prepared_transactions
+         * any higher than actually needed by the prepared_xacts regression
+         * test.)
          */
         snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
         pg_conf = fopen(buf, "a");
@@ -2353,7 +2356,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
         fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
         fputs("log_lock_waits = on\n", pg_conf);
         fputs("log_temp_files = 128kB\n", pg_conf);
-        fputs("max_prepared_transactions = 2\n", pg_conf);
+        fputs("max_prepared_transactions = 3\n", pg_conf);
+        fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
+        fputs("max_foreign_transaction_resolvers = 2\n", pg_conf);
 
         for (sl = temp_configs; sl != NULL; sl = sl->next)
         {
-- 
2.23.0
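Taken together, the connection.c changes above implement the classic two-phase commit flow that foreign_twophase_commit = 'required' drives: PREPARE TRANSACTION on every written foreign server, then COMMIT PREPARED everywhere, rolling back everywhere if any prepare fails. A minimal Python sketch of that coordinator logic, where `Participant` is a hypothetical stand-in for a remote server rather than the patch's libpq code:

```python
class Participant:
    """Hypothetical remote server taking part in a distributed transaction."""
    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.state = "active"

    def prepare(self, gid):
        # Phase 1: PREPARE TRANSACTION on this server.
        if self.fail_prepare:
            self.state = "aborted"
            raise RuntimeError(f"{self.name}: PREPARE TRANSACTION failed")
        self.state = "prepared"

    def commit_prepared(self, gid):
        # Phase 2: COMMIT PREPARED; only valid once prepared.
        assert self.state == "prepared"
        self.state = "committed"

    def rollback(self, gid):
        if self.state != "committed":
            self.state = "aborted"

def atomic_commit(participants, gid):
    """Prepare on every server; commit prepared everywhere on success.
    If any prepare fails, roll back everywhere, so no server commits."""
    try:
        for p in participants:
            p.prepare(gid)
    except RuntimeError:
        for p in participants:
            p.rollback(gid)
        return False
    for p in participants:
        p.commit_prepared(gid)
    return True

# All-or-nothing: both servers commit, or both abort.
servers = [Participant("loopback"), Participant("loopback2")]
assert atomic_commit(servers, "gx1")
assert all(p.state == "committed" for p in servers)

servers = [Participant("loopback"), Participant("loopback2", fail_prepare=True)]
assert not atomic_commit(servers, "gx2")
assert all(p.state == "aborted" for p in servers)
```

This mirrors the regression tests above: a failed statement on one server leaves no committed changes on the other.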


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> This is the rebased (and a bit fixed) version of the patch. This
> applies on the master HEAD and passes all provided tests.
>
> I took over this work from Sawada-san. I'll begin with reviewing the
> current patch.
>

The previous patch set no longer applies cleanly to the current
HEAD. I've updated and slightly modified the code.

This patch set has been marked as Waiting on Author for a long time,
but the correct status now is Needs Review. The patches were actually
updated to incorporate all review comments, but they were not rebased
actively.

The mail[1] I posted before is helpful for understanding the current
patch design; there is also a README in the patch and a wiki page[2].

I've marked this as Needs Review.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDn98axH1bEoMnte%2BS7WWR%3DnsmOpjz1WGH-NvJi4aLu3Q%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Transactions involving multiple postgres foreign servers, take 2

From
amul sul
Date:
On Fri, Jan 24, 2020 at 11:31 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> This is the reased (and a bit fixed) version of the patch. This
> applies on the master HEAD and passes all provided tests.
>
> I took over this work from Sawada-san. I'll begin with reviewing the
> current patch.
>

The previous patch set is no longer applied cleanly to the current
HEAD. I've updated and slightly modified the codes.

This patch set has been marked as Waiting on Author for a long time
but the correct status now is Needs Review. The patch actually was
updated and incorporated all review comments but they was not rebased
actively.

The mail[1] I posted before would be helpful to understand the current
patch design and there are README in the patch and a wiki page[2].

I've marked this as Needs Review.


Hi Sawada san,

I just had a quick look at the 0001 and 0002 patches; here are a few suggestions.

patch: v27-0001:

Typo: s/non-temprary/non-temporary
----

patch: v27-0002: (Note: the left-hand number is the line number in the v27-0002 patch):

 138 +PostgreSQL's the global transaction manager (GTM), as a distributed transaction
 139 +participant The registered foreign transactions are tracked until the end of

Full stop "." is missing after "participant"


174 +API Contract With Transaction Management Callback Functions

Can we just say "Transaction Management Callback Functions"?
TBH, I am not sure that I understand this title.


 203 +processing foreign transaction (i.g. preparing, committing or aborting) the

Do you mean "i.e." instead of "i.g."?


269 + * RollbackForeignTransactionAPI. Registered participant servers are identified

Add a space between RollbackForeignTransaction and API.


 292 + * automatically so must be processed manually using by pg_resovle_fdwxact()

Do you mean pg_resolve_foreign_xact() here?


 320 + *   the foreign transaction is authorized to update the fields from its own
 321 + *   one.
 322 +
 323 + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a

Please add asterisk '*' on line#322.


 816 +static void
 817 +FdwXactPrepareForeignTransactions(void)
 818 +{
 819 +   ListCell        *lcell;

Let's have this variable name as "lc" like elsewhere.


1036 +           ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
1037 +                           errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
1038 +                                  xid, serverid, userid)));
1039 +   }

Incorrect formatting.

 
1166 +/*
1167 + * Return true and set FdwXactAtomicCommitReady to true if the current transaction

Do you mean ForeignTwophaseCommitIsRequired instead of FdwXactAtomicCommitReady?


3529 +
3530 +/*
3531 + * FdwXactLauncherRegister
3532 + *     Register a background worker running the foreign transaction
3533 + *      launcher.
3534 + */

This prolog style is not consistent with the other function in the file.


And here are the few typos:

s/conssitent/consistent
s/consisnts/consist
s/Foriegn/Foreign
s/tranascation/transaction
s/itselft/itself
s/rolbacked/rollbacked
s/trasaction/transaction
s/transactio/transaction
s/automically/automatically
s/CommitForeignTransaciton/CommitForeignTransaction
s/Similary/Similarly
s/FDWACT_/FDWXACT_
s/dink/disk
s/requried/required
s/trasactions/transactions
s/prepread/prepared
s/preapred/prepared
s/beging/being
s/gxact/xact
s/in-dbout/in-doubt
s/respecitively/respectively
s/transction/transaction
s/idenetifier/identifier
s/identifer/identifier
s/checkpoint'S/checkpoint's
s/fo/of
s/transcation/transaction
s/trasanction/transaction
s/non-temprary/non-temporary
s/resovler_internal.h/resolver_internal.h


Regards,
Amul

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Muhammad Usama
Date:
Hi Sawada San,

I have a couple of comments on "v27-0002-Support-atomic-commit-among-multiple-foreign-ser.patch"

1-  As part of the XLogReadRecord refactoring commit, the signature of XLogReadRecord was changed,
so the function call to XLogReadRecord() needs a small adjustment.

i.e. In function XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
...
-       record = XLogReadRecord(xlogreader, lsn, &errormsg);
+       XLogBeginRead(xlogreader, lsn);
+       record = XLogReadRecord(xlogreader, &errormsg);

2- In the register_fdwxact(..) function you are setting the XACT_FLAGS_FDWNOPREPARE transaction flag
when the register request comes in for a foreign server that does not support two-phase commit, regardless
of the value of the 'bool modified' argument. And later, in PreCommit_FdwXacts(), you just error out when
"foreign_twophase_commit" is set to 'required', only by looking at the XACT_FLAGS_FDWNOPREPARE flag,
which I think is not correct.
There is a possibility that the transaction might have only read from the foreign servers (the ones not capable of
handling transactions or two-phase commit), while all the other servers where we require an atomic commit
are capable of doing one.
If I am not missing something obvious here, then IMHO the XACT_FLAGS_FDWNOPREPARE flag should only
be set when the transaction management/two-phase functionality is not available and the "modified" argument is
true in register_fdwxact().
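The intended rule can be sketched like this (illustrative Python, not the patch's C code; the function name and flag mirror the patch's identifiers, the harness is made up):

```python
# Illustrative sketch: set XACT_FLAGS_FDWNOPREPARE only when a
# 2PC-incapable server was actually modified. Read-only participation
# on such a server should not block a 'required' foreign_twophase_commit.

XACT_FLAGS_FDWNOPREPARE = 1 << 0

def register_fdwxact(xact_flags, server_supports_2pc, modified):
    """Return the updated transaction flags after registering a participant."""
    if modified and not server_supports_2pc:
        xact_flags |= XACT_FLAGS_FDWNOPREPARE
    return xact_flags

flags = 0
# Read-only access to a 2PC-incapable server: flag stays clear.
flags = register_fdwxact(flags, server_supports_2pc=False, modified=False)
print(flags & XACT_FLAGS_FDWNOPREPARE)  # 0
# A write to a 2PC-incapable server: flag is set.
flags = register_fdwxact(flags, server_supports_2pc=False, modified=True)
print(flags & XACT_FLAGS_FDWNOPREPARE)  # 1
```

With this rule, PreCommit_FdwXacts() can keep checking only the flag, because the flag now already encodes "a modified server cannot prepare".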

Thanks

Best regards
Muhammad Usama
Highgo Software (Canada/China/Pakistan)

The new status of this patch is: Waiting on Author

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 11 Feb 2020 at 12:42, amul sul <sulamul@gmail.com> wrote:
>
> On Fri, Jan 24, 2020 at 11:31 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>> >
>> > Hello.
>> >
>> > This is the reased (and a bit fixed) version of the patch. This
>> > applies on the master HEAD and passes all provided tests.
>> >
>> > I took over this work from Sawada-san. I'll begin with reviewing the
>> > current patch.
>> >
>>
>> The previous patch set is no longer applied cleanly to the current
>> HEAD. I've updated and slightly modified the codes.
>>
>> This patch set has been marked as Waiting on Author for a long time
>> but the correct status now is Needs Review. The patch actually was
>> updated and incorporated all review comments but they was not rebased
>> actively.
>>
>> The mail[1] I posted before would be helpful to understand the current
>> patch design and there are README in the patch and a wiki page[2].
>>
>> I've marked this as Needs Review.
>>
>
> Hi Sawada san,
>
> I just had a quick look to 0001 and 0002 patch here is the few suggestions.
>
> patch: v27-0001:
>
> Typo: s/non-temprary/non-temporary
> ----
>
> patch: v27-0002: (Note:The left-hand number is the line number in the v27-0002 patch):
>
>  138 +PostgreSQL's the global transaction manager (GTM), as a distributed transaction
>  139 +participant The registered foreign transactions are tracked until the end of
>
> Full stop "." is missing after "participant"
>
>
> 174 +API Contract With Transaction Management Callback Functions
>
> Can we just say "Transaction Management Callback Functions";
> TOBH, I am not sure that I understand this title.
>
>
>  203 +processing foreign transaction (i.g. preparing, committing or aborting) the
>
> Do you mean "i.e" instead of i.g. ?
>
>
> 269 + * RollbackForeignTransactionAPI. Registered participant servers are identified
>
> Add space before between RollbackForeignTransaction and API.
>
>
>  292 + * automatically so must be processed manually using by pg_resovle_fdwxact()
>
> Do you mean pg_resolve_foreign_xact() here?
>
>
>  320 + *   the foreign transaction is authorized to update the fields from its own
>  321 + *   one.
>  322 +
>  323 + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a
>
> Please add asterisk '*' on line#322.
>
>
>  816 +static void
>  817 +FdwXactPrepareForeignTransactions(void)
>  818 +{
>  819 +   ListCell        *lcell;
>
> Let's have this variable name as "lc" like elsewhere.
>
>
> 1036 +           ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
> 1037 +                           errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
> 1038 +                                  xid, serverid, userid)));
> 1039 +   }
>
> Incorrect formatting.
>
>
> 1166 +/*
> 1167 + * Return true and set FdwXactAtomicCommitReady to true if the current transaction
>
> Do you mean ForeignTwophaseCommitIsRequired instead of FdwXactAtomicCommitReady?
>
>
> 3529 +
> 3530 +/*
> 3531 + * FdwXactLauncherRegister
> 3532 + *     Register a background worker running the foreign transaction
> 3533 + *      launcher.
> 3534 + */
>
> This prolog style is not consistent with the other function in the file.
>
>
> And here are the few typos:
>
> s/conssitent/consistent
> s/consisnts/consist
> s/Foriegn/Foreign
> s/tranascation/transaction
> s/itselft/itself
> s/rolbacked/rollbacked
> s/trasaction/transaction
> s/transactio/transaction
> s/automically/automatically
> s/CommitForeignTransaciton/CommitForeignTransaction
> s/Similary/Similarly
> s/FDWACT_/FDWXACT_
> s/dink/disk
> s/requried/required
> s/trasactions/transactions
> s/prepread/prepared
> s/preapred/prepared
> s/beging/being
> s/gxact/xact
> s/in-dbout/in-doubt
> s/respecitively/respectively
> s/transction/transaction
> s/idenetifier/identifier
> s/identifer/identifier
> s/checkpoint'S/checkpoint's
> s/fo/of
> s/transcation/transaction
> s/trasanction/transaction
> s/non-temprary/non-temporary
> s/resovler_internal.h/resolver_internal.h
>
>

Thank you for reviewing the patch! I've incorporated all comments in
local branch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 19 Feb 2020 at 07:55, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 11 Feb 2020 at 12:42, amul sul <sulamul@gmail.com> wrote:
> >
> > On Fri, Jan 24, 2020 at 11:31 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>
> >> On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >> >
> >> > Hello.
> >> >
> >> > This is the reased (and a bit fixed) version of the patch. This
> >> > applies on the master HEAD and passes all provided tests.
> >> >
> >> > I took over this work from Sawada-san. I'll begin with reviewing the
> >> > current patch.
> >> >
> >>
> >> The previous patch set is no longer applied cleanly to the current
> >> HEAD. I've updated and slightly modified the codes.
> >>
> >> This patch set has been marked as Waiting on Author for a long time
> >> but the correct status now is Needs Review. The patch actually was
> >> updated and incorporated all review comments but they was not rebased
> >> actively.
> >>
> >> The mail[1] I posted before would be helpful to understand the current
> >> patch design and there are README in the patch and a wiki page[2].
> >>
> >> I've marked this as Needs Review.
> >>
> >
> > Hi Sawada san,
> >
> > I just had a quick look to 0001 and 0002 patch here is the few suggestions.
> >
> > patch: v27-0001:
> >
> > Typo: s/non-temprary/non-temporary
> > ----
> >
> > patch: v27-0002: (Note:The left-hand number is the line number in the v27-0002 patch):
> >
> >  138 +PostgreSQL's the global transaction manager (GTM), as a distributed transaction
> >  139 +participant The registered foreign transactions are tracked until the end of
> >
> > Full stop "." is missing after "participant"
> >
> >
> > 174 +API Contract With Transaction Management Callback Functions
> >
> > Can we just say "Transaction Management Callback Functions";
> > TOBH, I am not sure that I understand this title.
> >
> >
> >  203 +processing foreign transaction (i.g. preparing, committing or aborting) the
> >
> > Do you mean "i.e" instead of i.g. ?
> >
> >
> > 269 + * RollbackForeignTransactionAPI. Registered participant servers are identified
> >
> > Add space before between RollbackForeignTransaction and API.
> >
> >
> >  292 + * automatically so must be processed manually using by pg_resovle_fdwxact()
> >
> > Do you mean pg_resolve_foreign_xact() here?
> >
> >
> >  320 + *   the foreign transaction is authorized to update the fields from its own
> >  321 + *   one.
> >  322 +
> >  323 + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a
> >
> > Please add asterisk '*' on line#322.
> >
> >
> >  816 +static void
> >  817 +FdwXactPrepareForeignTransactions(void)
> >  818 +{
> >  819 +   ListCell        *lcell;
> >
> > Let's have this variable name as "lc" like elsewhere.
> >
> >
> > 1036 +           ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
> > 1037 +                           errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
> > 1038 +                                  xid, serverid, userid)));
> > 1039 +   }
> >
> > Incorrect formatting.
> >
> >
> > 1166 +/*
> > 1167 + * Return true and set FdwXactAtomicCommitReady to true if the current transaction
> >
> > Do you mean ForeignTwophaseCommitIsRequired instead of FdwXactAtomicCommitReady?
> >
> >
> > 3529 +
> > 3530 +/*
> > 3531 + * FdwXactLauncherRegister
> > 3532 + *     Register a background worker running the foreign transaction
> > 3533 + *      launcher.
> > 3534 + */
> >
> > This prolog style is not consistent with the other function in the file.
> >
> >
> > And here are the few typos:
> >
> > s/conssitent/consistent
> > s/consisnts/consist
> > s/Foriegn/Foreign
> > s/tranascation/transaction
> > s/itselft/itself
> > s/rolbacked/rollbacked
> > s/trasaction/transaction
> > s/transactio/transaction
> > s/automically/automatically
> > s/CommitForeignTransaciton/CommitForeignTransaction
> > s/Similary/Similarly
> > s/FDWACT_/FDWXACT_
> > s/dink/disk
> > s/requried/required
> > s/trasactions/transactions
> > s/prepread/prepared
> > s/preapred/prepared
> > s/beging/being
> > s/gxact/xact
> > s/in-dbout/in-doubt
> > s/respecitively/respectively
> > s/transction/transaction
> > s/idenetifier/identifier
> > s/identifer/identifier
> > s/checkpoint'S/checkpoint's
> > s/fo/of
> > s/transcation/transaction
> > s/trasanction/transaction
> > s/non-temprary/non-temporary
> > s/resovler_internal.h/resolver_internal.h
> >
> >
>
> Thank you for reviewing the patch! I've incorporated all comments in
> local branch.

Attached the updated version patch sets that incorporated review
comments from Amul and Muhammad.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2

From
Masahiko Sawada
Date:
On Tue, 18 Feb 2020 at 00:40, Muhammad Usama <m.usama@gmail.com> wrote:
>
> Hi Sawada San,
>
> I have a couple of comments on "v27-0002-Support-atomic-commit-among-multiple-foreign-ser.patch"
>
> 1-  As part of the XLogReadRecord refactoring commit the signature of XLogReadRecord was changed,
> so the function call to XLogReadRecord() needs a small adjustment.
>
> i.e. In function XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
> ...
> -       record = XLogReadRecord(xlogreader, lsn, &errormsg);
> +       XLogBeginRead(xlogreader, lsn)
> +       record = XLogReadRecord(xlogreader, &errormsg);
>
> 2- In register_fdwxact(..) function you are setting the XACT_FLAGS_FDWNOPREPARE transaction flag
> when the register request comes in for foreign server that does not support two-phase commit regardless
> of the value of 'bool modified' argument. And later in the PreCommit_FdwXacts() you just error out when
> "foreign_twophase_commit" is set to 'required' only by looking at the XACT_FLAGS_FDWNOPREPARE flag.
> which I think is not correct.
> Since there is a possibility that the transaction might have only read from the foreign servers (not capable of
> handling transactions or two-phase commit) and all other servers where we require to do atomic commit
> are capable enough of doing so.
> If I am not missing something obvious here, then IMHO the XACT_FLAGS_FDWNOPREPARE flag should only
> be set when the transaction management/two-phase functionality is not available and "modified" argument is
> true in register_fdwxact()
>

Thank you for reviewing this patch!

Your comments are incorporated in the latest patch set I recently sent[1].

[1] https://www.postgresql.org/message-id/CA%2Bfd4k5ZcDvoiY_5c-mF1oDACS5nUWS7ppoiOwjCOnM%2BgrJO-Q%40mail.gmail.com

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:

On Sat, Feb 22, 2020 at 7:15 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Wed, 19 Feb 2020 at 07:55, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 11 Feb 2020 at 12:42, amul sul <sulamul@gmail.com> wrote:
> >
> > On Fri, Jan 24, 2020 at 11:31 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>
> >> On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >> >
> >> > Hello.
> >> >
> >> > This is the reased (and a bit fixed) version of the patch. This
> >> > applies on the master HEAD and passes all provided tests.
> >> >
> >> > I took over this work from Sawada-san. I'll begin with reviewing the
> >> > current patch.
> >> >
> >>
> >> The previous patch set is no longer applied cleanly to the current
> >> HEAD. I've updated and slightly modified the codes.
> >>
> >> This patch set has been marked as Waiting on Author for a long time
> >> but the correct status now is Needs Review. The patch actually was
> >> updated and incorporated all review comments but they was not rebased
> >> actively.
> >>
> >> The mail[1] I posted before would be helpful to understand the current
> >> patch design and there are README in the patch and a wiki page[2].
> >>
> >> I've marked this as Needs Review.
> >>
> >
> > Hi Sawada san,
> >
> > I just had a quick look to 0001 and 0002 patch here is the few suggestions.
> >
> > patch: v27-0001:
> >
> > Typo: s/non-temprary/non-temporary
> > ----
> >
> > patch: v27-0002: (Note:The left-hand number is the line number in the v27-0002 patch):
> >
> >  138 +PostgreSQL's the global transaction manager (GTM), as a distributed transaction
> >  139 +participant The registered foreign transactions are tracked until the end of
> >
> > Full stop "." is missing after "participant"
> >
> >
> > 174 +API Contract With Transaction Management Callback Functions
> >
> > Can we just say "Transaction Management Callback Functions";
> > TOBH, I am not sure that I understand this title.
> >
> >
> >  203 +processing foreign transaction (i.g. preparing, committing or aborting) the
> >
> > Do you mean "i.e" instead of i.g. ?
> >
> >
> > 269 + * RollbackForeignTransactionAPI. Registered participant servers are identified
> >
> > Add space before between RollbackForeignTransaction and API.
> >
> >
> >  292 + * automatically so must be processed manually using by pg_resovle_fdwxact()
> >
> > Do you mean pg_resolve_foreign_xact() here?
> >
> >
> >  320 + *   the foreign transaction is authorized to update the fields from its own
> >  321 + *   one.
> >  322 +
> >  323 + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a
> >
> > Please add asterisk '*' on line#322.
> >
> >
> >  816 +static void
> >  817 +FdwXactPrepareForeignTransactions(void)
> >  818 +{
> >  819 +   ListCell        *lcell;
> >
> > Let's have this variable name as "lc" like elsewhere.
> >
> >
> > 1036 +           ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
> > 1037 +                           errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
> > 1038 +                                  xid, serverid, userid)));
> > 1039 +   }
> >
> > Incorrect formatting.
> >
> >
> > 1166 +/*
> > 1167 + * Return true and set FdwXactAtomicCommitReady to true if the current transaction
> >
> > Do you mean ForeignTwophaseCommitIsRequired instead of FdwXactAtomicCommitReady?
> >
> >
> > 3529 +
> > 3530 +/*
> > 3531 + * FdwXactLauncherRegister
> > 3532 + *     Register a background worker running the foreign transaction
> > 3533 + *      launcher.
> > 3534 + */
> >
> > This prolog style is not consistent with the other function in the file.
> >
> >
> > And here are the few typos:
> >
> > s/conssitent/consistent
> > s/consisnts/consist
> > s/Foriegn/Foreign
> > s/tranascation/transaction
> > s/itselft/itself
> > s/rolbacked/rollbacked
> > s/trasaction/transaction
> > s/transactio/transaction
> > s/automically/automatically
> > s/CommitForeignTransaciton/CommitForeignTransaction
> > s/Similary/Similarly
> > s/FDWACT_/FDWXACT_
> > s/dink/disk
> > s/requried/required
> > s/trasactions/transactions
> > s/prepread/prepared
> > s/preapred/prepared
> > s/beging/being
> > s/gxact/xact
> > s/in-dbout/in-doubt
> > s/respecitively/respectively
> > s/transction/transaction
> > s/idenetifier/identifier
> > s/identifer/identifier
> > s/checkpoint'S/checkpoint's
> > s/fo/of
> > s/transcation/transaction
> > s/trasanction/transaction
> > s/non-temprary/non-temporary
> > s/resovler_internal.h/resolver_internal.h
> >
> >
>
> Thank you for reviewing the patch! I've incorporated all comments in
> local branch.

Attached the updated version patch sets that incorporated review
comments from Amul and Muhammad.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi Sawada San,

I have been further reviewing and testing the patches for transactions involving multiple servers.
Overall the patches are working as expected, bar a few important exceptions.
So, as discussed over the call, I have fixed the issues I found during testing
and also rebased the patches on the current head of the master branch.
Can you please have a look at the attached updated patches?

Below is the list of changes I have made on top of V18 patches.

1- In register_fdwxact(), since we are just storing the callback function pointers from
FdwRoutine in the fdw_part structure, I think we can avoid calling
GetFdwRoutineByServerId() in TopMemoryContext.
So I have moved the MemoryContextSwitchTo(TopMemoryContext) to after the
GetFdwRoutineByServerId() call.


2- If the PrepareForeignTransaction functionality is not present in some FDW, then
during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
transaction flag if the modified flag is also set for that server. For a server that has
not done any data modification within the transaction, we do not do two-phase commit anyway.

3- I have moved foreign_twophase_commit in the sample file to after
max_foreign_transaction_resolvers, because the default value of max_foreign_transaction_resolvers
is 0 and enabling foreign_twophase_commit produces an error with the default
parameter positioning in postgresql.conf.
Also, the foreign_twophase_commit configuration was missing the comments
about allowed values in the sample config file.

4- Setting ForeignTwophaseCommitIsRequired in the is_foreign_twophase_commit_required()
function does not seem to be the correct place. The reason is that even when
is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
to true, we could still end up not using two-phase commit, in the case where some server does
not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to the PreCommit_FdwXacts()
function, after doing the prepare transaction.

6- In prefer mode, we commit the transaction in a single phase if the server does not support
two-phase commit. But instead of doing the single-phase commit right away,
IMHO the better way is to wait until all the two-phase transactions are successfully prepared
on the servers that support two-phase commit, since an error during the "PREPARE" stage would
roll back the transaction, and in that case we would end up with committed transactions on
the servers that lack two-phase commit support.
So I have modified the flow a little bit: instead of doing a one-phase commit right away,
the servers that do not support two-phase commit are added to another list, and that list is
processed once we have successfully prepared all the transactions on the two-phase-capable
foreign servers. Although this technique is also not bulletproof, it is still better than doing
the one-phase commits before doing the PREPAREs.

Also, I think we can improve on this by throwing an error even in PREFER
mode if there is more than one server that had data modified within the transaction
and lacks two-phase commit support.
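As a rough sketch (in Python, with a made-up participant representation; the patch itself is C, and the failure-injection argument exists only for demonstration), the reordered prefer-mode flow looks like:

```python
# Sketch of the prefer-mode ordering described above: PREPARE every
# 2PC-capable server first; one-phase COMMIT the incapable ones only
# after all PREPAREs have succeeded. A PREPARE failure raises before
# any one-phase commit has happened, so those servers can still be
# rolled back.

def precommit_fdwxacts(participants, prepare_fails=()):
    """participants: list of (name, supports_2pc) pairs.
    prepare_fails: names whose PREPARE is simulated to fail."""
    deferred = []     # 2PC-incapable servers; commit them last
    prepared = []
    for name, supports_2pc in participants:
        if not supports_2pc:
            deferred.append(name)            # do NOT commit yet
            continue
        if name in prepare_fails:
            # Abort: nothing on the deferred list has been committed yet.
            raise RuntimeError("PREPARE failed on " + name)
        prepared.append(name)
    committed = deferred                     # one-phase commits happen only now
    return prepared, committed

print(precommit_fdwxacts([("s1", True), ("s2", False)]))
# -> (['s1'], ['s2'])
```

The window where a PREPARE failure leaves a one-phase commit already applied is what this ordering removes; the remaining (unavoidable) window is between the one-phase commits and the final COMMIT PREPAREDs.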

7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
memory when an fdw_part is removed from the list.

8- The function FdwXactWaitToBeResolved() was bailing out as soon as it found
(FdwXactParticipants == NIL). The problem with that was that in the case of
"COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL, and
effectively the foreign prepared transactions (if any) associated with locally
prepared transactions were never getting resolved automatically.


postgres=# BEGIN;
BEGIN
INSERT INTO test_local  VALUES ( 2, 'TWO');
INSERT 0 1
INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
INSERT 0 1
INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
INSERT 0 1
postgres=*# PREPARE TRANSACTION 'local_prepared';
PREPARE TRANSACTION

postgres=# select * from pg_foreign_xacts ; 
dbid  | xid | serverid | userid |  status  | in_doubt |         identifier        
-------+-----+----------+--------+----------+----------+----------------------------
 12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
 12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
(2 rows)

-- Now commit the prepared transaction

postgres=# COMMIT PREPARED 'local_prepared';                                                                                                                  

COMMIT PREPARED

--Foreign prepared transactions associated with 'local_prepared' not resolved

postgres=#

postgres=# select * from pg_foreign_xacts ; 
dbid  | xid | serverid | userid |  status  | in_doubt |         identifier        
-------+-----+----------+--------+----------+----------+----------------------------
 12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
 12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
(2 rows)


So to fix this, in the case of a two-phase transaction the function now checks for the existence
of associated foreign prepared transactions before bailing out.
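The fixed bail-out condition can be sketched as follows (illustrative Python; the dictionary and names are hypothetical stand-ins for the patch's shared-memory state, using the xid 515 from the session above):

```python
# Sketch of the fix: don't bail out just because FdwXactParticipants is
# empty -- for COMMIT/ROLLBACK PREPARED it always is. Also check whether
# prepared foreign transactions are associated with the local xid.
# 'foreign_xacts' stands in for the shared-memory FdwXact entries.

foreign_xacts = {515: ["fx_a_515", "fx_b_515"]}   # xid -> prepared entries

def wait_to_be_resolved(xid, participants):
    entries = participants or foreign_xacts.get(xid, [])
    if not entries:
        return "nothing to resolve"   # the old code always bailed out here
    return "waiting for %d foreign transaction(s)" % len(entries)

print(wait_to_be_resolved(515, []))   # -> waiting for 2 foreign transaction(s)
print(wait_to_be_resolved(999, []))   # -> nothing to resolve
```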

9- In the function XlogReadFdwXactData(), an XLogBeginRead() call was missing before XLogReadRecord(),
which was causing a crash during recovery.

10- Incorporated the set_ps_display() signature change.


Best regards,

...
Muhammad Usama
HighGo Software (Canada/China/Pakistan) 
 
Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
> Hi Sawada San,
>
> I have been further reviewing and testing the transaction involving multiple server patches.
> Overall the patches are working as expected bar a few important exceptions.
> So as discussed over the call I have fixed the issues I found during the testing
> and also rebased the patches with the current head of the master branch.
> So can you please have a look at the attached updated patches.

Thank you for reviewing and updating the patch!

>
> Below is the list of changes I have made on top of V18 patches.
>
> 1- In register_fdwxact(), As we are just storing the callback function pointers from
> FdwRoutine in fdw_part structure, So I think we can avoid calling
> GetFdwRoutineByServerId() in TopMemoryContext.
> So I have moved the MemoryContextSwitch to TopMemoryContext after the
> GetFdwRoutineByServerId() call.

Agreed.

>
>
> 2- If PrepareForeignTransaction functionality is not present in some FDW then
> during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> transaction flag if the modified flag is also set for that server. As for the server that has
> not done any data modification within the transaction we do not do two-phase commit anyway.

Agreed.

>
> 3- I have moved the foreign_twophase_commit in sample file after
> max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers
> is 0 and enabling the foreign_twophase_commit produces an error with default
> configuration parameter positioning in postgresql.conf
> Also, foreign_twophase_commit configuration was missing the comments
> about allowed values in the sample config file.

Sounds good. Agreed.

>
> 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> function does not seem to be the correct place. The reason being, even when
> is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> to true, we could still end up not using the two-phase commit in the case when some server does
> not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> function after doing the prepare transaction.

Agreed.

>
> 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> the two-phase commit. But instead of doing the single-phase commit right away,
> IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> on servers that support the two-phase. Since an error during a "PREPARE" stage would
> rollback the transaction and in that case, we would end up with committed transactions on
> the server that lacks the support of the two-phase commit.

When an error occurs before the local commit, a 2pc-unsupported
server could be rolled back or committed depending on the error
timing. On the other hand, all 2pc-supported servers are always rolled
back when an error occurs before the local commit. Therefore, even if
we change the order of COMMIT and PREPARE, it is still possible that we
will end up committing on some of the 2pc-unsupported servers while
rolling back the others, including the 2pc-supported servers.

I guess the motivation for your change is that, since errors are likely
to happen while executing PREPARE on foreign servers, we can minimize
the possibility of rolling back 2pc-unsupported servers by deferring
their commit as much as possible. Is that right?

> So I have modified the flow a little bit and instead of doing a one-phase commit right away
> the servers that do not support a two-phase commit is added to another list and that list is
> processed after once we have successfully prepared all the transactions on two-phase supported
> foreign servers. Although this technique is also not bulletproof, still it is better than doing
> the one-phase commits before doing the PREPAREs.

Hmm, the current logic seems complex. Maybe we can just reverse the
order of COMMIT and PREPARE: do PREPARE on all 2pc-supported and
modified servers first, and then do COMMIT on the others?

>
> Also, I think we can improve on this one by throwing an error even in PREFER
> mode if there is more than one server that had data modified within the transaction
> and lacks the two-phase commit support.
>

IIUC the concept of PREFER mode is that the transaction uses 2pc only
for 2pc-supported servers. IOW, even if the transaction modifies data
on a 2pc-unsupported server we can proceed with the commit in PREFER
mode, whereas we cannot in REQUIRED mode. What is the motivation of
your above idea?

> 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> memory if fdw_part is removed from the list

I think at the end of the transaction we free the entries of the
FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
need to do that in PreCommit_FdwXacts() as well?

>
> 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
> (FdwXactParticipants == NIL). The problem with that was in the case of
> "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
> effectively the foreign prepared transactions(if any) associated with locally
> prepared transactions were never getting resolved automatically.
>
>
> postgres=# BEGIN;
> BEGIN
> INSERT INTO test_local  VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
> INSERT 0 1
> postgres=*# PREPARE TRANSACTION 'local_prepared';
> PREPARE TRANSACTION
>
> postgres=# select * from pg_foreign_xacts ;
> dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> -------+-----+----------+--------+----------+----------+----------------------------
>  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> (2 rows)
>
> -- Now commit the prepared transaction
>
> postgres=# COMMIT PREPARED 'local_prepared';
>
> COMMIT PREPARED
>
> --Foreign prepared transactions associated with 'local_prepared' not resolved
>
> postgres=#
>
> postgres=# select * from pg_foreign_xacts ;
> dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> -------+-----+----------+--------+----------+----------+----------------------------
>  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> (2 rows)
>
>
> So to fix this in case of the two-phase transaction, the function checks the existence
> of associated foreign prepared transactions before bailing out.
>

Good catch. But looking at your change, we should not accept the case
where FdwXactParticipants == NIL but TwoPhaseExists(wait_xid) ==
false.

       if (FdwXactParticipants == NIL)
       {
               /*
                * If we are here because of COMMIT/ROLLBACK PREPARED then the
                * FdwXactParticipants list would be empty, so we need to
                * see whether any foreign prepared transactions exist
                * for this prepared transaction.
                */
               if (TwoPhaseExists(wait_xid))
               {
                       List *foreign_trans = NIL;

                       foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid,
                                                    InvalidOid, InvalidOid,
                                                    false, false, true);

                       if (foreign_trans == NIL)
                               return;
                       list_free(foreign_trans);
               }
       }

> 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> that was causing the crash during recovery.

Agreed.

>
> 10- incorporated set_ps_display() signature change.

Thanks.

Regarding other changes you did in v19 patch, I have some comments:

1.
+       ereport(LOG,
+                       (errmsg("trying to %s the foreign transaction associated with transaction %u on server %u",
+                                       fdwxact->status == FDWXACT_STATUS_COMMITTING ? "COMMIT" : "ABORT",
+                                       fdwxact->local_xid,
+                                       fdwxact->serverid)));
+

Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?

2.
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
deleted file mode 120000
index ce8c21880c..0000000000
--- a/src/bin/pg_waldump/fdwxactdesc.c
+++ /dev/null
@@ -1 +0,0 @@
-../../../src/backend/access/rmgrdesc/fdwxactdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
new file mode 100644
index 0000000000..ce8c21880c
--- /dev/null
+++ b/src/bin/pg_waldump/fdwxactdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/fdwxactdesc.c

We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.

3.
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1526,14 +1526,14 @@ postgres   27093  0.0  0.0  30096  2752 ?
  Ss   11:34   0:00 postgres: ser
          <entry><literal>SafeSnapshot</literal></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY
DEFERRABLE</literal> transaction.</entry>
         </row>
-        <row>
-         <entry><literal>SyncRep</literal></entry>
-         <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
-        </row>
         <row>
          <entry><literal>FdwXactResolution</literal></entry>
          <entry>Waiting for all foreign transaction participants to
be resolved during atomic commit among foreign servers.</entry>
         </row>
+        <row>
+         <entry><literal>SyncRep</literal></entry>
+         <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
+        </row>
         <row>
          <entry morerows="4"><literal>Timeout</literal></entry>
          <entry><literal>BaseBackupThrottle</literal></entry>

We need to move the entry of FdwXactResolution to right before
Hash/Batch/Allocating for alphabetical order.

I've incorporated the changes of yours that I agreed with into my
local branch and will incorporate the other changes after discussion.
I'll also do more testing and self-review and will submit the latest
version of the patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
> Hi Sawada San,
>
> I have been further reviewing and testing the transaction involving multiple server patches.
> Overall the patches are working as expected bar a few important exceptions.
> So as discussed over the call I have fixed the issues I found during the testing
> and also rebased the patches with the current head of the master branch.
> So can you please have a look at the attached updated patches.

Thank you for reviewing and updating the patch!

>
> Below is the list of changes I have made on top of V18 patches.
>
> 1- In register_fdwxact(), As we are just storing the callback function pointers from
> FdwRoutine in fdw_part structure, So I think we can avoid calling
> GetFdwRoutineByServerId() in TopMemoryContext.
> So I have moved the MemoryContextSwitch to TopMemoryContext after the
> GetFdwRoutineByServerId() call.

Agreed.

>
>
> 2- If PrepareForeignTransaction functionality is not present in some FDW then
> during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> transaction flag if the modified flag is also set for that server. As for the server that has
> not done any data modification within the transaction we do not do two-phase commit anyway.

Agreed.

>
> 3- I have moved the foreign_twophase_commit in sample file after
> max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers
> is 0 and enabling the foreign_twophase_commit produces an error with default
> configuration parameter positioning in postgresql.conf
> Also, foreign_twophase_commit configuration was missing the comments
> about allowed values in the sample config file.

Sounds good. Agreed.

>
> 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> function does not seem to be the correct place. The reason being, even when
> is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> to true, we could still end up not using the two-phase commit in the case when some server does
> not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> function after doing the prepare transaction.

Agreed.

>
> 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> the two-phase commit. But instead of doing the single-phase commit right away,
> IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> on servers that support the two-phase. Since an error during a "PREPARE" stage would
> rollback the transaction and in that case, we would end up with committed transactions on
> the server that lacks the support of the two-phase commit.

When an error occurred before the local commit, a 2pc-unsupported
server could be rolled back or committed depending on the error
timing. On the other hand all 2pc-supported servers are always rolled
back when an error occurred before the local commit. Therefore even if
we change the order of COMMIT and PREPARE it is still possible that we
will end up committing the part of 2pc-unsupported servers while
rolling back others including 2pc-supported servers.

I guess the motivation of your change is that since errors are likely
to happen during executing PREPARE on foreign servers, we can minimize
the possibility of rolling back 2pc-unsupported servers by deferring
the commit of 2pc-unsupported server as much as possible. Is that
right?

Yes, that is correct. The idea of doing the COMMIT on non-2pc-supported
servers after all the PREPAREs have succeeded is to minimize the chances
of partial commits. And as you mentioned, there will still be a chance
of a partial commit even with this approach, but the probability would
be lower than with the current sequence.

 

> So I have modified the flow a little bit and instead of doing a one-phase commit right away
> the servers that do not support a two-phase commit is added to another list and that list is
> processed after once we have successfully prepared all the transactions on two-phase supported
> foreign servers. Although this technique is also not bulletproof, still it is better than doing
> the one-phase commits before doing the PREPAREs.

Hmm the current logic seems complex. Maybe we can just reverse the
order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
modified servers first and then do COMMIT on others?

Agreed, seems reasonable. 

>
> Also, I think we can improve on this one by throwing an error even in PREFER
> mode if there is more than one server that had data modified within the transaction
> and lacks the two-phase commit support.
>

IIUC the concept of PREFER mode is that the transaction uses 2pc only
for 2pc-supported servers. IOW, even if the transaction modifies on a
2pc-unsupported server we can proceed with the commit if in PREFER
mode, which cannot if in REQUIRED mode. What is the motivation of your
above idea?

I was thinking that we could change the behavior of PREFER mode such
that we only allow the transaction to COMMIT if it needs to do a
single-phase commit on at most one server. That way we can ensure that
we would never end up with a partial commit.

One idea in this regard would be to switch the local transaction to
commit using 2pc if there is only one foreign server in the transaction
that does not support 2pc, ensuring that the number of 1pc-commit
servers is always less than or equal to one. If more than one foreign
server requires 1pc then we just throw an error.

However, having said that, I am not 100% sure if it is a good or an
acceptable idea, and I am okay with continuing with the current
behavior of PREFER mode if we document that this mode can cause a
partial commit.
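A rough sketch of that stricter variant (the function name and dict keys are hypothetical, not the patch's API): count the modified participants lacking 2pc support and reject the commit when there is more than one.

```python
def check_prefer_mode(participants):
    # participants: list of dicts with 'modified' and 'supports_2pc'
    # flags; purely illustrative of the proposed rule.
    onephase = [p for p in participants
                if p["modified"] and not p["supports_2pc"]]
    if len(onephase) > 1:
        raise ValueError("cannot guarantee atomicity: more than one "
                         "modified server lacks two-phase commit")
    # With at most one 1pc server left, the local transaction itself
    # could be prepared as well, so the lone single-phase COMMIT is the
    # only step that can break atomicity.
    return len(onephase)
```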


> 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> memory if fdw_part is removed from the list

I think at the end of the transaction we free entries of
FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
need to do that in PreCommit_FdwXacts()?

Correct me if I am wrong: the fdw_part structures are created in
TopMemoryContext, and if an fdw_part structure is removed from the list
at the pre-commit stage (because we did a 1PC COMMIT on it) then it
would leak memory.


>
> 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
> (FdwXactParticipants == NIL). The problem with that was in the case of
> "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
> effectively the foreign prepared transactions(if any) associated with locally
> prepared transactions were never getting resolved automatically.
>
>
> postgres=# BEGIN;
> BEGIN
> INSERT INTO test_local  VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
> INSERT 0 1
> postgres=*# PREPARE TRANSACTION 'local_prepared';
> PREPARE TRANSACTION
>
> postgres=# select * from pg_foreign_xacts ;
> dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> -------+-----+----------+--------+----------+----------+----------------------------
>  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> (2 rows)
>
> -- Now commit the prepared transaction
>
> postgres=# COMMIT PREPARED 'local_prepared';
>
> COMMIT PREPARED
>
> --Foreign prepared transactions associated with 'local_prepared' not resolved
>
> postgres=#
>
> postgres=# select * from pg_foreign_xacts ;
> dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> -------+-----+----------+--------+----------+----------+----------------------------
>  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> (2 rows)
>
>
> So to fix this in case of the two-phase transaction, the function checks the existence
> of associated foreign prepared transactions before bailing out.
>

Good catch. But looking at your change, we should not accept the case
where FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) ==
false.

       if (FdwXactParticipants == NIL)
       {
               /*
                * If we are here because of COMMIT/ROLLBACK PREPARED then the
                * FdwXactParticipants list would be empty. So we need to
                * see if there are any foreign prepared transactions exists
                * for this prepared transaction
                */
               if (TwoPhaseExists(wait_xid))
               {
                       List *foreign_trans = NIL;

                       foreign_trans = get_fdwxacts(MyDatabaseId,
wait_xid, InvalidOid, InvalidOid,
                                        false, false, true);

                       if (foreign_trans == NIL)
                               return;
                       list_free(foreign_trans);
               }
       }

 
Sorry, my bad, it was a mistake on my part. We should just return from
the function when FdwXactParticipants == NIL but
TwoPhaseExists(wait_xid) == false.

        if (TwoPhaseExists(wait_xid))
        {
            List *foreign_trans = NIL;

            foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid,
                                         InvalidOid, InvalidOid,
                                         false, false, true);

            if (foreign_trans == NIL)
                return;
            list_free(foreign_trans);
        }
        else
            return;
  
> 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> that was causing the crash during recovery.

Agreed.

>
> 10- incorporated set_ps_display() signature change.

Thanks.

Regarding other changes you did in v19 patch, I have some comments:

1.
+       ereport(LOG,
+                       (errmsg("trying to %s the foreign transaction
associated with transaction %u on server %u",
+                                       fdwxact->status ==
FDWXACT_STATUS_COMMITTING?"COMMIT":"ABORT",
+                                       fdwxact->local_xid,
fdwxact->serverid)));
+

Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?

That change was not intended to get into the patch file. I had added it
during testing to quickly get info on which way the transaction was
going to be resolved.


2.
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
deleted file mode 120000
index ce8c21880c..0000000000
--- a/src/bin/pg_waldump/fdwxactdesc.c
+++ /dev/null
@@ -1 +0,0 @@
-../../../src/backend/access/rmgrdesc/fdwxactdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
new file mode 100644
index 0000000000..ce8c21880c
--- /dev/null
+++ b/src/bin/pg_waldump/fdwxactdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/fdwxactdesc.c

We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.

Again, sorry! That was an oversight on my part.


3.
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1526,14 +1526,14 @@ postgres   27093  0.0  0.0  30096  2752 ?
  Ss   11:34   0:00 postgres: ser
          <entry><literal>SafeSnapshot</literal></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY
DEFERRABLE</literal> transaction.</entry>
         </row>
-        <row>
-         <entry><literal>SyncRep</literal></entry>
-         <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
-        </row>
         <row>
          <entry><literal>FdwXactResolution</literal></entry>
          <entry>Waiting for all foreign transaction participants to
be resolved during atomic commit among foreign servers.</entry>
         </row>
+        <row>
+         <entry><literal>SyncRep</literal></entry>
+         <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
+        </row>
         <row>
          <entry morerows="4"><literal>Timeout</literal></entry>
          <entry><literal>BaseBackupThrottle</literal></entry>

We need to move the entry of FdwXactResolution to right before
Hash/Batch/Allocating for alphabetical order.

Agreed! 

I've incorporated your changes I agreed with to my local branch and
will incorporate other changes after discussion. I'll also do more
test and self-review and will submit the latest version patch.


Meanwhile, I found a couple more small issues. One is a missing break
statement in pgstat_get_wait_ipc(), and the other is that
fdwxact_relaunch_resolvers() could return an uninitialized value.
I am attaching a small patch with these changes that can be applied on
top of the existing patches.





Best Regards,
Muhammad Usama
Highgo Software
URL : http://www.highgo.ca

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 28 Apr 2020 at 19:37, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> > Hi Sawada San,
>> >
>> > I have been further reviewing and testing the transaction involving multiple server patches.
>> > Overall the patches are working as expected bar a few important exceptions.
>> > So as discussed over the call I have fixed the issues I found during the testing
>> > and also rebased the patches with the current head of the master branch.
>> > So can you please have a look at the attached updated patches.
>>
>> Thank you for reviewing and updating the patch!
>>
>> >
>> > Below is the list of changes I have made on top of V18 patches.
>> >
>> > 1- In register_fdwxact(), As we are just storing the callback function pointers from
>> > FdwRoutine in fdw_part structure, So I think we can avoid calling
>> > GetFdwRoutineByServerId() in TopMemoryContext.
>> > So I have moved the MemoryContextSwitch to TopMemoryContext after the
>> > GetFdwRoutineByServerId() call.
>>
>> Agreed.
>>
>> >
>> >
>> > 2- If PrepareForeignTransaction functionality is not present in some FDW then
>> > during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
>> > transaction flag if the modified flag is also set for that server. As for the server that has
>> > not done any data modification within the transaction we do not do two-phase commit anyway.
>>
>> Agreed.
>>
>> >
>> > 3- I have moved the foreign_twophase_commit in sample file after
>> > max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers
>> > is 0 and enabling the foreign_twophase_commit produces an error with default
>> > configuration parameter positioning in postgresql.conf
>> > Also, foreign_twophase_commit configuration was missing the comments
>> > about allowed values in the sample config file.
>>
>> Sounds good. Agreed.
>>
>> >
>> > 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
>> > function does not seem to be the correct place. The reason being, even when
>> > is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
>> > to true, we could still end up not using the two-phase commit in the case when some server does
>> > not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
>> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
>> > function after doing the prepare transaction.
>>
>> Agreed.
>>
>> >
>> > 6- In prefer mode, we commit the transaction in single-phase if the server does not support
>> > the two-phase commit. But instead of doing the single-phase commit right away,
>> > IMHO the better way is to wait until all the two-phase transactions are successfully prepared
>> > on servers that support the two-phase. Since an error during a "PREPARE" stage would
>> > rollback the transaction and in that case, we would end up with committed transactions on
>> > the server that lacks the support of the two-phase commit.
>>
>> When an error occurred before the local commit, a 2pc-unsupported
>> server could be rolled back or committed depending on the error
>> timing. On the other hand all 2pc-supported servers are always rolled
>> back when an error occurred before the local commit. Therefore even if
>> we change the order of COMMIT and PREPARE it is still possible that we
>> will end up committing the part of 2pc-unsupported servers while
>> rolling back others including 2pc-supported servers.
>>
>> I guess the motivation of your change is that since errors are likely
>> to happen during executing PREPARE on foreign servers, we can minimize
>> the possibility of rolling back 2pc-unsupported servers by deferring
>> the commit of 2pc-unsupported server as much as possible. Is that
>> right?
>
>
> Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported servers
> after all the PREPAREs are successful is to minimize the chances of partial commits.
> And as you mentioned there will still be chances of getting a partial commit even with
> this approach but the probability of that would be less than what it is with the
> current sequence.
>
>
>>
>>
>> > So I have modified the flow a little bit and instead of doing a one-phase commit right away
>> > the servers that do not support a two-phase commit is added to another list and that list is
>> > processed after once we have successfully prepared all the transactions on two-phase supported
>> > foreign servers. Although this technique is also not bulletproof, still it is better than doing
>> > the one-phase commits before doing the PREPAREs.
>>
>> Hmm the current logic seems complex. Maybe we can just reverse the
>> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
>> modified servers first and then do COMMIT on others?
>
>
> Agreed, seems reasonable.
>>
>>
>> >
>> > Also, I think we can improve on this one by throwing an error even in PREFER
>> > mode if there is more than one server that had data modified within the transaction
>> > and lacks the two-phase commit support.
>> >
>>
>> IIUC the concept of PREFER mode is that the transaction uses 2pc only
>> for 2pc-supported servers. IOW, even if the transaction modifies on a
>> 2pc-unsupported server we can proceed with the commit if in PREFER
>> mode, which cannot if in REQUIRED mode. What is the motivation of your
>> above idea?
>
>
> I was thinking that we could change the behavior of PREFER mode such that we only allow
> to COMMIT the transaction if the transaction needs to do a single-phase commit on one
> server only. That way we can ensure that we would never end up with partial commit.
>

I think it's good to avoid a partial commit by using your idea, but if
we want to avoid a partial commit we can use the 'required' mode, which
requires all participant servers to support 2pc. We throw an error if
the participants include even one 2pc-unsupported server that is
modified within the transaction. Of course, if the only participant is
a single 2pc-unsupported server, it can use 1pc even in the
'required' mode.

> One Idea in this regards would be to switch the local transaction to commit using 2pc
> if there is a total of only one foreign server that does not support the 2pc in the transaction,
> ensuring that 1-pc commit servers should always be less than or equal to 1. and if there are more
> than one foreign server requires 1-pc then we just throw an error.

I might be missing your point but I suppose this idea is to do
something like the following?

1. prepare the local transaction
2. commit the foreign transaction on 2pc-unsupported server
3. commit the prepared local transaction
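The steps above could be sketched as follows (an assumed flow for illustration only; it also shows the 2pc-capable foreign servers being prepared beforehand, which the pre-commit step would have done already, and the names are hypothetical):

```python
def commit_with_one_1pc_server(local, onephase_server, twophase_servers):
    # Returns the ordered actions as strings; illustrative only.
    log = [f"PREPARE {s}" for s in twophase_servers]
    log.append(f"PREPARE {local}")           # 1. prepare the local txn
    log.append(f"COMMIT {onephase_server}")  # 2. single-phase commit on
                                             #    the 2pc-unsupported server
    log.append(f"COMMIT PREPARED {local}")   # 3. commit the prepared
                                             #    local transaction
    log += [f"COMMIT PREPARED {s}" for s in twophase_servers]
    return log

steps = commit_with_one_1pc_server("local", "s_no2pc", ["s1", "s2"])
```

If step 2 fails, the prepared local transaction (and the prepared foreign ones) can still be rolled back, so only the lone 1pc server's outcome is ever in doubt.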

>
> However having said that, I am not 100% sure if its a good or an acceptable Idea, and
> I am okay with continuing with the current behavior of PREFER mode if we put it in the
> document that this mode can cause a partial commit.

There will be three types of servers: (a) a server that doesn't
support any transaction API, (b) a server that supports only the commit
and rollback APIs, and (c) a server that supports all APIs (commit,
rollback and prepare). Currently the postgres transaction manager
manages only server-(b) and server-(c), adding them to
FdwXactParticipants. I'm considering changing the code so that it also
adds server-(a) to FdwXactParticipants, in order to track the number of
server-(a) involved in the transaction, but without inserting an
FdwXact entry for it or managing the transactions on those servers.

The reason is this: if we want to have the 'required' mode strictly
require all participant servers to support 2pc, we should use 2pc when
(# of server-(a) + # of server-(b) + # of server-(c)) >= 2. But since
we currently track modifications on a server-(a) with just a flag, we
cannot handle the case where two server-(a) are modified in the
transaction. On the other hand, if we don't consider server-(a), the
transaction could end up with a partial commit when a server-(a)
participates in the transaction. Therefore I'm thinking of the above
change so that the transaction manager can ensure that a partial
commit doesn't happen in the 'required' mode. What do you think?
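The decision rule being described might look like this (a sketch under the stated assumptions; `requires_twophase` and the per-class counts are illustrative, not the patch's code):

```python
def requires_twophase(n_a, n_b, n_c):
    # n_a: modified servers with no transaction API
    # n_b: modified servers with only commit/rollback APIs
    # n_c: modified servers with commit/rollback/prepare APIs
    total = n_a + n_b + n_c
    if total < 2:
        return False  # a single participant commits atomically with 1pc
    if n_a + n_b > 0:
        # 'required' mode: every one of the >= 2 participants must be
        # able to PREPARE, so any class-(a)/(b) participant is an error.
        raise ValueError("cannot use two-phase commit: a modified "
                         "participant does not support PREPARE")
    return True
```

Tracking class-(a) servers in FdwXactParticipants is what makes `n_a` a real count rather than a single boolean flag, so the two-server-(a) case becomes detectable.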

>
>>
>> > 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
>> > memory if fdw_part is removed from the list
>>
>> I think at the end of the transaction we free entries of
>> FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
>> need to do that in PreCommit_FdwXacts()?
>
>
> Correct me if I am wrong, The fdw_part structures are created in TopMemoryContext
> and if that fdw_part structure is removed from the list at pre_commit stage
> (because we did 1-PC COMMIT on it) then it would leak memory.

The fdw_part structures are created in TopTransactionContext, so they
are freed at the end of the transaction.

>
>>
>> >
>> > 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
>> > (FdwXactParticipants == NIL). The problem with that was in the case of
>> > "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
>> > effectively the foreign prepared transactions(if any) associated with locally
>> > prepared transactions were never getting resolved automatically.
>> >
>> >
>> > postgres=# BEGIN;
>> > BEGIN
>> > INSERT INTO test_local  VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > postgres=*# PREPARE TRANSACTION 'local_prepared';
>> > PREPARE TRANSACTION
>> >
>> > postgres=# select * from pg_foreign_xacts ;
>> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
>> > -------+-----+----------+--------+----------+----------+----------------------------
>> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
>> > (2 rows)
>> >
>> > -- Now commit the prepared transaction
>> >
>> > postgres=# COMMIT PREPARED 'local_prepared';
>> >
>> > COMMIT PREPARED
>> >
>> > --Foreign prepared transactions associated with 'local_prepared' not resolved
>> >
>> > postgres=#
>> >
>> > postgres=# select * from pg_foreign_xacts ;
>> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
>> > -------+-----+----------+--------+----------+----------+----------------------------
>> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
>> > (2 rows)
>> >
>> >
>> > So to fix this in case of the two-phase transaction, the function checks the existence
>> > of associated foreign prepared transactions before bailing out.
>> >
>>
>> Good catch. But looking at your change, we should not accept the case
>> where FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) ==
>> false.
>>
>>        if (FdwXactParticipants == NIL)
>>        {
>>                /*
>>                 * If we are here because of COMMIT/ROLLBACK PREPARED then the
>>                 * FdwXactParticipants list would be empty. So we need to
>>                 * see if there are any foreign prepared transactions exists
>>                 * for this prepared transaction
>>                 */
>>                if (TwoPhaseExists(wait_xid))
>>                {
>>                        List *foreign_trans = NIL;
>>
>>                        foreign_trans = get_fdwxacts(MyDatabaseId,
>> wait_xid, InvalidOid, InvalidOid,
>>                                         false, false, true);
>>
>>                        if (foreign_trans == NIL)
>>                                return;
>>                        list_free(foreign_trans);
>>                }
>>        }
>>
>
> Sorry my bad, its a mistake on my part. we should just return from the function when
> FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) == false.
>
>         if (TwoPhaseExists(wait_xid))
>         {
>             List *foreign_trans = NIL;
>             foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid, InvalidOid, InvalidOid,
>                      false, false, true);
>
>             if (foreign_trans == NIL)
>                 return;
>             list_free(foreign_trans);
>         }
>         else
>             return;
>
>>
>> > 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
>> > that was causing the crash during recovery.
>>
>> Agreed.
>>
>> >
>> > 10- incorporated set_ps_display() signature change.
>>
>> Thanks.
>>
>> Regarding other changes you did in v19 patch, I have some comments:
>>
>> 1.
>> +       ereport(LOG,
>> +                       (errmsg("trying to %s the foreign transaction
>> associated with transaction %u on server %u",
>> +                                       fdwxact->status ==
>> FDWXACT_STATUS_COMMITTING?"COMMIT":"ABORT",
>> +                                       fdwxact->local_xid,
>> fdwxact->serverid)));
>> +
>>
>> Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
>
>
> That change was not intended to get into the patch file. I had done it during testing to
> quickly get info on which way the transaction is going to be resolved.
>
>>
>> 2.
>> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
>> deleted file mode 120000
>> index ce8c21880c..0000000000
>> --- a/src/bin/pg_waldump/fdwxactdesc.c
>> +++ /dev/null
>> @@ -1 +0,0 @@
>> -../../../src/backend/access/rmgrdesc/fdwxactdesc.c
>> \ No newline at end of file
>> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
>> new file mode 100644
>> index 0000000000..ce8c21880c
>> --- /dev/null
>> +++ b/src/bin/pg_waldump/fdwxactdesc.c
>> @@ -0,0 +1 @@
>> +../../../src/backend/access/rmgrdesc/fdwxactdesc.c
>>
>> We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
>
>
> Again sorry! that was an oversight on my part.
>
>>
>> 3.
>> --- a/doc/src/sgml/monitoring.sgml
>> +++ b/doc/src/sgml/monitoring.sgml
>> @@ -1526,14 +1526,14 @@ postgres   27093  0.0  0.0  30096  2752 ?
>>   Ss   11:34   0:00 postgres: ser
>>           <entry><literal>SafeSnapshot</literal></entry>
>>           <entry>Waiting for a snapshot for a <literal>READ ONLY
>> DEFERRABLE</literal> transaction.</entry>
>>          </row>
>> -        <row>
>> -         <entry><literal>SyncRep</literal></entry>
>> -         <entry>Waiting for confirmation from remote server during
>> synchronous replication.</entry>
>> -        </row>
>>          <row>
>>           <entry><literal>FdwXactResolution</literal></entry>
>>           <entry>Waiting for all foreign transaction participants to
>> be resolved during atomic commit among foreign servers.</entry>
>>          </row>
>> +        <row>
>> +         <entry><literal>SyncRep</literal></entry>
>> +         <entry>Waiting for confirmation from remote server during
>> synchronous replication.</entry>
>> +        </row>
>>          <row>
>>           <entry morerows="4"><literal>Timeout</literal></entry>
>>           <entry><literal>BaseBackupThrottle</literal></entry>
>>
>> We need to move the entry of FdwXactResolution to right before
>> Hash/Batch/Allocating for alphabetical order.
>
>
> Agreed!
>>
>>
>> I've incorporated your changes I agreed with to my local branch and
>> will incorporate other changes after discussion. I'll also do more
>> test and self-review and will submit the latest version patch.
>>
>
> Meanwhile, I found a couple more small issues. One is a missing break
> statement in pgstat_get_wait_ipc(), and secondly
> fdwxact_relaunch_resolvers() could return an uninitialized value.
> I am attaching a small patch for these changes that can be applied on top of existing
> patches.

Thank you for the patch!

I'm updating the patches because the current behavior in the error
case would not be good. For example, when an error occurs in the
prepare phase, prepared transactions are left as in-doubt
transactions, and these transactions are not handled by the resolver
process. That means a user could need to resolve these transactions
manually after every abort, which is not good. In the abort case, I
think the prepared transactions can be resolved by the backend itself,
rather than being left for the resolver. I'll submit the updated
patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 30 Apr 2020 at 20:43, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 28 Apr 2020 at 19:37, Muhammad Usama <m.usama@gmail.com> wrote:
> >
> >
> >
> > On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>
> >> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
> >> >
> >> > Hi Sawada San,
> >> >
> >> > I have been further reviewing and testing the transaction involving multiple server patches.
> >> > Overall the patches are working as expected bar a few important exceptions.
> >> > So as discussed over the call I have fixed the issues I found during the testing
> >> > and also rebased the patches with the current head of the master branch.
> >> > So can you please have a look at the attached updated patches.
> >>
> >> Thank you for reviewing and updating the patch!
> >>
> >> >
> >> > Below is the list of changes I have made on top of V18 patches.
> >> >
> >> > 1- In register_fdwxact(), as we are just storing the callback function pointers from
> >> > FdwRoutine in the fdw_part structure, I think we can avoid calling
> >> > GetFdwRoutineByServerId() in TopMemoryContext.
> >> > So I have moved the MemoryContextSwitchTo(TopMemoryContext) to after the
> >> > GetFdwRoutineByServerId() call.
> >>
> >> Agreed.
> >>
> >> >
> >> >
> >> > 2- If PrepareForeignTransaction functionality is not present in some FDW then
> >> > during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> >> > transaction flag if the modified flag is also set for that server. As for the server that has
> >> > not done any data modification within the transaction we do not do two-phase commit anyway.
> >>
> >> Agreed.
> >>
> >> >
> >> > 3- I have moved foreign_twophase_commit in the sample file to after
> >> > max_foreign_transaction_resolvers, because the default value of max_foreign_transaction_resolvers
> >> > is 0 and enabling foreign_twophase_commit produces an error with the default
> >> > configuration parameter positioning in postgresql.conf.
> >> > Also, the foreign_twophase_commit configuration was missing the comments
> >> > about allowed values in the sample config file.
> >>
> >> Sounds good. Agreed.
> >>
> >> >
> >> > 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> >> > function does not seem to be the correct place. The reason being, even when
> >> > is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> >> > to true, we could still end up not using the two-phase commit in the case when some server does
> >> > not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> >> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> >> > function after doing the prepare transaction.
> >>
> >> Agreed.
> >>
> >> >
> >> > 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> >> > the two-phase commit. But instead of doing the single-phase commit right away,
> >> > IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> >> > on servers that support the two-phase. Since an error during a "PREPARE" stage would
> >> > rollback the transaction and in that case, we would end up with committed transactions on
> >> > the server that lacks the support of the two-phase commit.
> >>
> >> When an error occurred before the local commit, a 2pc-unsupported
> >> server could be rolled back or committed depending on the error
> >> timing. On the other hand all 2pc-supported servers are always rolled
> >> back when an error occurred before the local commit. Therefore even if
> >> we change the order of COMMIT and PREPARE it is still possible that we
> >> will end up committing the part of 2pc-unsupported servers while
> >> rolling back others including 2pc-supported servers.
> >>
> >> I guess the motivation of your change is that since errors are likely
> >> to happen during executing PREPARE on foreign servers, we can minimize
> >> the possibility of rolling back 2pc-unsupported servers by deferring
> >> the commit of 2pc-unsupported server as much as possible. Is that
> >> right?
> >
> >
> > Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported servers
> > after all the PREPAREs are successful is to minimize the chances of partial commits.
> > And as you mentioned there will still be chances of getting a partial commit even with
> > this approach but the probability of that would be less than what it is with the
> > current sequence.
> >
> >
> >>
> >>
> >> > So I have modified the flow a little bit: instead of doing a one-phase commit right away,
> >> > the servers that do not support two-phase commit are added to another list, and that list is
> >> > processed once we have successfully prepared all the transactions on the two-phase-supported
> >> > foreign servers. Although this technique is also not bulletproof, it is still better than doing
> >> > the one-phase commits before doing the PREPAREs.
> >>
> >> Hmm the current logic seems complex. Maybe we can just reverse the
> >> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
> >> modified servers first and then do COMMIT on others?
> >
> >
> > Agreed, seems reasonable.
> >>
> >>
> >> >
> >> > Also, I think we can improve on this one by throwing an error even in PREFER
> >> > mode if there is more than one server that had data modified within the transaction
> >> > and lacks the two-phase commit support.
> >> >
> >>
> >> IIUC the concept of PREFER mode is that the transaction uses 2pc only
> >> for 2pc-supported servers. IOW, even if the transaction modifies data on a
> >> 2pc-unsupported server we can proceed with the commit in PREFER
> >> mode, which we cannot do in REQUIRED mode. What is the motivation of your
> >> above idea?
> >
> >
> > I was thinking that we could change the behavior of PREFER mode such that we only allow
> > to COMMIT the transaction if the transaction needs to do a single-phase commit on one
> > server only. That way we can ensure that we would never end up with partial commit.
> >
>
> I think it's good to avoid a partial commit by using your idea but if
> we want to avoid a partial commit we can use the 'required' mode,
> which requires all participant servers to support 2pc. We throw an
> error if the participant servers include even one 2pc-unsupported
> server that is modified within the transaction. Of course, if the only
> participant is a single 2pc-unsupported server, it can use 1pc even in
> the 'required' mode.
>
> > One idea in this regard would be to switch the local transaction to commit using 2pc
> > if there is a total of only one foreign server that does not support 2pc in the transaction,
> > ensuring that the number of 1pc-commit servers is always less than or equal to 1, and if more
> > than one foreign server requires 1pc then we just throw an error.
>
> I might be missing your point but I suppose this idea is to do
> something like the following?
>
> 1. prepare the local transaction
> 2. commit the foreign transaction on 2pc-unsupported server
> 3. commit the prepared local transaction
>
> >
> > However, having said that, I am not 100% sure if it's a good or an acceptable idea, and
> > I am okay with continuing with the current behavior of PREFER mode if we put it in the
> > documentation that this mode can cause a partial commit.
>
> There will be three types of servers: (a) a server that doesn't support any
> transaction API, (b) a server that supports only the commit and rollback APIs,
> and (c) a server that supports all APIs (commit, rollback and prepare).
> Currently the postgres transaction manager manages only server-(b) and
> server-(c), adding them to FdwXactParticipants. I'm considering changing
> the code so that it also adds server-(a) to FdwXactParticipants, in
> order to track the number of server-(a) participants involved in the transaction.
> But it doesn't insert an FdwXact entry for them, nor manage transactions on
> these servers.
>
> The reason is this; if we want to have the 'required' mode strictly
> require all participant servers to support 2pc, we should use 2pc when
> (# of server-(a) + # of server-(b) + # of server-(c)) >= 2. But since
> currently we just track the modification on a server-(a) by a flag we
> cannot handle the case where two server-(a) are modified in the
> transaction. On the other hand, if we don't consider server-(a) the
> transaction could end up with a partial commit when a server-(a)
> participates in the transaction. Therefore I'm thinking of the above
> change so that the transaction manager can ensure that a partial
> commit doesn't happen in the 'required' mode. What do you think?
>
> >
> >>
> >> > 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> >> > memory if fdw_part is removed from the list
> >>
> >> I think at the end of the transaction we free entries of
> >> FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
> >> need to do that in PreCommit_FdwXacts()?
> >
> >
> > Correct me if I am wrong: the fdw_part structures are created in TopMemoryContext,
> > and if an fdw_part structure is removed from the list at the pre-commit stage
> > (because we did a 1PC COMMIT on it) then it would leak memory.
>
> The fdw_part structures are created in TopTransactionContext so these
> are freed at the end of the transaction.
>
> >
> >>
> >> >
> >> > 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
> >> > (FdwXactParticipants == NIL). The problem with that was in the case of
> >> > "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
> >> > effectively the foreign prepared transactions(if any) associated with locally
> >> > prepared transactions were never getting resolved automatically.
> >> >
> >> >
> >> > postgres=# BEGIN;
> >> > BEGIN
> >> > INSERT INTO test_local  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > postgres=*# PREPARE TRANSACTION 'local_prepared';
> >> > PREPARE TRANSACTION
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
> >> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> > -- Now commit the prepared transaction
> >> >
> >> > postgres=# COMMIT PREPARED 'local_prepared';
> >> >
> >> > COMMIT PREPARED
> >> >
> >> > --Foreign prepared transactions associated with 'local_prepared' not resolved
> >> >
> >> > postgres=#
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
> >> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> >
> >> > So to fix this in case of the two-phase transaction, the function checks the existence
> >> > of associated foreign prepared transactions before bailing out.
> >> >
> >>
> >> Good catch. But looking at your change, we should not accept the case
> >> where FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) ==
> >> false.
> >>
> >>        if (FdwXactParticipants == NIL)
> >>        {
> >>                /*
> >>                 * If we are here because of COMMIT/ROLLBACK PREPARED then the
> >>                 * FdwXactParticipants list would be empty. So we need to
> >>                 * see whether any foreign prepared transactions exist
> >>                 * for this prepared transaction.
> >>                 */
> >>                if (TwoPhaseExists(wait_xid))
> >>                {
> >>                        List *foreign_trans = NIL;
> >>
> >>                        foreign_trans = get_fdwxacts(MyDatabaseId,
> >> wait_xid, InvalidOid, InvalidOid,
> >>                                         false, false, true);
> >>
> >>                        if (foreign_trans == NIL)
> >>                                return;
> >>                        list_free(foreign_trans);
> >>                }
> >>        }
> >>
> >
> > Sorry, my bad, it's a mistake on my part. We should just return from the function when
> > FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) == false.
> >
> >         if (TwoPhaseExists(wait_xid))
> >         {
> >             List *foreign_trans = NIL;
> >             foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid, InvalidOid, InvalidOid,
> >                      false, false, true);
> >
> >             if (foreign_trans == NIL)
> >                 return;
> >             list_free(foreign_trans);
> >         }
> >         else
> >             return;
> >
> >>
> >> > 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> >> > that was causing the crash during recovery.
> >>
> >> Agreed.
> >>
> >> >
> >> > 10- incorporated set_ps_display() signature change.
> >>
> >> Thanks.
> >>
> >> Regarding other changes you did in v19 patch, I have some comments:
> >>
> >> 1.
> >> +       ereport(LOG,
> >> +                       (errmsg("trying to %s the foreign transaction
> >> associated with transaction %u on server %u",
> >> +                                       fdwxact->status ==
> >> FDWXACT_STATUS_COMMITTING?"COMMIT":"ABORT",
> >> +                                       fdwxact->local_xid,
> >> fdwxact->serverid)));
> >> +
> >>
> >> Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
> >
> >
> > That change was not intended to get into the patch file. I had done it during testing to
> > quickly get info on which way the transaction is going to be resolved.
> >
> >>
> >> 2.
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> deleted file mode 120000
> >> index ce8c21880c..0000000000
> >> --- a/src/bin/pg_waldump/fdwxactdesc.c
> >> +++ /dev/null
> >> @@ -1 +0,0 @@
> >> -../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >> \ No newline at end of file
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> new file mode 100644
> >> index 0000000000..ce8c21880c
> >> --- /dev/null
> >> +++ b/src/bin/pg_waldump/fdwxactdesc.c
> >> @@ -0,0 +1 @@
> >> +../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >>
> >> We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
> >
> >
> > Again sorry! that was an oversight on my part.
> >
> >>
> >> 3.
> >> --- a/doc/src/sgml/monitoring.sgml
> >> +++ b/doc/src/sgml/monitoring.sgml
> >> @@ -1526,14 +1526,14 @@ postgres   27093  0.0  0.0  30096  2752 ?
> >>   Ss   11:34   0:00 postgres: ser
> >>           <entry><literal>SafeSnapshot</literal></entry>
> >>           <entry>Waiting for a snapshot for a <literal>READ ONLY
> >> DEFERRABLE</literal> transaction.</entry>
> >>          </row>
> >> -        <row>
> >> -         <entry><literal>SyncRep</literal></entry>
> >> -         <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> -        </row>
> >>          <row>
> >>           <entry><literal>FdwXactResolution</literal></entry>
> >>           <entry>Waiting for all foreign transaction participants to
> >> be resolved during atomic commit among foreign servers.</entry>
> >>          </row>
> >> +        <row>
> >> +         <entry><literal>SyncRep</literal></entry>
> >> +         <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> +        </row>
> >>          <row>
> >>           <entry morerows="4"><literal>Timeout</literal></entry>
> >>           <entry><literal>BaseBackupThrottle</literal></entry>
> >>
> >> We need to move the entry of FdwXactResolution to right before
> >> Hash/Batch/Allocating for alphabetical order.
> >
> >
> > Agreed!
> >>
> >>
> >> I've incorporated your changes I agreed with to my local branch and
> >> will incorporate other changes after discussion. I'll also do more
> >> test and self-review and will submit the latest version patch.
> >>
> >
> > Meanwhile, I found a couple more small issues. One is a missing break
> > statement in pgstat_get_wait_ipc(), and secondly
> > fdwxact_relaunch_resolvers() could return an uninitialized value.
> > I am attaching a small patch for these changes that can be applied on top of existing
> > patches.
>
> Thank you for the patch!
>
> I'm updating the patches because the current behavior in the error
> case would not be good. For example, when an error occurs in the
> prepare phase, prepared transactions are left as in-doubt
> transactions, and these transactions are not handled by the resolver
> process. That means a user could need to resolve these transactions
> manually after every abort, which is not good. In the abort case, I
> think the prepared transactions can be resolved by the backend itself,
> rather than being left for the resolver. I'll submit the updated patch.
>

I've attached the latest version patch set which includes some changes
from the previous version:

* I've added regression tests that cover all types of FDW
implementations. There are three types of FDW: one that doesn't
support any transaction APIs, one that supports only the commit and
rollback APIs, and one that supports all APIs (prepare, commit and
rollback). src/test/modules/test_fdwxact contains those FDW
implementations for tests, and tests some cases where a transaction
reads/writes data on various types of foreign servers.
* Also, test_fdwxact has TAP tests that check failure cases. The test
FDW implementation has the ability to inject an error or panic into
the prepare or commit phase. Using it, the TAP tests check whether
distributed transactions can be committed or rolled back even in
failure cases.
* When foreign_twophase_commit = 'required', the transaction commit
fails if the transaction modified data on even one server that doesn't
support the prepare API. Previously we ignored servers that don't
support any transaction API, but now we check them too, to strictly
require all involved foreign servers to support all transaction APIs.
* The transaction resolver process now resolves in-doubt transactions
automatically.
* Incorporated comments from Muhammad Usama.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Tue, May 12, 2020 at 11:45 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Thu, 30 Apr 2020 at 20:43, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 28 Apr 2020 at 19:37, Muhammad Usama <m.usama@gmail.com> wrote:
> >
> >
> >
> > On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>
> >> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
> >> >
> >> > Hi Sawada San,
> >> >
> >> > I have been further reviewing and testing the transaction involving multiple server patches.
> >> > Overall the patches are working as expected bar a few important exceptions.
> >> > So as discussed over the call I have fixed the issues I found during the testing
> >> > and also rebased the patches with the current head of the master branch.
> >> > So can you please have a look at the attached updated patches.
> >>
> >> Thank you for reviewing and updating the patch!
> >>
> >> >
> >> > Below is the list of changes I have made on top of V18 patches.
> >> >
> >> > 1- In register_fdwxact(), as we are just storing the callback function pointers from
> >> > FdwRoutine in the fdw_part structure, I think we can avoid calling
> >> > GetFdwRoutineByServerId() in TopMemoryContext.
> >> > So I have moved the MemoryContextSwitchTo(TopMemoryContext) to after the
> >> > GetFdwRoutineByServerId() call.
> >>
> >> Agreed.
> >>
> >> >
> >> >
> >> > 2- If PrepareForeignTransaction functionality is not present in some FDW then
> >> > during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> >> > transaction flag if the modified flag is also set for that server. As for the server that has
> >> > not done any data modification within the transaction we do not do two-phase commit anyway.
> >>
> >> Agreed.
> >>
> >> >
> >> > 3- I have moved foreign_twophase_commit in the sample file to after
> >> > max_foreign_transaction_resolvers, because the default value of max_foreign_transaction_resolvers
> >> > is 0 and enabling foreign_twophase_commit produces an error with the default
> >> > configuration parameter positioning in postgresql.conf.
> >> > Also, the foreign_twophase_commit configuration was missing the comments
> >> > about allowed values in the sample config file.
> >>
> >> Sounds good. Agreed.
> >>
> >> >
> >> > 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> >> > function does not seem to be the correct place. The reason being, even when
> >> > is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> >> > to true, we could still end up not using the two-phase commit in the case when some server does
> >> > not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> >> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> >> > function after doing the prepare transaction.
> >>
> >> Agreed.
> >>
> >> >
> >> > 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> >> > the two-phase commit. But instead of doing the single-phase commit right away,
> >> > IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> >> > on servers that support the two-phase. Since an error during a "PREPARE" stage would
> >> > rollback the transaction and in that case, we would end up with committed transactions on
> >> > the server that lacks the support of the two-phase commit.
> >>
> >> When an error occurred before the local commit, a 2pc-unsupported
> >> server could be rolled back or committed depending on the error
> >> timing. On the other hand all 2pc-supported servers are always rolled
> >> back when an error occurred before the local commit. Therefore even if
> >> we change the order of COMMIT and PREPARE it is still possible that we
> >> will end up committing the part of 2pc-unsupported servers while
> >> rolling back others including 2pc-supported servers.
> >>
> >> I guess the motivation of your change is that since errors are likely
> >> to happen during executing PREPARE on foreign servers, we can minimize
> >> the possibility of rolling back 2pc-unsupported servers by deferring
> >> the commit of 2pc-unsupported server as much as possible. Is that
> >> right?
> >
> >
> > Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported servers
> > after all the PREPAREs are successful is to minimize the chances of partial commits.
> > And as you mentioned there will still be chances of getting a partial commit even with
> > this approach but the probability of that would be less than what it is with the
> > current sequence.
> >
> >
> >>
> >>
> >> > So I have modified the flow a little bit: instead of doing a one-phase commit right away,
> >> > the servers that do not support two-phase commit are added to another list, and that list is
> >> > processed once we have successfully prepared all the transactions on the two-phase-supported
> >> > foreign servers. Although this technique is also not bulletproof, it is still better than doing
> >> > the one-phase commits before doing the PREPAREs.
> >>
> >> Hmm the current logic seems complex. Maybe we can just reverse the
> >> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
> >> modified servers first and then do COMMIT on others?
> >
> >
> > Agreed, seems reasonable.
> >>
> >>
> >> >
> >> > Also, I think we can improve on this one by throwing an error even in PREFER
> >> > mode if there is more than one server that had data modified within the transaction
> >> > and lacks the two-phase commit support.
> >> >
> >>
> >> IIUC the concept of PREFER mode is that the transaction uses 2pc only
> >> for 2pc-supported servers. IOW, even if the transaction modifies data on a
> >> 2pc-unsupported server we can proceed with the commit in PREFER
> >> mode, which we cannot do in REQUIRED mode. What is the motivation of your
> >> above idea?
> >
> >
> > I was thinking that we could change the behavior of PREFER mode such that we only allow
> > to COMMIT the transaction if the transaction needs to do a single-phase commit on one
> > server only. That way we can ensure that we would never end up with partial commit.
> >
>
> I think it's good to avoid a partial commit by using your idea but if
> we want to avoid a partial commit we can use the 'required' mode,
> which requires all participant servers to support 2pc. We throw an
> error if the participant servers include even one 2pc-unsupported
> server that is modified within the transaction. Of course, if the only
> participant is a single 2pc-unsupported server, it can use 1pc even in
> the 'required' mode.
>
> > One idea in this regard would be to switch the local transaction to commit using 2pc
> > if there is a total of only one foreign server that does not support 2pc in the transaction,
> > ensuring that the number of 1pc-commit servers is always less than or equal to 1, and if more
> > than one foreign server requires 1pc then we just throw an error.
>
> I might be missing your point but I suppose this idea is to do
> something like the following?
>
> 1. prepare the local transaction
> 2. commit the foreign transaction on 2pc-unsupported server
> 3. commit the prepared local transaction
>
> >
> > However, having said that, I am not 100% sure if it's a good or an acceptable idea, and
> > I am okay with continuing with the current behavior of PREFER mode if we put it in the
> > documentation that this mode can cause a partial commit.
>
> There will be three types of servers: (a) a server that doesn't support any
> transaction API, (b) a server that supports only the commit and rollback APIs,
> and (c) a server that supports all APIs (commit, rollback and prepare).
> Currently the postgres transaction manager manages only server-(b) and
> server-(c), adding them to FdwXactParticipants. I'm considering changing
> the code so that it also adds server-(a) to FdwXactParticipants, in
> order to track the number of server-(a) participants involved in the transaction.
> But it doesn't insert an FdwXact entry for them, nor manage transactions on
> these servers.
>
> The reason is this: if we want to have the 'required' mode strictly
> require all participant servers to support 2pc, we should use 2pc when
> (# of server-(a) + # of server-(b) + # of server-(c)) >= 2. But since
> currently we just track modifications on server-(a) with a flag, we
> cannot handle the case where two server-(a) participants are modified
> in the transaction. On the other hand, if we don't consider server-(a),
> the transaction could end up with a partial commit when a server-(a)
> participates in the transaction. Therefore I'm thinking of the above
> change so that the transaction manager can ensure that a partial
> commit doesn't happen in the 'required' mode. What do you think?
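For concreteness, the 'required' mode check being described can be sketched as follows. This is a hypothetical illustration under the proposed change (type and function names are mine, not the patch's): server-(a) participants are counted even though no FdwXact entry is created for them.

```c
#include <stdbool.h>

/* Counts of participants by server type, following the (a)/(b)/(c)
 * classification above.  Illustrative only. */
typedef struct ParticipantCounts
{
    int n_no_api;       /* (a) no transaction API at all */
    int n_commit_only;  /* (b) commit/rollback only */
    int n_full;         /* (c) prepare, commit and rollback */
} ParticipantCounts;

static bool
required_mode_can_commit(ParticipantCounts c)
{
    int total = c.n_no_api + c.n_commit_only + c.n_full;

    if (total <= 1)
        return true;    /* a single participant: a plain 1PC commit is atomic */

    /* With two or more participants, 'required' insists that every
     * participant supports prepare; otherwise the commit must fail. */
    return (c.n_no_api == 0 && c.n_commit_only == 0);
}
```

Note that without counting server-(a) in `total`, a transaction touching two server-(a) participants would slip past the check, which is exactly the partial-commit hazard described above.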
>
> >
> >>
> >> > 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> >> > memory if fdw_part is removed from the list
> >>
> >> I think at the end of the transaction we free entries of
> >> FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
> >> need to do that in PreCommit_FdwXacts()?
> >
> >
> > Correct me if I am wrong: the fdw_part structures are created in TopMemoryContext,
> > and if an fdw_part structure is removed from the list at the pre-commit stage
> > (because we did a 1PC COMMIT on it) then it would leak memory.
>
> The fdw_part structures are created in TopTransactionContext so these
> are freed at the end of the transaction.
>
> >
> >>
> >> >
> >> > 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it found
> >> > (FdwXactParticipants == NIL). The problem with that was that in the case of
> >> > "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL, and
> >> > effectively the foreign prepared transactions (if any) associated with locally
> >> > prepared transactions were never getting resolved automatically.
> >> >
> >> >
> >> > postgres=# BEGIN;
> >> > BEGIN
> >> > INSERT INTO test_local  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s1  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s2  VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > postgres=*# PREPARE TRANSACTION 'local_prepared';
> >> > PREPARE TRANSACTION
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
> >> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> > -- Now commit the prepared transaction
> >> >
> >> > postgres=# COMMIT PREPARED 'local_prepared';
> >> >
> >> > COMMIT PREPARED
> >> >
> >> > --Foreign prepared transactions associated with 'local_prepared' not resolved
> >> >
> >> > postgres=#
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
> >> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> >
> >> > So to fix this in case of the two-phase transaction, the function checks the existence
> >> > of associated foreign prepared transactions before bailing out.
> >> >
> >>
> >> Good catch. But looking at your change, we should not accept the case
> >> where FdwXactParticipants == NIL but TwoPhaseExists(wait_xid) ==
> >> false.
> >>
> >>        if (FdwXactParticipants == NIL)
> >>        {
> >>                /*
> >>                 * If we are here because of COMMIT/ROLLBACK PREPARED then the
> >>                 * FdwXactParticipants list would be empty. So we need to
> >>                 * see if there are any foreign prepared transactions exists
> >>                 * for this prepared transaction
> >>                 */
> >>                if (TwoPhaseExists(wait_xid))
> >>                {
> >>                        List *foreign_trans = NIL;
> >>
> >>                        foreign_trans = get_fdwxacts(MyDatabaseId,
> >> wait_xid, InvalidOid, InvalidOid,
> >>                                         false, false, true);
> >>
> >>                        if (foreign_trans == NIL)
> >>                                return;
> >>                        list_free(foreign_trans);
> >>                }
> >>        }
> >>
> >
> > Sorry, my bad, it's a mistake on my part. We should just return from the function when
> > FdwXactParticipants == NIL but TwoPhaseExists(wait_xid) == false.
> >
> >         if (TwoPhaseExists(wait_xid))
> >         {
> >             List *foreign_trans = NIL;
> >             foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid, InvalidOid, InvalidOid,
> >                      false, false, true);
> >
> >             if (foreign_trans == NIL)
> >                 return;
> >             list_free(foreign_trans);
> >         }
> >         else
> >             return;
> >
> >>
> >> > 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> >> > that was causing the crash during recovery.
> >>
> >> Agreed.
> >>
> >> >
> >> > 10- incorporated set_ps_display() signature change.
> >>
> >> Thanks.
> >>
> >> Regarding other changes you did in v19 patch, I have some comments:
> >>
> >> 1.
> >> +       ereport(LOG,
> >> +                       (errmsg("trying to %s the foreign transaction
> >> associated with transaction %u on server %u",
> >> +                                       fdwxact->status ==
> >> FDWXACT_STATUS_COMMITTING?"COMMIT":"ABORT",
> >> +                                       fdwxact->local_xid,
> >> fdwxact->serverid)));
> >> +
> >>
> >> Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
> >
> >
> > That change was not intended to get into the patch file. I had done it during testing to
> > quickly get info on which way the transaction is going to be resolved.
> >
> >>
> >> 2.
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> deleted file mode 120000
> >> index ce8c21880c..0000000000
> >> --- a/src/bin/pg_waldump/fdwxactdesc.c
> >> +++ /dev/null
> >> @@ -1 +0,0 @@
> >> -../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >> \ No newline at end of file
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> new file mode 100644
> >> index 0000000000..ce8c21880c
> >> --- /dev/null
> >> +++ b/src/bin/pg_waldump/fdwxactdesc.c
> >> @@ -0,0 +1 @@
> >> +../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >>
> >> We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
> >
> >
> > Again, sorry! That was an oversight on my part.
> >
> >>
> >> 3.
> >> --- a/doc/src/sgml/monitoring.sgml
> >> +++ b/doc/src/sgml/monitoring.sgml
> >> @@ -1526,14 +1526,14 @@ postgres   27093  0.0  0.0  30096  2752 ?
> >>   Ss   11:34   0:00 postgres: ser
> >>           <entry><literal>SafeSnapshot</literal></entry>
> >>           <entry>Waiting for a snapshot for a <literal>READ ONLY
> >> DEFERRABLE</literal> transaction.</entry>
> >>          </row>
> >> -        <row>
> >> -         <entry><literal>SyncRep</literal></entry>
> >> -         <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> -        </row>
> >>          <row>
> >>           <entry><literal>FdwXactResolution</literal></entry>
> >>           <entry>Waiting for all foreign transaction participants to
> >> be resolved during atomic commit among foreign servers.</entry>
> >>          </row>
> >> +        <row>
> >> +         <entry><literal>SyncRep</literal></entry>
> >> +         <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> +        </row>
> >>          <row>
> >>           <entry morerows="4"><literal>Timeout</literal></entry>
> >>           <entry><literal>BaseBackupThrottle</literal></entry>
> >>
> >> We need to move the entry of FdwXactResolution to right before
> >> Hash/Batch/Allocating for alphabetical order.
> >
> >
> > Agreed!
> >>
> >>
> >> I've incorporated your changes I agreed with to my local branch and
> >> will incorporate other changes after discussion. I'll also do more
> >> test and self-review and will submit the latest version patch.
> >>
> >
> > Meanwhile, I found a couple more small issues. One is a missing break statement
> > in pgstat_get_wait_ipc(), and secondly fdwxact_relaunch_resolvers()
> > could return an uninitialized value.
> > I am attaching a small patch with these changes that can be applied on top of the
> > existing patches.
>
> Thank you for the patch!
>
> I'm updating the patches because the current behavior in error cases
> would not be good. For example, when an error occurs in the prepare
> phase, prepared transactions are left as in-doubt transactions, and
> these transactions are not handled by the resolver process. That means
> a user could need to resolve these transactions manually after every
> abort, which is not good. In the abort case, I think that prepared
> transactions can be resolved by the backend itself, rather than
> leaving them for the resolver. I'll submit the updated patch.
>

I've attached the latest version patch set which includes some changes
from the previous version:

* I've added regression tests that cover all types of FDW
implementations. There are three types of FDW: an FDW that doesn't
support any transaction APIs, an FDW that supports only the commit and
rollback APIs, and an FDW that supports all APIs (prepare, commit and
rollback). src/test/module/test_fdwxact contains those FDW
implementations for the tests, and tests some cases where a transaction
reads/writes data on various types of foreign servers.
* Also, test_fdwxact has TAP tests that check failure cases. The test
FDW implementation has the ability to inject an error or panic into the
prepare or commit phase. Using it, the TAP tests check whether
distributed transactions can be committed or rolled back even in
failure cases.
* When foreign_twophase_commit = 'required', the transaction commit
fails if the transaction modified data on even one server that doesn't
support the prepare API. Previously we ignored servers that don't
support any transaction API, but now we check them too, to strictly
require all involved foreign servers to support all transaction APIs.
* The transaction resolver process now resolves in-doubt transactions
automatically.
* Incorporated comments from Muhammad Usama.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Hi Sawada,

I have just done some review and testing of the patches and have
a couple of comments.

1- IMHO, PREPARE TRANSACTION should always use 2PC, even
when the transaction has operated on a single foreign server, regardless
of the foreign_twophase_commit setting, and should throw an error when
2PC is not available on any of the data-modified servers.

For example, consider the case

BEGIN;
INSERT INTO ft_2pc_1 VALUES(1);
PREPARE TRANSACTION 'global_x1';

Here, since we are preparing the local transaction, we should also prepare
the transaction on the foreign server, even if the transaction has modified
only one foreign table.

What do you think?

Also without this change, the above test case produces an assertion failure
with your patches.

2- When deciding whether two-phase commit is required in
FOREIGN_TWOPHASE_COMMIT_PREFER mode, we should use
2PC when we have at least one server capable of doing it.

i.e

For FOREIGN_TWOPHASE_COMMIT_PREFER case in
checkForeignTwophaseCommitRequired() function I think
the condition should be

need_twophase_commit = (nserverstwophase >= 1);
instead of
need_twophase_commit = (nserverstwophase >= 2);

I am attaching a patch that I have generated on top of your V20
patches with these two modifications along with the related test case.



Best regards!
--
...
Muhammad Usama
Highgo Software (Canada/China/Pakistan) 
ADDR: 10318 WHALLEY BLVD, Surrey, BC 

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
> Hi Sawada,
>
> I have just done some review and testing of the patches and have
> a couple of comments.

Thank you for reviewing!

>
> 1- IMHO the PREPARE TRANSACTION should always use 2PC even
> when the transaction has operated on a single foreign server regardless
> of foreign_twophase_commit setting, and throw an error otherwise when
> 2PC is not available on any of the data-modified servers.
>
> For example, consider the case
>
> BEGIN;
> INSERT INTO ft_2pc_1 VALUES(1);
> PREPARE TRANSACTION 'global_x1';
>
> Here since we are preparing the local transaction so we should also prepare
> the transaction on the foreign server even if the transaction has modified only
> one foreign table.
>
> What do you think?

Good catch, and I agree with you. The transaction should fail if it
opened a transaction on a server that doesn't support 2pc, regardless
of foreign_twophase_commit. And I think we should prepare a transaction
on a foreign server even if it didn't modify any data there.

>
> Also without this change, the above test case produces an assertion failure
> with your patches.
>
> 2- when deciding if the two-phase commit is required or not in
> FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
> 2PC when we have at least one server capable of doing that.
>
> i.e
>
> For FOREIGN_TWOPHASE_COMMIT_PREFER case in
> checkForeignTwophaseCommitRequired() function I think
> the condition should be
>
> need_twophase_commit = (nserverstwophase >= 1);
> instead of
> need_twophase_commit = (nserverstwophase >= 2);
>

Hmm I might be missing your point but it seems to me that you want to
use two-phase commit even in the case where a transaction modified
data on only one server. Can't we commit distributed transaction
atomically even using one-phase commit in that case?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
> Hi Sawada,
>
> I have just done some review and testing of the patches and have
> a couple of comments.

Thank you for reviewing!

>
> 1- IMHO the PREPARE TRANSACTION should always use 2PC even
> when the transaction has operated on a single foreign server regardless
> of foreign_twophase_commit setting, and throw an error otherwise when
> 2PC is not available on any of the data-modified servers.
>
> For example, consider the case
>
> BEGIN;
> INSERT INTO ft_2pc_1 VALUES(1);
> PREPARE TRANSACTION 'global_x1';
>
> Here since we are preparing the local transaction so we should also prepare
> the transaction on the foreign server even if the transaction has modified only
> one foreign table.
>
> What do you think?

Good catch and I agree with you. The transaction should fail if it
opened a transaction on a 2pc-no-support server regardless of
foreign_twophase_commit. And I think we should prepare a transaction
on a foreign server even if it didn't modify any data on that.

>
> Also without this change, the above test case produces an assertion failure
> with your patches.
>
> 2- when deciding if the two-phase commit is required or not in
> FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
> 2PC when we have at least one server capable of doing that.
>
> i.e
>
> For FOREIGN_TWOPHASE_COMMIT_PREFER case in
> checkForeignTwophaseCommitRequired() function I think
> the condition should be
>
> need_twophase_commit = (nserverstwophase >= 1);
> instead of
> need_twophase_commit = (nserverstwophase >= 2);
>

Hmm I might be missing your point but it seems to me that you want to
use two-phase commit even in the case where a transaction modified
data on only one server. Can't we commit distributed transaction
atomically even using one-phase commit in that case?

 
I think you are confusing nserverstwophase with nserverswritten.

need_twophase_commit = (nserverstwophase >= 1)  would mean:
use two-phase commit if at least one server in the list is
capable of doing 2PC.

For the case when the transaction modified data on only one server, we
already exit the function, indicating no two-phase commit is required:

    if (nserverswritten <= 1)
      return false;
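The decision being debated can be condensed into one small function. The following is an illustrative sketch only (the function shape and the boolean parameter are mine; the variable names follow the thread), parameterized so that the original threshold and Usama's proposed threshold can be compared side by side:

```c
#include <stdbool.h>

/* Condensed sketch of the 'prefer' branch of
 * checkForeignTwophaseCommitRequired().  nserverswritten counts all
 * servers the transaction wrote to; nserverstwophase counts the subset
 * that is capable of 2PC. */
static bool
prefer_need_twophase(int nserverswritten, int nserverstwophase,
                     bool count_single_2pc_server)
{
    if (nserverswritten <= 1)
        return false;           /* one writer: a 1PC commit is already atomic */

    if (count_single_2pc_server)
        return nserverstwophase >= 1;   /* proposed: prepare whenever possible */
    return nserverstwophase >= 2;       /* original: need two capable servers */
}
```

The interesting divergence is a transaction writing to two servers of which only one supports 2PC: the proposed condition prepares on the capable server, the original condition falls back to plain commits everywhere.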


 
Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Regards,
...
Muhammad Usama
Highgo Software (Canada/China/Pakistan) 
ADDR: 10318 WHALLEY BLVD, Surrey, BC 

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> > Hi Sawada,
>> >
>> > I have just done some review and testing of the patches and have
>> > a couple of comments.
>>
>> Thank you for reviewing!
>>
>> >
>> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> > when the transaction has operated on a single foreign server regardless
>> > of foreign_twophase_commit setting, and throw an error otherwise when
>> > 2PC is not available on any of the data-modified servers.
>> >
>> > For example, consider the case
>> >
>> > BEGIN;
>> > INSERT INTO ft_2pc_1 VALUES(1);
>> > PREPARE TRANSACTION 'global_x1';
>> >
>> > Here since we are preparing the local transaction so we should also prepare
>> > the transaction on the foreign server even if the transaction has modified only
>> > one foreign table.
>> >
>> > What do you think?
>>
>> Good catch and I agree with you. The transaction should fail if it
>> opened a transaction on a 2pc-no-support server regardless of
>> foreign_twophase_commit. And I think we should prepare a transaction
>> on a foreign server even if it didn't modify any data on that.
>>
>> >
>> > Also without this change, the above test case produces an assertion failure
>> > with your patches.
>> >
>> > 2- when deciding if the two-phase commit is required or not in
>> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> > 2PC when we have at least one server capable of doing that.
>> >
>> > i.e
>> >
>> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> > checkForeignTwophaseCommitRequired() function I think
>> > the condition should be
>> >
>> > need_twophase_commit = (nserverstwophase >= 1);
>> > instead of
>> > need_twophase_commit = (nserverstwophase >= 2);
>> >
>>
>> Hmm I might be missing your point but it seems to me that you want to
>> use two-phase commit even in the case where a transaction modified
>> data on only one server. Can't we commit distributed transaction
>> atomically even using one-phase commit in that case?
>>
>
> I think you are confusing between nserverstwophase and nserverswritten.
>
> need_twophase_commit = (nserverstwophase >= 1)  would mean
> use two-phase commit if at least one server exists in the list that is
> capable of doing 2PC
>
> For the case when the transaction modified data on only one server we
> already exits the function indicating no two-phase required
>
>     if (nserverswritten <= 1)
>       return false;
>

Thank you for your explanation. If the transaction modified two
servers that don't support 2pc and one server that supports 2pc, I
think we don't want to use 2pc even in the 'prefer' case. Because even
if we use 2pc in that case, it's still possible to hit the atomic
commit problem. For example, if we fail to commit a transaction after
committing other transactions on a server that doesn't support 2pc,
we cannot roll back the already-committed transaction.

On the other hand, in 'prefer' case, if the transaction also modified
the local data, we need to use 2pc even if it modified data on only
one foreign server that supports 2pc. But the current code doesn't
work fine in that case for now. Probably we also need the following
change:

@@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)

    /* Did we modify the local non-temporary data? */
    if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
+   {
        nserverswritten++;
+       nserverstwophase++;
+   }

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> > Hi Sawada,
>> >
>> > I have just done some review and testing of the patches and have
>> > a couple of comments.
>>
>> Thank you for reviewing!
>>
>> >
>> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> > when the transaction has operated on a single foreign server regardless
>> > of foreign_twophase_commit setting, and throw an error otherwise when
>> > 2PC is not available on any of the data-modified servers.
>> >
>> > For example, consider the case
>> >
>> > BEGIN;
>> > INSERT INTO ft_2pc_1 VALUES(1);
>> > PREPARE TRANSACTION 'global_x1';
>> >
>> > Here since we are preparing the local transaction so we should also prepare
>> > the transaction on the foreign server even if the transaction has modified only
>> > one foreign table.
>> >
>> > What do you think?
>>
>> Good catch and I agree with you. The transaction should fail if it
>> opened a transaction on a 2pc-no-support server regardless of
>> foreign_twophase_commit. And I think we should prepare a transaction
>> on a foreign server even if it didn't modify any data on that.
>>
>> >
>> > Also without this change, the above test case produces an assertion failure
>> > with your patches.
>> >
>> > 2- when deciding if the two-phase commit is required or not in
>> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> > 2PC when we have at least one server capable of doing that.
>> >
>> > i.e
>> >
>> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> > checkForeignTwophaseCommitRequired() function I think
>> > the condition should be
>> >
>> > need_twophase_commit = (nserverstwophase >= 1);
>> > instead of
>> > need_twophase_commit = (nserverstwophase >= 2);
>> >
>>
>> Hmm I might be missing your point but it seems to me that you want to
>> use two-phase commit even in the case where a transaction modified
>> data on only one server. Can't we commit distributed transaction
>> atomically even using one-phase commit in that case?
>>
>
> I think you are confusing between nserverstwophase and nserverswritten.
>
> need_twophase_commit = (nserverstwophase >= 1)  would mean
> use two-phase commit if at least one server exists in the list that is
> capable of doing 2PC
>
> For the case when the transaction modified data on only one server we
> already exits the function indicating no two-phase required
>
>     if (nserverswritten <= 1)
>       return false;
>

Thank you for your explanation. If the transaction modified two
servers that don't' support 2pc and one server that supports 2pc I
think we don't want to use 2pc even in 'prefer' case. Because even if
we use 2pc in that case, it's still possible to have the atomic commit
problem. For example, if we failed to commit a transaction after
committing other transactions on the server that doesn't support 2pc
we cannot rollback the already-committed transaction.

Yes, that is true, and I think the 'prefer' mode will always have a corner case
no matter what. But the thing is, we can reduce the probability of hitting
the atomic commit problem by ensuring we use 2PC whenever possible.

For instance, take your example scenario where a transaction modified
two servers that don't support 2PC and one server that supports it. Let us
analyze both strategies.

If we use 2PC on the server that supports it, then the probability of hitting
a problem would be 1/3 = 0.33, because there is only one corner-case
scenario: failing to commit on the third server. The first server (the
2PC-capable one) would be using a prepared transaction, so no problem
there. If the second server (without 2PC support) fails to commit, there
is still no problem, as we can roll back the prepared transaction on the
first server. The only issue arises when we fail to commit on the third
server, because we have already committed on the second server and there
is no way to undo that.


Now consider the other possibility: if we do not use 2PC in that
case (as you mentioned), then the probability of hitting the problem
would be 2/3 = 0.66, because now a commit failure on either the second
or the third server lands us in the atomic-commit problem.

So, IMHO, using 2PC whenever it is available in 'prefer' mode
should be the way to go.
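The 1/3-versus-2/3 argument can be checked mechanically by enumerating the possible failure positions. The sketch below is purely illustrative (names and structure are assumptions, not patch code): three servers s0, s1, s2 where only s0 supports 2PC, and exactly one server fails at its commit step.

```c
#include <stdbool.h>

/* Count how many of the three single-failure positions leave an
 * unrecoverable partial commit, i.e. an irrevocable 1PC commit already
 * made on another server when the failure happens. */
static int
count_bad_outcomes(bool use_2pc_on_s0)
{
    int bad = 0;

    for (int fail = 0; fail < 3; fail++)
    {
        int  committed_1pc;     /* irrevocable commits done before the failure */
        bool inconsistent;

        if (use_2pc_on_s0)
        {
            /* order: PREPARE s0, COMMIT s1, COMMIT s2, COMMIT PREPARED s0 */
            if (fail == 1)
                committed_1pc = 0;      /* s0 only prepared: roll it back */
            else if (fail == 2)
                committed_1pc = 1;      /* s1 already committed, cannot undo */
            else
                committed_1pc = 2;      /* fail == 0: COMMIT PREPARED s0 failed,
                                         * but a prepared xact can be retried */
            inconsistent = (fail != 0) && (committed_1pc > 0);
        }
        else
        {
            /* order: COMMIT s0, COMMIT s1, COMMIT s2 */
            committed_1pc = fail;       /* servers committed before the failure */
            inconsistent = (committed_1pc > 0);
        }

        if (inconsistent)
            bad++;
    }
    return bad;
}
```

With 2PC on s0, only a failure on s2 is unrecoverable (1 of 3 positions); without it, failures on s1 or s2 both are (2 of 3), matching the probabilities above.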


On the other hand, in 'prefer' case, if the transaction also modified
the local data, we need to use 2pc even if it modified data on only
one foreign server that supports 2pc. But the current code doesn't
work fine in that case for now. Probably we also need the following
change:

@@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)

    /* Did we modify the local non-temporary data? */
    if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
+   {
        nserverswritten++;
+       nserverstwophase++;
+   }


I agree with the part that if the transaction also modifies local data
then 2PC should be used.
Though the change you suggested  [+       nserverstwophase++;]
would serve the purpose and deliver the same results, I think a
better way would be to change the need_twophase_commit condition for
'prefer' mode.


      * In 'prefer' case, we prepare transactions on only servers that
      * capable of two-phase commit.
      */
-     need_twophase_commit = (nserverstwophase >= 2);
+    need_twophase_commit = (nserverstwophase >= 1);
      }


The reason I am saying that is: currently, we do not use 2PC on the local server
in the case of distributed transactions, so we should also not count the local
server as one of the servers that would be performing 2PC.
Also, I feel the change  need_twophase_commit = (nserverstwophase >= 1)
is more in line with the definition of our 'prefer' mode algorithm.

Do you see an issue with this change?
  
Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Regards,
...
Muhammad Usama
Highgo Software (Canada/China/Pakistan) 
ADDR: 10318 WHALLEY BLVD, Surrey, BC 

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 15 May 2020 at 19:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> >
>> > On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >>
>> >> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >> >
>> >> >
>> >> > Hi Sawada,
>> >> >
>> >> > I have just done some review and testing of the patches and have
>> >> > a couple of comments.
>> >>
>> >> Thank you for reviewing!
>> >>
>> >> >
>> >> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> >> > when the transaction has operated on a single foreign server regardless
>> >> > of foreign_twophase_commit setting, and throw an error otherwise when
>> >> > 2PC is not available on any of the data-modified servers.
>> >> >
>> >> > For example, consider the case
>> >> >
>> >> > BEGIN;
>> >> > INSERT INTO ft_2pc_1 VALUES(1);
>> >> > PREPARE TRANSACTION 'global_x1';
>> >> >
>> >> > Here since we are preparing the local transaction so we should also prepare
>> >> > the transaction on the foreign server even if the transaction has modified only
>> >> > one foreign table.
>> >> >
>> >> > What do you think?
>> >>
>> >> Good catch and I agree with you. The transaction should fail if it
>> >> opened a transaction on a 2pc-no-support server regardless of
>> >> foreign_twophase_commit. And I think we should prepare a transaction
>> >> on a foreign server even if it didn't modify any data on that.
>> >>
>> >> >
>> >> > Also without this change, the above test case produces an assertion failure
>> >> > with your patches.
>> >> >
>> >> > 2- when deciding if the two-phase commit is required or not in
>> >> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> >> > 2PC when we have at least one server capable of doing that.
>> >> >
>> >> > i.e
>> >> >
>> >> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> >> > checkForeignTwophaseCommitRequired() function I think
>> >> > the condition should be
>> >> >
>> >> > need_twophase_commit = (nserverstwophase >= 1);
>> >> > instead of
>> >> > need_twophase_commit = (nserverstwophase >= 2);
>> >> >
>> >>
>> >> Hmm I might be missing your point but it seems to me that you want to
>> >> use two-phase commit even in the case where a transaction modified
>> >> data on only one server. Can't we commit distributed transaction
>> >> atomically even using one-phase commit in that case?
>> >>
>> >
>> > I think you are confusing between nserverstwophase and nserverswritten.
>> >
>> > need_twophase_commit = (nserverstwophase >= 1)  would mean
>> > use two-phase commit if at least one server exists in the list that is
>> > capable of doing 2PC
>> >
>> > For the case when the transaction modified data on only one server we
>> > already exits the function indicating no two-phase required
>> >
>> >     if (nserverswritten <= 1)
>> >       return false;
>> >
>>
>> Thank you for your explanation. If the transaction modified two
>> servers that don't' support 2pc and one server that supports 2pc I
>> think we don't want to use 2pc even in 'prefer' case. Because even if
>> we use 2pc in that case, it's still possible to have the atomic commit
>> problem. For example, if we failed to commit a transaction after
>> committing other transactions on the server that doesn't support 2pc
>> we cannot rollback the already-committed transaction.
>
>
> Yes, that is true, And I think the 'prefer' mode will always have a corner case
> no matter what. But the thing is we can reduce the probability of hitting
> an atomic commit problem by ensuring to use 2PC whenever possible.
>
> For instance as in your example scenario where a transaction modified
> two servers that don't support 2PC and one server that supports it. let us
> analyze both scenarios.
>
> If we use 2PC on the server that supports it then the probability of hitting
> a problem would be 1/3 = 0.33. because there is only one corner case
> scenario in that case. which would be if we fail to commit the third server
> As the first server (2PC supported one) would be using prepared
> transactions so no problem there. The second server (NON-2PC support)
> if failed to commit then, still no problem as we can rollback the prepared
> transaction on the first server. The only issue would happen when we fail
> to commit on the third server because we have already committed
> on the second server and there is no way to undo that.
>
>
> Now consider the other possibility if we do not use the 2PC in that
> case (as you mentioned), then the probability of hitting the problem
> would be 2/3 = 0.66. because now commit failure on either second or
> third server will land us in an atomic-commit-problem.
>
> So, INMO using the 2PC whenever available with 'prefer' mode
> should be the way to go.

My understanding of 'prefer' mode is that even if a distributed
transaction modified data on several types of servers, we can only
ensure that data is kept consistent among the local server and the
foreign servers that support 2pc. It doesn't ensure anything for the
servers that don't support 2pc. Therefore we use 2pc if the transaction
modifies data on two or more servers among the local node and the
servers that support 2pc.

I understand your argument that using 2PC in that case can decrease
the possibility of hitting a problem, but one point we need to
consider is that 2PC has a very high cost. I think most users
basically want to avoid using 2PC as much as possible. Please note
that it might not work as the user expects, because users cannot
specify the commit order and particular servers might be unstable.
I'm not sure that users want to pay such a high cost under those
conditions. If we want to decrease that possibility by using 2PC as
much as possible, I think it could be yet another mode, so that the
user can choose the trade-off.
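To make the disagreement concrete, here is a minimal standalone sketch (hypothetical code, not the actual patch) of the two competing conditions for the 'prefer' branch of checkForeignTwophaseCommitRequired():

```c
#include <stdbool.h>

/*
 * Simplified model of the 'prefer' decision.  nserverswritten counts
 * every server (including the local node) the transaction wrote to;
 * nserverstwophase counts the written servers capable of 2PC.
 */
bool
need_twophase_prefer_current(int nserverswritten, int nserverstwophase)
{
    if (nserverswritten <= 1)
        return false;               /* one server: plain commit is atomic */
    return nserverstwophase >= 2;   /* 2PC only when it buys atomicity */
}

bool
need_twophase_prefer_proposed(int nserverswritten, int nserverstwophase)
{
    if (nserverswritten <= 1)
        return false;
    return nserverstwophase >= 1;   /* 2PC whenever any server can do it */
}
```

With two non-2PC servers and one 2PC-capable server (nserverswritten = 3, nserverstwophase = 1), the first condition skips 2PC while the second uses it, which is exactly the disagreement in this subthread.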

>
>>
>> On the other hand, in the 'prefer' case, if the transaction also
>> modified the local data, we need to use 2PC even if it modified data
>> on only one foreign server that supports 2PC. But the current code
>> doesn't handle that case correctly for now. Probably we also need the
>> following change:
>>
>> @@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)
>>
>>     /* Did we modify the local non-temporary data? */
>>     if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
>> +   {
>>         nserverswritten++;
>> +       nserverstwophase++;
>> +   }
>>
>
> I agree with the part that if the transaction also modifies the local
> data then 2PC should be used.
> Though the change you suggested [+       nserverstwophase++;] would
> serve the purpose and deliver the same results, I think a better way
> would be to change the need_twophase_commit condition for the 'prefer'
> mode.
>
>
>       * In 'prefer' case, we prepare transactions on only servers that
>       * capable of two-phase commit.
>       */
> -     need_twophase_commit = (nserverstwophase >= 2);
> +    need_twophase_commit = (nserverstwophase >= 1);
>       }
>
>
> The reason I am saying that is that currently we do not use 2PC on the
> local server in the case of distributed transactions, so we should not
> count the local server as one of the servers that would be performing
> 2PC.
> Also, I feel the change need_twophase_commit = (nserverstwophase >= 1)
> is more in line with the definition of our 'prefer' mode algorithm.
>
> Do you see an issue with this change?

I think that with my change we will use 2PC in the case where a
transaction modified data on the local node and one server that
supports 2PC. But with your change, we will use 2PC in more cases, in
addition to the case where a transaction modifies the local node and
one 2PC-capable server. This would fit the definition of 'prefer' you
described, but it's still unclear to me that it's better to make
'prefer' mode behave that way if we have three values: 'required',
'prefer' and 'disabled'.
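The 1/3 vs. 2/3 figures quoted above can be reproduced with a toy enumeration (a hypothetical sketch modelling only the commit ordering in the example, not real FDW behavior):

```c
#include <stdbool.h>

/*
 * Three written servers: server 0 is 2PC-capable, servers 1 and 2 are
 * not and are committed in that order.  Exactly one server fails at
 * commit time; count how many of the three failure positions leave
 * the transaction partially committed.
 */
int
count_broken_outcomes(bool use_2pc_on_capable_server)
{
    int broken = 0;

    for (int failing_server = 0; failing_server < 3; failing_server++)
    {
        if (use_2pc_on_capable_server)
        {
            /*
             * Server 0 is only PREPAREd up front, so it can still be
             * rolled back if server 1 fails.  Atomicity breaks only if
             * server 2 fails after server 1 has committed.
             */
            if (failing_server == 2)
                broken++;
        }
        else
        {
            /*
             * All three use one-phase commit in sequence: any failure
             * after the first successful commit is unrecoverable.
             */
            if (failing_server >= 1)
                broken++;
        }
    }
    return broken;
}
```

Under this toy model the 2PC strategy yields 1 broken outcome out of 3 and the one-phase strategy yields 2 out of 3, matching the probabilities in the quoted argument.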

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Fri, May 15, 2020 at 7:52 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Fri, 15 May 2020 at 19:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> >
>> > On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >>
>> >> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >> >
>> >> >
>> >> > Hi Sawada,
>> >> >
>> >> > I have just done some review and testing of the patches and have
>> >> > a couple of comments.
>> >>
>> >> Thank you for reviewing!
>> >>
>> >> >
>> >> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> >> > when the transaction has operated on a single foreign server regardless
>> >> > of foreign_twophase_commit setting, and throw an error otherwise when
>> >> > 2PC is not available on any of the data-modified servers.
>> >> >
>> >> > For example, consider the case
>> >> >
>> >> > BEGIN;
>> >> > INSERT INTO ft_2pc_1 VALUES(1);
>> >> > PREPARE TRANSACTION 'global_x1';
>> >> >
>> >> > Here, since we are preparing the local transaction, we should also
>> >> > prepare the transaction on the foreign server, even if the transaction
>> >> > has modified only one foreign table.
>> >> >
>> >> > What do you think?
>> >>
>> >> Good catch, and I agree with you. The transaction should fail if it
>> >> opened a transaction on a server without 2PC support, regardless of
>> >> foreign_twophase_commit. And I think we should prepare a transaction
>> >> on a foreign server even if it didn't modify any data there.
>> >>
>> >> >
>> >> > Also without this change, the above test case produces an assertion failure
>> >> > with your patches.
>> >> >
>> >> > 2- when deciding if the two-phase commit is required or not in
>> >> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> >> > 2PC when we have at least one server capable of doing that.
>> >> >
>> >> > i.e
>> >> >
>> >> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> >> > checkForeignTwophaseCommitRequired() function I think
>> >> > the condition should be
>> >> >
>> >> > need_twophase_commit = (nserverstwophase >= 1);
>> >> > instead of
>> >> > need_twophase_commit = (nserverstwophase >= 2);
>> >> >
>> >>
>> >> Hmm, I might be missing your point, but it seems to me that you want
>> >> to use two-phase commit even in the case where a transaction modified
>> >> data on only one server. Can't we commit a distributed transaction
>> >> atomically using one-phase commit in that case?
>> >>
>> >
>> > I think you are confusing nserverstwophase with nserverswritten.
>> >
>> > need_twophase_commit = (nserverstwophase >= 1) would mean: use
>> > two-phase commit if at least one server in the list is capable of
>> > doing 2PC.
>> >
>> > For the case when the transaction modified data on only one server,
>> > we already exit the function indicating that no two-phase commit is
>> > required:
>> >
>> >     if (nserverswritten <= 1)
>> >       return false;
>> >
>>
>> Thank you for your explanation. If the transaction modified two
>> servers that don't support 2PC and one server that supports 2PC, I
>> think we don't want to use 2PC even in the 'prefer' case. Because even
>> if we use 2PC in that case, it's still possible to hit the atomic
>> commit problem. For example, if we fail to commit a transaction after
>> committing other transactions on the servers that don't support 2PC,
>> we cannot roll back the already-committed transactions.
>
>
> Yes, that is true, and I think the 'prefer' mode will always have a corner
> case no matter what. But we can reduce the probability of hitting an
> atomic commit problem by using 2PC whenever possible.
>
> For instance, take your example scenario where a transaction modified two
> servers that don't support 2PC and one server that supports it, and let us
> analyze both strategies.
>
> If we use 2PC on the server that supports it, then the probability of
> hitting a problem is 1/3 = 0.33, because there is only one corner-case
> scenario: failing to commit on the third server. The first server (the
> 2PC-capable one) uses a prepared transaction, so no problem there. If the
> second server (non-2PC) fails to commit, there is still no problem, as we
> can roll back the prepared transaction on the first server. The only issue
> arises when we fail to commit on the third server, because we have already
> committed on the second server and there is no way to undo that.
>
> Now consider the other possibility: if we do not use 2PC in that case (as
> you mentioned), then the probability of hitting the problem is 2/3 = 0.66,
> because now a commit failure on either the second or the third server
> lands us in an atomic-commit problem.
>
> So, IMHO, using 2PC whenever available in 'prefer' mode should be the way
> to go.

My understanding of 'prefer' mode is that even if a distributed
transaction modified data on several types of servers, we can ensure
data consistency only among the local server and the foreign servers
that support 2PC. It doesn't guarantee anything for the servers that
don't support 2PC. Therefore we use 2PC if the transaction modifies
data on two or more servers that are either the local node or servers
that support 2PC.

I understand your argument that using 2PC in that case can decrease
the possibility of hitting a problem, but one point we need to
consider is that 2PC has a very high cost. I think most users
basically want to avoid using 2PC as much as possible. Please note
that it might not work as the user expects, because users cannot
specify the commit order and particular servers might be unstable.
I'm not sure that users want to pay such a high cost under those
conditions. If we want to decrease that possibility by using 2PC as
much as possible, I think it could be yet another mode, so that the
user can choose the trade-off.

>
>>
>> On the other hand, in the 'prefer' case, if the transaction also
>> modified the local data, we need to use 2PC even if it modified data
>> on only one foreign server that supports 2PC. But the current code
>> doesn't handle that case correctly for now. Probably we also need the
>> following change:
>>
>> @@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)
>>
>>     /* Did we modify the local non-temporary data? */
>>     if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
>> +   {
>>         nserverswritten++;
>> +       nserverstwophase++;
>> +   }
>>
>
> I agree with the part that if the transaction also modifies the local
> data then 2PC should be used.
> Though the change you suggested [+       nserverstwophase++;] would
> serve the purpose and deliver the same results, I think a better way
> would be to change the need_twophase_commit condition for the 'prefer'
> mode.
>
>
>       * In 'prefer' case, we prepare transactions on only servers that
>       * capable of two-phase commit.
>       */
> -     need_twophase_commit = (nserverstwophase >= 2);
> +    need_twophase_commit = (nserverstwophase >= 1);
>       }
>
>
> The reason I am saying that is that currently we do not use 2PC on the
> local server in the case of distributed transactions, so we should not
> count the local server as one of the servers that would be performing
> 2PC.
> Also, I feel the change need_twophase_commit = (nserverstwophase >= 1)
> is more in line with the definition of our 'prefer' mode algorithm.
>
> Do you see an issue with this change?

I think that with my change we will use 2PC in the case where a
transaction modified data on the local node and one server that
supports 2PC. But with your change, we will use 2PC in more cases, in
addition to the case where a transaction modifies the local node and
one 2PC-capable server. This would fit the definition of 'prefer' you
described, but it's still unclear to me that it's better to make
'prefer' mode behave that way if we have three values: 'required',
'prefer' and 'disabled'.


Thanks for the detailed explanation; now I have a better understanding
of the reasons why we were going for a different solution to the
problem. You are right: my understanding of 'prefer' mode was that we
must use 2PC as much as possible, and the reason for that was the word
'prefer', which as per my understanding means "more desirable/better
to use than another or others".
So the way I understood FOREIGN_TWOPHASE_COMMIT_PREFER was that we
would use 2PC in the maximum possible number of cases, and the user
would already have the expectation that 2PC is more expensive than
1PC.





Regards,
...
Muhammad Usama
Highgo Software (Canada/China/Pakistan) 
ADDR: 10318 WHALLEY BLVD, Surrey, BC 

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, 16 May 2020 at 00:54, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 7:52 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 19:06, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> >
>> > On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >>
>> >> On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >> >>
>> >> >> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >> >> >
>> >> >> >
>> >> >> > Hi Sawada,
>> >> >> >
>> >> >> > I have just done some review and testing of the patches and have
>> >> >> > a couple of comments.
>> >> >>
>> >> >> Thank you for reviewing!
>> >> >>
>> >> >> >
>> >> >> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> >> >> > when the transaction has operated on a single foreign server regardless
>> >> >> > of foreign_twophase_commit setting, and throw an error otherwise when
>> >> >> > 2PC is not available on any of the data-modified servers.
>> >> >> >
>> >> >> > For example, consider the case
>> >> >> >
>> >> >> > BEGIN;
>> >> >> > INSERT INTO ft_2pc_1 VALUES(1);
>> >> >> > PREPARE TRANSACTION 'global_x1';
>> >> >> >
>> >> >> > Here, since we are preparing the local transaction, we should also
>> >> >> > prepare the transaction on the foreign server, even if the transaction
>> >> >> > has modified only one foreign table.
>> >> >> >
>> >> >> > What do you think?
>> >> >>
>> >> >> Good catch, and I agree with you. The transaction should fail if it
>> >> >> opened a transaction on a server without 2PC support, regardless of
>> >> >> foreign_twophase_commit. And I think we should prepare a transaction
>> >> >> on a foreign server even if it didn't modify any data there.
>> >> >>
>> >> >> >
>> >> >> > Also without this change, the above test case produces an assertion failure
>> >> >> > with your patches.
>> >> >> >
>> >> >> > 2- when deciding if the two-phase commit is required or not in
>> >> >> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> >> >> > 2PC when we have at least one server capable of doing that.
>> >> >> >
>> >> >> > i.e
>> >> >> >
>> >> >> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> >> >> > checkForeignTwophaseCommitRequired() function I think
>> >> >> > the condition should be
>> >> >> >
>> >> >> > need_twophase_commit = (nserverstwophase >= 1);
>> >> >> > instead of
>> >> >> > need_twophase_commit = (nserverstwophase >= 2);
>> >> >> >
>> >> >>
>> >> >> Hmm, I might be missing your point, but it seems to me that you want
>> >> >> to use two-phase commit even in the case where a transaction modified
>> >> >> data on only one server. Can't we commit a distributed transaction
>> >> >> atomically using one-phase commit in that case?
>> >> >>
>> >> >
>> >> > I think you are confusing nserverstwophase with nserverswritten.
>> >> >
>> >> > need_twophase_commit = (nserverstwophase >= 1) would mean: use
>> >> > two-phase commit if at least one server in the list is capable of
>> >> > doing 2PC.
>> >> >
>> >> > For the case when the transaction modified data on only one server,
>> >> > we already exit the function indicating that no two-phase commit is
>> >> > required:
>> >> >
>> >> >     if (nserverswritten <= 1)
>> >> >       return false;
>> >> >
>> >>
>> >> Thank you for your explanation. If the transaction modified two
>> >> servers that don't support 2PC and one server that supports 2PC, I
>> >> think we don't want to use 2PC even in the 'prefer' case. Because even
>> >> if we use 2PC in that case, it's still possible to hit the atomic
>> >> commit problem. For example, if we fail to commit a transaction after
>> >> committing other transactions on the servers that don't support 2PC,
>> >> we cannot roll back the already-committed transactions.
>> >
>> >
>> > Yes, that is true, and I think the 'prefer' mode will always have a corner
>> > case no matter what. But we can reduce the probability of hitting an
>> > atomic commit problem by using 2PC whenever possible.
>> >
>> > For instance, take your example scenario where a transaction modified two
>> > servers that don't support 2PC and one server that supports it, and let us
>> > analyze both strategies.
>> >
>> > If we use 2PC on the server that supports it, then the probability of
>> > hitting a problem is 1/3 = 0.33, because there is only one corner-case
>> > scenario: failing to commit on the third server. The first server (the
>> > 2PC-capable one) uses a prepared transaction, so no problem there. If the
>> > second server (non-2PC) fails to commit, there is still no problem, as we
>> > can roll back the prepared transaction on the first server. The only issue
>> > arises when we fail to commit on the third server, because we have already
>> > committed on the second server and there is no way to undo that.
>> >
>> > Now consider the other possibility: if we do not use 2PC in that case (as
>> > you mentioned), then the probability of hitting the problem is 2/3 = 0.66,
>> > because now a commit failure on either the second or the third server
>> > lands us in an atomic-commit problem.
>> >
>> > So, IMHO, using 2PC whenever available in 'prefer' mode should be the way
>> > to go.
>>
>> My understanding of 'prefer' mode is that even if a distributed
>> transaction modified data on several types of servers, we can ensure
>> data consistency only among the local server and the foreign servers
>> that support 2PC. It doesn't guarantee anything for the servers that
>> don't support 2PC. Therefore we use 2PC if the transaction modifies
>> data on two or more servers that are either the local node or servers
>> that support 2PC.
>>
>> I understand your argument that using 2PC in that case can decrease
>> the possibility of hitting a problem, but one point we need to
>> consider is that 2PC has a very high cost. I think most users
>> basically want to avoid using 2PC as much as possible. Please note
>> that it might not work as the user expects, because users cannot
>> specify the commit order and particular servers might be unstable.
>> I'm not sure that users want to pay such a high cost under those
>> conditions. If we want to decrease that possibility by using 2PC as
>> much as possible, I think it could be yet another mode, so that the
>> user can choose the trade-off.
>>
>> >
>> >>
>> >> On the other hand, in the 'prefer' case, if the transaction also
>> >> modified the local data, we need to use 2PC even if it modified data
>> >> on only one foreign server that supports 2PC. But the current code
>> >> doesn't handle that case correctly for now. Probably we also need the
>> >> following change:
>> >>
>> >> @@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)
>> >>
>> >>     /* Did we modify the local non-temporary data? */
>> >>     if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
>> >> +   {
>> >>         nserverswritten++;
>> >> +       nserverstwophase++;
>> >> +   }
>> >>
>> >
>> > I agree with the part that if the transaction also modifies the local
>> > data then 2PC should be used.
>> > Though the change you suggested [+       nserverstwophase++;] would
>> > serve the purpose and deliver the same results, I think a better way
>> > would be to change the need_twophase_commit condition for the 'prefer'
>> > mode.
>> >
>> >
>> >       * In 'prefer' case, we prepare transactions on only servers that
>> >       * capable of two-phase commit.
>> >       */
>> > -     need_twophase_commit = (nserverstwophase >= 2);
>> > +    need_twophase_commit = (nserverstwophase >= 1);
>> >       }
>> >
>> >
>> > The reason I am saying that is that currently we do not use 2PC on the
>> > local server in the case of distributed transactions, so we should not
>> > count the local server as one of the servers that would be performing
>> > 2PC.
>> > Also, I feel the change need_twophase_commit = (nserverstwophase >= 1)
>> > is more in line with the definition of our 'prefer' mode algorithm.
>> >
>> > Do you see an issue with this change?
>>
>> I think that with my change we will use 2PC in the case where a
>> transaction modified data on the local node and one server that
>> supports 2PC. But with your change, we will use 2PC in more cases, in
>> addition to the case where a transaction modifies the local node and
>> one 2PC-capable server. This would fit the definition of 'prefer' you
>> described, but it's still unclear to me that it's better to make
>> 'prefer' mode behave that way if we have three values: 'required',
>> 'prefer' and 'disabled'.
>>
>
> Thanks for the detailed explanation; now I have a better understanding
> of the reasons why we were going for a different solution to the
> problem. You are right: my understanding of 'prefer' mode was that we
> must use 2PC as much as possible, and the reason for that was the word
> 'prefer', which as per my understanding means "more desirable/better
> to use than another or others".
> So the way I understood FOREIGN_TWOPHASE_COMMIT_PREFER was that we
> would use 2PC in the maximum possible number of cases, and the user
> would already have the expectation that 2PC is more expensive than
> 1PC.
>

I think that the current three values are useful for users. The
'required' mode is used when users want to ensure that all writes
involved in the transaction are committed atomically. That being said,
as some FDW plugins might not support the prepare API, we cannot force
users to use this mode all the time when using atomic commit.
Therefore the 'prefer' mode would be useful for this case. Both modes
use 2PC only when it's required for atomic commit.

So what do you think of my idea of adding the behavior you proposed as
another new mode? As it's better to keep the first version as simple
as possible, it might not be added to the first version, but this
behavior might be useful in some cases.

I've attached a new version patch that incorporates some bug fixes
reported by Muhammad. Please review them.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Tue, May 19, 2020 at 12:33 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> I think that the current three values are useful for users. The
> 'required' mode is used when users want to ensure that all writes
> involved in the transaction are committed atomically. That being said,
> as some FDW plugins might not support the prepare API, we cannot force
> users to use this mode all the time when using atomic commit.
> Therefore the 'prefer' mode would be useful for this case. Both modes
> use 2PC only when it's required for atomic commit.
>
> So what do you think of my idea of adding the behavior you proposed as
> another new mode? As it's better to keep the first version as simple
> as possible
>
>

If the intention is to keep the first version simple, then why do we
want to support any mode other than 'required'?  I think it will limit
its usage for the cases where 2PC can be used only when all FDWs
involved support Prepare API but if that helps to keep the design and
patch simpler then why not just do that for the first version and then
extend it later.  OTOH, if you think it will be really useful to keep
the other modes, we could also try to keep those in separate patches
to facilitate the review and discussion of the core feature.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 12:33 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > I think that the current three values are useful for users. The
> > 'required' mode is used when users want to ensure that all writes
> > involved in the transaction are committed atomically. That being said,
> > as some FDW plugins might not support the prepare API, we cannot force
> > users to use this mode all the time when using atomic commit.
> > Therefore the 'prefer' mode would be useful for this case. Both modes
> > use 2PC only when it's required for atomic commit.
> >
> > So what do you think of my idea of adding the behavior you proposed as
> > another new mode? As it's better to keep the first version as simple
> > as possible
>
> If the intention is to keep the first version simple, then why do we
> want to support any mode other than 'required'?  I think it will limit
> its usage for the cases where 2PC can be used only when all FDWs
> involved support Prepare API but if that helps to keep the design and
> patch simpler then why not just do that for the first version and then
> extend it later.  OTOH, if you think it will be really useful to keep
> other modes, then also we could try to keep those in separate patches
> to facilitate the review and discussion of the core feature.

‘disabled’ is the fundamental mode. We also need the 'disabled' mode,
otherwise existing FDWs won't work. I was concerned that many FDW
plugins won't have implemented the FDW transaction APIs yet when users
start using this feature. But it seems to be a good idea to move the
'prefer' mode to a separate patch while leaving 'required'. I'll do
that in the next version of the patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > If the intention is to keep the first version simple, then why do we
> > want to support any mode other than 'required'?  I think it will limit
> > its usage for the cases where 2PC can be used only when all FDWs
> > involved support Prepare API but if that helps to keep the design and
> > patch simpler then why not just do that for the first version and then
> > extend it later.  OTOH, if you think it will be really useful to keep
> > other modes, then also we could try to keep those in separate patches
> > to facilitate the review and discussion of the core feature.
>
> ‘disabled’ is the fundamental mode. We also need the 'disabled' mode,
> otherwise existing FDWs won't work.
>

IIUC, if foreign_twophase_commit is 'disabled', we don't use a
two-phase protocol to commit distributed transactions, right?  So, do
we check this at the time of Prepare or Commit whether we need to use
a two-phase protocol?  I think this should be checked at prepare time.

+        <para>
+         This parameter can be changed at any time; the behavior for any one
+         transaction is determined by the setting in effect when it commits.
+        </para>

This is written w.r.t foreign_twophase_commit.  If one changes this
between prepare and commit, will it have any impact?

>  I was concerned that many FDW
> plugins don't implement FDW transaction APIs yet when users start
> using this feature. But it seems to be a good idea to move 'prefer'
> mode to a separate patch while leaving 'required'. I'll do that in the
> next version patch.
>

Okay, thanks.  Please, see if you can separate out the documentation
for that as well.

Few other comments on v21-0003-Documentation-update:
----------------------------------------------------
1.
+      <entry></entry>
+      <entry>
+       Numeric transaction identifier with that this foreign transaction
+       associates
+      </entry>

/with that this/with which this

2.
+      <entry>
+       The OID of the foreign server on that the foreign transaction
is prepared
+      </entry>

/on that the/on which the

3.
+      <entry><structfield>status</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       Status of foreign transaction. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>initial</literal> : Initial status.
+         </para>

What exactly "Initial status" means?

4.
+      <entry><structfield>in_doubt</structfield></entry>
+      <entry><type>boolean</type></entry>
+      <entry></entry>
+      <entry>
+       If <literal>true</literal> this foreign transaction is
in-doubt status and
+       needs to be resolved by calling <function>pg_resolve_fdwxact</function>
+       function.
+      </entry>

It would be better if you can add an additional sentence to say when
and/or how foreign transactions can reach the in-doubt state.

5.
If <literal>N</literal> local transactions each
+         across <literal>K</literal> foreign server this value need to be set

This part of the sentence can be improved by saying something like:
"If a user expects N local transactions and each of those involves K
foreign servers, this value..".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > If the intention is to keep the first version simple, then why do we
> > > want to support any mode other than 'required'?  I think it will limit
> > > its usage for the cases where 2PC can be used only when all FDWs
> > > involved support Prepare API but if that helps to keep the design and
> > > patch simpler then why not just do that for the first version and then
> > > extend it later.  OTOH, if you think it will be really useful to keep
> > > other modes, then also we could try to keep those in separate patches
> > > to facilitate the review and discussion of the core feature.
> >
> > ‘disabled’ is the fundamental mode.

Oops, I wanted to say 'required' is the fundamental mode.

> > We also need the 'disabled' mode,
> > otherwise existing FDWs won't work.
> >
>
> IIUC, if foreign_twophase_commit is 'disabled', we don't use a
> two-phase protocol to commit distributed transactions, right?  So, do
> we check this at the time of Prepare or Commit whether we need to use
> a two-phase protocol?  I think this should be checked at prepare time.

When a client executes COMMIT on a distributed transaction, 2PC is
used automatically and transparently. In the 'required' case, all
involved (and modified) foreign servers need to support 2PC. So if a
distributed transaction modifies data on a foreign server connected
via an existing FDW which doesn't support 2PC, the transaction cannot
proceed to commit and fails at the pre-commit phase. So there should
be two modes, 'disabled' and 'required', and the default should be
'disabled'.
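As a rough illustration (hypothetical types and names, not patch code), the pre-commit check for 'required' mode described above amounts to:

```c
#include <stdbool.h>

/*
 * Simplified model of the pre-commit check for
 * foreign_twophase_commit = 'required': every server the transaction
 * modified must be capable of two-phase commit, otherwise the commit
 * is rejected before any server has committed anything.
 */
typedef struct
{
    bool modified;
    bool supports_2pc;
} FdwServer;

bool
precommit_check_required(const FdwServer *servers, int nservers)
{
    for (int i = 0; i < nservers; i++)
    {
        if (servers[i].modified && !servers[i].supports_2pc)
            return false;       /* the real code would ereport(ERROR) here */
    }
    return true;                /* safe to start the prepare phase */
}
```

Failing at pre-commit is what keeps the failure recoverable: nothing has been committed anywhere yet, so the whole distributed transaction can still be rolled back.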

>
> +        <para>
> +         This parameter can be changed at any time; the behavior for any one
> +         transaction is determined by the setting in effect when it commits.
> +        </para>
>
> This is written w.r.t foreign_twophase_commit.  If one changes this
> between prepare and commit, will it have any impact?

Since the distributed transaction commit automatically uses 2pc when
executing COMMIT, it's not possible to change foreign_twophase_commit
between prepare and commit. So I'd like to explain the case where a
user executes PREPARE and then COMMIT PREPARED while changing
foreign_twophase_commit.

PREPARE can run only when foreign_twophase_commit is 'required' (or
'prefer') and all foreign servers involved in the transaction
support 2pc. We prepare all foreign transactions regardless of the
number of servers or whether they were modified. If
foreign_twophase_commit is 'disabled', or the transaction modifies
data on a foreign server that doesn't support 2pc, it raises an
error. Similarly, at COMMIT (or ROLLBACK) PREPARED,
foreign_twophase_commit needs to be set to 'required'; it raises an
error if the distributed transaction has a foreign transaction and
foreign_twophase_commit is 'disabled'.
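The validation rules for explicit PREPARE and COMMIT/ROLLBACK PREPARED
described above could be summarized in pseudo-code like this (all names
are hypothetical illustrations, not the actual implementation):

```python
from dataclasses import dataclass

@dataclass
class ForeignServer:
    name: str
    supports_2pc: bool
    modified: bool

def check_prepare(mode: str, servers: list) -> None:
    """Explicit PREPARE: allowed only under 'required' (or 'prefer'),
    and every modified foreign server must be able to prepare."""
    if mode not in ("required", "prefer"):
        raise RuntimeError("PREPARE: foreign_twophase_commit is disabled")
    for s in servers:
        if s.modified and not s.supports_2pc:
            raise RuntimeError(
                f"PREPARE: server {s.name} does not support 2PC")

def check_commit_prepared(mode: str, has_foreign_xact: bool) -> None:
    """COMMIT (or ROLLBACK) PREPARED: similarly requires 'required'
    when the prepared transaction includes foreign transactions."""
    if has_foreign_xact and mode == "disabled":
        raise RuntimeError(
            "COMMIT PREPARED: foreign_twophase_commit is disabled")
```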

>
> >  I was concerned that many FDW
> > plugins don't implement FDW transaction APIs yet when users start
> > using this feature. But it seems to be a good idea to move 'prefer'
> > mode to a separate patch while leaving 'required'. I'll do that in the
> > next version patch.
> >
>
> Okay, thanks.  Please, see if you can separate out the documentation
> for that as well.
>
> Few other comments on v21-0003-Documentation-update:
> ----------------------------------------------------
> 1.
> +      <entry></entry>
> +      <entry>
> +       Numeric transaction identifier with that this foreign transaction
> +       associates
> +      </entry>
>
> /with that this/with which this
>
> 2.
> +      <entry>
> +       The OID of the foreign server on that the foreign transaction
> is prepared
> +      </entry>
>
> /on that the/on which the
>
> 3.
> +      <entry><structfield>status</structfield></entry>
> +      <entry><type>text</type></entry>
> +      <entry></entry>
> +      <entry>
> +       Status of foreign transaction. Possible values are:
> +       <itemizedlist>
> +        <listitem>
> +         <para>
> +          <literal>initial</literal> : Initial status.
> +         </para>
>
> What exactly "Initial status" means?

This part is out-of-date. Fixed.

>
> 4.
> +      <entry><structfield>in_doubt</structfield></entry>
> +      <entry><type>boolean</type></entry>
> +      <entry></entry>
> +      <entry>
> +       If <literal>true</literal> this foreign transaction is
> in-doubt status and
> +       needs to be resolved by calling <function>pg_resolve_fdwxact</function>
> +       function.
> +      </entry>
>
> It would be better if you can add an additional sentence to say when
> and or how can foreign transactions reach in-doubt state.
>
> 5.
> If <literal>N</literal> local transactions each
> +         across <literal>K</literal> foreign server this value need to be set
>
> This part of the sentence can be improved by saying something like:
> "If a user expects N local transactions and each of those involves K
> foreign servers, this value..".

Thanks. I've incorporated all your comments.

I've attached the new version patch set. 0006 is a separate patch
which introduces 'prefer' mode to foreign_twophase_commit.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > +        <para>
> > +         This parameter can be changed at any time; the behavior for any one
> > +         transaction is determined by the setting in effect when it commits.
> > +        </para>
> >
> > This is written w.r.t foreign_twophase_commit.  If one changes this
> > between prepare and commit, will it have any impact?
>
> Since the distributed transaction commit automatically uses 2pc when
> executing COMMIT, it's not possible to change foreign_twophase_commit
> between prepare and commit. So I'd like to explain the case where a
> user executes PREPARE and then COMMIT PREPARED while changing
> foreign_twophase_commit.
>
> PREPARE can run only when foreign_twophase_commit is 'required' (or
> 'prefer') and all foreign servers involved with the transaction
> support 2pc. We prepare all foreign transactions no matter what the
> number of servers and modified or not. If either
> foreign_twophase_commit is 'disabled' or the transaction modifies data
> on a foreign server that doesn't support 2pc, it raises an error. At
> COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs
> to be set to 'required'. It raises an error if the distributed
> transaction has a foreign transaction and foreign_twophase_commit is
> 'disabled'.
>

So, IIUC, it will raise an error if foreign_twophase_commit is
'disabled' (or one of the foreign servers involved doesn't support 2PC),
and the error can be raised both when the user issues PREPARE and when
they issue COMMIT (or ROLLBACK) PREPARED.  If so, isn't it strange that
we raise such an error after PREPARE?  What kind of use-case requires
this?

>
> >
> > 4.
> > +      <entry><structfield>in_doubt</structfield></entry>
> > +      <entry><type>boolean</type></entry>
> > +      <entry></entry>
> > +      <entry>
> > +       If <literal>true</literal> this foreign transaction is
> > in-doubt status and
> > +       needs to be resolved by calling <function>pg_resolve_fdwxact</function>
> > +       function.
> > +      </entry>
> >
> > It would be better if you can add an additional sentence to say when
> > and or how can foreign transactions reach in-doubt state.
> >

+       If <literal>true</literal> this foreign transaction is in-doubt status.
+       A foreign transaction becomes in-doubt status when user canceled the
+       query during transaction commit or the server crashed during transaction
+       commit.

Can we reword the second sentence as: "A foreign transaction can have
this status when the user has cancelled the statement or the server
crashes during transaction commit."?   I have another question about
this field: why can't it be one of the statuses ('preparing',
'prepared', 'committing', 'aborting', 'in-doubt') rather than a
separate field?  Also, isn't it more suitable to name the 'status'
field 'state', because these appear to be more like different states
of a transaction?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > +        <para>
> > > +         This parameter can be changed at any time; the behavior for any one
> > > +         transaction is determined by the setting in effect when it commits.
> > > +        </para>
> > >
> > > This is written w.r.t foreign_twophase_commit.  If one changes this
> > > between prepare and commit, will it have any impact?
> >
> > Since the distributed transaction commit automatically uses 2pc when
> > executing COMMIT, it's not possible to change foreign_twophase_commit
> > between prepare and commit. So I'd like to explain the case where a
> > user executes PREPARE and then COMMIT PREPARED while changing
> > foreign_twophase_commit.
> >
> > PREPARE can run only when foreign_twophase_commit is 'required' (or
> > 'prefer') and all foreign servers involved with the transaction
> > support 2pc. We prepare all foreign transactions no matter what the
> > number of servers and modified or not. If either
> > foreign_twophase_commit is 'disabled' or the transaction modifies data
> > on a foreign server that doesn't support 2pc, it raises an error. At
> > COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs
> > to be set to 'required'. It raises an error if the distributed
> > transaction has a foreign transaction and foreign_twophase_commit is
> > 'disabled'.
> >
>
> So, IIUC, it will raise an error if foreign_twophase_commit is
> 'disabled' (or one of the foreign server involved doesn't support 2PC)
> and the error can be raised both when user issues PREPARE or COMMIT
> (or ROLLBACK) PREPARED.  If so, isn't it strange that we raise such an
> error after PREPARE?  What kind of use-case required this?
>

I don’t have a concrete use-case, but the reason it raises an error
when a user who has set foreign_twophase_commit to 'disabled' executes
COMMIT (or ROLLBACK) PREPARED within a transaction involving at least
one foreign server is that I wanted to make it behave in a similar way
to the COMMIT case. I mean, if a user executes just COMMIT, the
distributed transaction is committed in two phases, but the value of
foreign_twophase_commit is not changed between those two phases. So I
wanted to require the user to set foreign_twophase_commit to ‘required’
both when executing PREPARE and when executing COMMIT (or ROLLBACK)
PREPARED. The implementation can also become simpler because we can
assume that foreign_twophase_commit is always enabled when a
transaction requires foreign transaction preparation and resolution.

> >
> > >
> > > 4.
> > > +      <entry><structfield>in_doubt</structfield></entry>
> > > +      <entry><type>boolean</type></entry>
> > > +      <entry></entry>
> > > +      <entry>
> > > +       If <literal>true</literal> this foreign transaction is
> > > in-doubt status and
> > > +       needs to be resolved by calling <function>pg_resolve_fdwxact</function>
> > > +       function.
> > > +      </entry>
> > >
> > > It would be better if you can add an additional sentence to say when
> > > and or how can foreign transactions reach in-doubt state.
> > >
>
> +       If <literal>true</literal> this foreign transaction is in-doubt status.
> +       A foreign transaction becomes in-doubt status when user canceled the
> +       query during transaction commit or the server crashed during transaction
> +       commit.
>
> Can we reword the second sentence as: "A foreign transaction can have
> this status when the user has cancelled the statement or the server
> crashes during transaction commit."?

Agreed. Updated in my local branch.

>  I have another question about
> this field, why can't it be one of the status ('preparing',
> 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> separate field?

Because I'm using the in-doubt field also for checking whether the
foreign transaction entry can be resolved manually, i.e. via
pg_resolve_foreign_xact(). For instance, a foreign transaction with
status = 'prepared' and in-doubt = 'true' can be resolved either by a
foreign transaction resolver or by pg_resolve_foreign_xact(). When a
user executes pg_resolve_foreign_xact() against the foreign
transaction, it sets status = 'committing' (or 'rollbacking') by
checking the transaction status in the clog. The user might cancel
pg_resolve_foreign_xact() during resolution. In this case, the foreign
transaction still has status = 'committing' and in-doubt = 'true'.
Then if a foreign transaction resolver process processes the foreign
transaction, it can commit it without looking at the clog.
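The life cycle being described could be modeled as a small state
machine, roughly like the sketch below (the class and method names are
hypothetical; only the state and flag values follow the discussion):

```python
class FdwXactEntry:
    """Toy model of a foreign-transaction entry with an in_doubt flag."""

    def __init__(self):
        self.state = "prepared"
        self.in_doubt = False

    def mark_in_doubt(self):
        # Set when the user cancels the query or the server crashes
        # during transaction commit.
        self.in_doubt = True

    def begin_manual_resolution(self, xid_committed_in_clog: bool):
        # pg_resolve_foreign_xact(): consult the clog once and record
        # the decision in the state, so a later retry (e.g. after the
        # user cancels mid-resolution) can skip the clog lookup.
        assert self.in_doubt, "only in-doubt entries are resolved manually"
        self.state = "committing" if xid_committed_in_clog else "rollbacking"

    def resolve(self) -> str:
        # A resolver process can finish a 'committing' entry without
        # looking at the clog again, even after a cancelled attempt.
        assert self.state in ("committing", "rollbacking")
        return "commit" if self.state == "committing" else "rollback"
```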

> Also, isn't it more suitable to name 'status' field
> as 'state' because these appear to be more like different states of
> transaction?

Agreed.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> >  I have another question about
> > this field, why can't it be one of the status ('preparing',
> > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > separate field?
>
> Because I'm using in-doubt field also for checking if the foreign
> transaction entry can also be resolved manually, i.g.
> pg_resolve_foreign_xact(). For instance, a foreign transaction which
> status = 'prepared' and in-doubt = 'true' can be resolved either
> foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> execute pg_resolve_foreign_xact() against the foreign transaction, it
> sets status = 'committing' (or 'rollbacking') by checking transaction
> status in clog. The user might cancel pg_resolve_foreign_xact() during
> resolution. In this case, the foreign transaction is still status =
> 'committing' and in-doubt = 'true'. Then if a foreign transaction
> resolver process processes the foreign transaction, it can commit it
> without clog looking.
>

I think this is a corner case and it is better to simplify the state
recording of foreign transactions than to save a CLOG lookup.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> >
> > >  I have another question about
> > > this field, why can't it be one of the status ('preparing',
> > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > separate field?
> >
> > Because I'm using in-doubt field also for checking if the foreign
> > transaction entry can also be resolved manually, i.g.
> > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > status = 'prepared' and in-doubt = 'true' can be resolved either
> > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > sets status = 'committing' (or 'rollbacking') by checking transaction
> > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > resolution. In this case, the foreign transaction is still status =
> > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > resolver process processes the foreign transaction, it can commit it
> > without clog looking.
> >
>
> I think this is a corner case and it is better to simplify the state
> recording of foreign transactions then to save a CLOG lookup.
>

The main usage of the in-doubt flag is to distinguish between in-doubt
transactions and other transactions that have a waiter (I call these
on-line transactions).  If one foreign server goes down for a long
time after a server crash during a distributed transaction commit, the
foreign transaction resolver tries to resolve the foreign transaction
but fails because the foreign server doesn’t respond. We’d like to
avoid the situation where a resolver process always picks up that
foreign transaction and other on-line transactions waiting to be
resolved cannot move forward. Therefore, a resolver process
prioritizes on-line transactions. Once the shmem queue holding on-line
transactions becomes empty, a resolver process looks at the array of
foreign transaction states to find in-doubt transactions to resolve. I
think we should not process both in-doubt transactions and on-line
transactions in the same way.
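The scheduling policy described above amounts to something like this
(a sketch with made-up names; the real implementation works on a shmem
queue and a shared-memory array):

```python
from collections import deque

def pick_next(online_queue: deque, in_doubt_list: list):
    """Drain waiting (on-line) transactions first; only scan the
    in-doubt list when the queue is empty, so one unreachable foreign
    server cannot starve on-line waiters."""
    if online_queue:
        return online_queue.popleft()
    if in_doubt_list:
        return in_doubt_list.pop(0)
    return None
```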

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > >
> > > >  I have another question about
> > > > this field, why can't it be one of the status ('preparing',
> > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > > separate field?
> > >
> > > Because I'm using in-doubt field also for checking if the foreign
> > > transaction entry can also be resolved manually, i.g.
> > > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > > status = 'prepared' and in-doubt = 'true' can be resolved either
> > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > > sets status = 'committing' (or 'rollbacking') by checking transaction
> > > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > > resolution. In this case, the foreign transaction is still status =
> > > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > > resolver process processes the foreign transaction, it can commit it
> > > without clog looking.
> > >
> >
> > I think this is a corner case and it is better to simplify the state
> > recording of foreign transactions then to save a CLOG lookup.
> >
>
> The main usage of in-doubt flag is to distinguish between in-doubt
> transactions and other transactions that have their waiter (I call
> on-line transactions).
>

Which are these other on-line transactions?  I had assumed that the
foreign transaction resolver process is there to resolve in-doubt
transactions, but it seems it is also used for some other purpose,
which was anyway the next question I had while reviewing other
sections of the docs; let's clarify it as it came up now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > >
> > > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > >
> > > > >  I have another question about
> > > > > this field, why can't it be one of the status ('preparing',
> > > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > > > separate field?
> > > >
> > > > Because I'm using in-doubt field also for checking if the foreign
> > > > transaction entry can also be resolved manually, i.g.
> > > > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > > > status = 'prepared' and in-doubt = 'true' can be resolved either
> > > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > > > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > > > sets status = 'committing' (or 'rollbacking') by checking transaction
> > > > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > > > resolution. In this case, the foreign transaction is still status =
> > > > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > > > resolver process processes the foreign transaction, it can commit it
> > > > without clog looking.
> > > >
> > >
> > > I think this is a corner case and it is better to simplify the state
> > > recording of foreign transactions then to save a CLOG lookup.
> > >
> >
> > The main usage of in-doubt flag is to distinguish between in-doubt
> > transactions and other transactions that have their waiter (I call
> > on-line transactions).
> >
>
> Which are these other online transactions?  I had assumed that foreign
> transaction resolver process is to resolve in-doubt transactions but
> it seems it is also used for some other purpose which anyway was the
> next question I had while reviewing other sections of docs but let's
> clarify as it came up now.

When a distributed transaction is committed by the COMMIT command, the
postgres backend process prepares all foreign transactions and commits
the local transaction. Then the backend enqueues itself onto the shmem
queue, asks a resolver process to commit the prepared foreign
transactions, and waits. That is, these prepared foreign transactions
are committed by the resolver process, not the backend process. Once
the resolver process has committed all prepared foreign transactions,
it wakes the waiting backend process. This kind of transaction is what
I meant by on-line transactions. This procedure is similar to what
synchronous replication does.
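The backend/resolver handoff described above could be sketched like
this (hypothetical names; a Python queue and event stand in for the
shmem queue and the backend's wait latch):

```python
import queue
import threading

log = []                           # records the order of operations
resolution_queue = queue.Queue()   # stands in for the shmem queue

def prepare_foreign(server):
    log.append(("prepare", server))          # PREPARE TRANSACTION

def commit_local():
    log.append(("local_commit",))            # local transaction commits

def commit_prepared_foreign(server):
    log.append(("commit_prepared", server))  # COMMIT PREPARED

def backend_commit(foreign_servers):
    """What the backend does on COMMIT of a distributed transaction."""
    done = threading.Event()
    for s in foreign_servers:
        prepare_foreign(s)
    commit_local()
    resolution_queue.put((foreign_servers, done))  # enqueue itself...
    done.wait()                                    # ...and wait

def resolver_loop_once():
    """One pass of the resolver: commit prepared foreign transactions
    on behalf of the waiting backend, then wake it."""
    servers, done = resolution_queue.get()
    for s in servers:
        commit_prepared_foreign(s)
    done.set()
```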

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > >
> > > > I think this is a corner case and it is better to simplify the state
> > > > recording of foreign transactions then to save a CLOG lookup.
> > > >
> > >
> > > The main usage of in-doubt flag is to distinguish between in-doubt
> > > transactions and other transactions that have their waiter (I call
> > > on-line transactions).
> > >
> >
> > Which are these other online transactions?  I had assumed that foreign
> > transaction resolver process is to resolve in-doubt transactions but
> > it seems it is also used for some other purpose which anyway was the
> > next question I had while reviewing other sections of docs but let's
> > clarify as it came up now.
>
> When a distributed transaction is committed by COMMIT command, the
> postgres backend process prepare all foreign transaction and commit
> the local transaction.
>

Does this mean that we will mark the xid as committed in the CLOG of
the local server?  If so, why is this okay before we have committed
the transactions on all the foreign servers?  What if we fail to
commit on one of the servers?

Few more comments on v22-0003-Documentation-update
--------------------------------------------------------------------------------------
1.
+          When <literal>disabled</literal> there can be risk of database
+          consistency among all servers that involved in the distributed
+          transaction when some foreign server crashes during committing the
+          distributed transaction.

Will it read better if we rephrase the above to something like: "When
<literal>disabled</literal> there can be a risk of database
consistency if one or more foreign servers crashes while committing
the distributed transaction."?

2.
+      <varlistentry
id="guc-foreign-transaction-resolution-rety-interval"
xreflabel="foreign_transaction_resolution_retry_interval">
+       <term><varname>foreign_transaction_resolution_retry_interval</varname>
(<type>integer</type>)
+        <indexterm>
+         <primary><varname>foreign_transaction_resolution_interval</varname>
configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specify how long the foreign transaction resolver should
wait when the last resolution
+         fails before retrying to resolve foreign transaction. This
parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server
command line.
+        </para>
+        <para>
+         The default value is 10 seconds.
+        </para>
+       </listitem>
+      </varlistentry>

Typo.  <varlistentry
id="guc-foreign-transaction-resolution-rety-interval", spelling of
retry is wrong.  Do we really need such a guc parameter?  I think we
can come up with some simple algorithm to retry after a few seconds
and then increase that interval of retry if we fail again or something
like that.  I don't know how users can come up with some non-default
value for this variable.
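The kind of self-tuning retry schedule suggested above, in place of a
fixed retry-interval GUC, might look like simple exponential backoff
(an illustrative sketch; the default of 10 seconds comes from the
quoted docs, the cap is an arbitrary assumption):

```python
def retry_intervals(base=10.0, cap=300.0, factor=2.0, n=6):
    """Return the first n resolution-retry delays, in seconds:
    start at `base` and double after each failure, up to `cap`."""
    out, delay = [], base
    for _ in range(n):
        out.append(delay)
        delay = min(delay * factor, cap)
    return out
```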

3
+      <varlistentry id="guc-foreign-transaction-resolver-timeout"
xreflabel="foreign_transaction_resolver_timeout">
+       <term><varname>foreign_transaction_resolver_timeout</varname>
(<type>integer</type>)
+        <indexterm>
+         <primary><varname>foreign_transaction_resolver_timeout</varname>
configuration parameter</primary>
+        </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Terminate foreign transaction resolver processes that don't
have any foreign
+         transactions to resolve longer than the specified number of
milliseconds.
+         A value of zero disables the timeout mechanism, meaning it
connects to one
+         database until stopping manually.

Can we mention the name of the function that can be used to stop the resolver process?

4.
+   Using the <productname>PostgreSQL</productname>'s atomic commit ensures that
+   all changes on foreign servers end in either commit or rollback using the
+   transaction callback routines

Can we slightly rephase this "Using the PostgreSQL's atomic commit
ensures that all the changes on foreign servers are either committed
or rolled back using the transaction callback routines"?

5.
+       Prepare all transactions on foreign servers.
+       <productname>PostgreSQL</productname> distributed transaction manager
+       prepares all transaction on the foreign servers if two-phase commit is
+       required. Two-phase commit is required when the transaction modifies
+       data on two or more servers including the local server itself and
+       <xref linkend="guc-foreign-twophase-commit"/> is
+       <literal>required</literal>.

/PostgreSQL/PostgreSQL's.

 If all preparations on foreign servers got
+       successful go to the next step.

How about "If the prepare on all foreign servers is successful then go
to the next step"?

 Any failure happens in this step,
+       the server changes to rollback, then rollback all transactions on both
+       local and foreign servers.

Can we rephrase this line to something like: "If there is any failure
in the prepare phase, the server will rollback all the transactions on
both local and foreign servers."?

What if the issued Rollback also fails, say due to a network breakdown
between the local server and one of the foreign servers?  Shouldn't
such a transaction be in the 'in-doubt' state?

6.
+      <para>
+       Commit locally. The server commits transaction locally.  Any
failure happens
+       in this step the server changes to rollback, then rollback all
transactions
+       on both local and foreign servers.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       Resolve all prepared transaction on foreign servers. Pprepared
transactions
+       are committed or rolled back according to the result of the
local transaction.
+       This step is normally performed by a foreign transaction
resolver process.
+      </para>

When (in which step) do we commit on foreign servers?  Do resolver
processes commit on foreign servers?  If so, how can we commit locally
without committing on foreign servers, and what if the commit on one
of the servers fails?  It is not very clear to me from the steps
mentioned here.  Typo, /Pprepared/Prepared

7.
However, foreign transactions
+    become <firstterm>in-doubt</firstterm> in three cases: where the foreign
+    server crashed or lost the connectibility to it during preparing foreign
+    transaction, where the local node crashed during either preparing or
+    resolving foreign transaction and where user canceled the query.

Here the three cases are not very clear.  You might want to use (a)
..., (b) ..., (c) ...  Also, I think the state will be in-doubt even
when we lose the connection to a server during commit or rollback.

8.
+    One foreign transaction resolver is responsible for transaction resolutions
+    on which one database connecting.

Can we rephrase it to: "One foreign transaction resolver is
responsible for transaction resolutions on the database to which it is
connected."?

9.
+    Note that other <productname>PostgreSQL</productname> feature
such as parallel
+    queries, logical replication, etc., also take worker slots from
+    <varname>max_worker_processes</varname>.

/feature/features

10.
+   <para>
+    Atomic commit requires several configuration options to be set.
+    On the local node, <xref
linkend="guc-max-prepared-foreign-transactions"/> and
+    <xref linkend="guc-max-foreign-transaction-resolvers"/> must be
non-zero value.
+    Additionally the <varname>max_worker_processes</varname> may need
to be adjusted to
+    accommodate for foreign transaction resolver workers, at least
+    (<varname>max_foreign_transaction_resolvers</varname> +
<literal>1</literal>).
+    Note that other <productname>PostgreSQL</productname> feature
such as parallel
+    queries, logical replication, etc., also take worker slots from
+    <varname>max_worker_processes</varname>.
+   </para>

Don't we need to mention foreign_twophase_commit GUC here?

11.
+   <sect2 id="fdw-callbacks-transaction-managements">
+    <title>FDW Routines For Transaction Managements</title>

Managements/Management?

12.
+     Transaction management callbacks are used for doing commit, rollback and
+     prepare the foreign transaction.

Lets write the above sentence as: "Transaction management callbacks
are used to commit, rollback and prepare the foreign transaction."

13.
+    <para>
+     Transaction management callbacks are used for doing commit, rollback and
+     prepare the foreign transaction. If an FDW wishes that its foreign
+     transaction is managed by <productname>PostgreSQL</productname>'s global
+     transaction manager it must provide both
+     <function>CommitForeignTransaction</function> and
+     <function>RollbackForeignTransaction</function>. In addition, if an FDW
+     wishes to support <firstterm>atomic commit</firstterm> (as described in
+     <xref linkend="fdw-transaction-managements"/>), it must provide
+     <function>PrepareForeignTransaction</function> as well and can provide
+     <function>GetPrepareId</function> callback optionally.
+    </para>

What exact functionality can an FDW accomplish if it just supports
CommitForeignTransaction and RollbackForeignTransaction?  It seems it
doesn't care for 2PC; if so, is there any special functionality we can
achieve with these APIs which we can't do without them?

14.
+PrepareForeignTransaction(FdwXactRslvState *frstate);
+</programlisting>
+    Prepare the transaction on the foreign server. This function is
called at the
+    pre-commit phase of the local transactions if foreign twophase commit is
+    required. This function is used only for distribute transaction management
+    (see <xref linkend="distributed-transaction"/>).
+    </para>

/distribute/distributed

15.
+   <sect2 id="fdw-transaction-commit-rollback">
+    <title>Commit And Rollback Single Foreign Transaction</title>
+    <para>
+     The FDW callback function <literal>CommitForeignTransaction</literal>
+     and <literal>RollbackForeignTransaction</literal> can be used to commit
+     and rollback the foreign transaction. During transaction commit, the core
+     transaction manager calls
<literal>CommitForeignTransaction</literal> function
+     in the pre-commit phase and calls
+     <literal>RollbackForeignTransaction</literal> function in the
post-rollback
+     phase.
+    </para>

There is no reasoning mentioned as to why CommitForeignTransaction has
to be called in the pre-commit phase and RollbackForeignTransaction in
the post-rollback phase.  Basically, why is one in the pre phase and
the other in the post phase?

16.
+       <entry>
+        <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter>
<type>xid</type>, <parameter>serverid</parameter> <type>oid</type>,
<parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>void</type></entry>
+       <entry>
+        This function works the same as
<function>pg_resolve_foreign_xact</function>
+        except that this removes the foreign transcation entry
without resolution.
+       </entry>

Can we write why and when such a function can be used?  Typo,
/transcation/transaction

17.
+     <row>
+      <entry><literal>FdwXactResolutionLock</literal></entry>
+      <entry>Waiting to read or update information of foreign trasnaction
+       resolution.</entry>
+     </row>

/trasnaction/transaction


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > >
> > > > > I think this is a corner case and it is better to simplify the state
> > > > > recording of foreign transactions then to save a CLOG lookup.
> > > > >
> > > >
> > > > The main usage of in-doubt flag is to distinguish between in-doubt
> > > > transactions and other transactions that have their waiter (I call
> > > > on-line transactions).
> > > >
> > >
> > > Which are these other online transactions?  I had assumed that foreign
> > > transaction resolver process is to resolve in-doubt transactions but
> > > it seems it is also used for some other purpose which anyway was the
> > > next question I had while reviewing other sections of docs but let's
> > > clarify as it came up now.
> >
> > When a distributed transaction is committed by COMMIT command, the
> > postgres backend process prepare all foreign transaction and commit
> > the local transaction.
> >

Thank you for your review comments! Let me answer your question first;
I'll address the review comments afterward.

>
> Does this mean that we will mark the xid as committed in CLOG of the
> local server?

Well, what I meant is that when the client executes the COMMIT command,
the backend executes PREPARE TRANSACTION on all involved foreign
servers and then marks the xid as committed in the clog on the local
server.

>   If so, why is this okay till we commit transactions in
> all the foreign servers, what if we fail to commit on one of the
> servers?

Once the local transaction is committed, the involved foreign
transactions are never rolled back. The backend has already prepared
all foreign transactions before the local commit, so committing a
prepared foreign transaction basically doesn't fail. But even if it
fails for whatever reason, we never roll back any of the prepared
foreign transactions; a resolver process retries committing them at
certain intervals. Does that answer your question?
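To make the sequence concrete, here is a hedged pseudocode sketch of the
protocol described above (this is an illustration only, not the patch's
actual code; the server objects and method names are hypothetical):

```python
import time

def commit_distributed(local, foreign_servers, retry_interval=1.0):
    # Phase 1: prepare every foreign transaction; any failure here
    # rolls back everything, locally and remotely.
    prepared = []
    try:
        for srv in foreign_servers:
            srv.prepare()              # e.g. PREPARE TRANSACTION 'gid'
            prepared.append(srv)
    except Exception:
        for srv in prepared:
            srv.rollback_prepared()
        local.rollback()
        raise

    # The local commit is the decision point: after this, the global
    # outcome is COMMIT and prepared foreign transactions are never
    # rolled back.
    local.commit()                     # marks the xid committed in clog

    # Phase 2: the resolver commits each prepared foreign transaction,
    # retrying failed servers at an interval instead of ever rolling
    # them back.
    pending = list(prepared)
    while pending:
        still_pending = []
        for srv in pending:
            try:
                srv.commit_prepared()  # e.g. COMMIT PREPARED 'gid'
            except Exception:
                still_pending.append(srv)
        if still_pending:
            time.sleep(retry_interval)
        pending = still_pending
```

Note that the retry loop in phase 2 never gives up: once the local
commit has happened, commit is the only permissible outcome.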

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 12, 2020 at 6:24 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > > Which are these other online transactions?  I had assumed that foreign
> > > > transaction resolver process is to resolve in-doubt transactions but
> > > > it seems it is also used for some other purpose which anyway was the
> > > > next question I had while reviewing other sections of docs but let's
> > > > clarify as it came up now.
> > >
> > > When a distributed transaction is committed by COMMIT command, the
> > > postgres backend process prepare all foreign transaction and commit
> > > the local transaction.
> > >
>
> Thank you for your review comments! Let me answer your question first.
> I'll see the review comments.
>
> >
> > Does this mean that we will mark the xid as committed in CLOG of the
> > local server?
>
> Well what I meant is that when the client executes COMMIT command, the
> backend executes PREPARE TRANSACTION command on all involved foreign
> servers and then marks the xid as committed in clog in the local
> server.
>

Won't it create an inconsistency in viewing the data from the
different servers?  Say, such a transaction inserts one row into a
local server and another into the foreign server.  Now, if we follow
the above protocol, the user will be able to see the row from the
local server but not from the foreign server.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, 13 Jun 2020 at 14:02, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 6:24 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > > Which are these other online transactions?  I had assumed that foreign
> > > > > transaction resolver process is to resolve in-doubt transactions but
> > > > > it seems it is also used for some other purpose which anyway was the
> > > > > next question I had while reviewing other sections of docs but let's
> > > > > clarify as it came up now.
> > > >
> > > > When a distributed transaction is committed by COMMIT command, the
> > > > postgres backend process prepare all foreign transaction and commit
> > > > the local transaction.
> > > >
> >
> > Thank you for your review comments! Let me answer your question first.
> > I'll see the review comments.
> >
> > >
> > > Does this mean that we will mark the xid as committed in CLOG of the
> > > local server?
> >
> > Well what I meant is that when the client executes COMMIT command, the
> > backend executes PREPARE TRANSACTION command on all involved foreign
> > servers and then marks the xid as committed in clog in the local
> > server.
> >
>
> Won't it create an inconsistency in viewing the data from the
> different servers?  Say, such a transaction inserts one row into a
> local server and another into the foreign server.  Now, if we follow
> the above protocol, the user will be able to see the row from the
> local server but not from the foreign server.

Yes, you're right. This atomic commit feature doesn't guarantee such
consistent visibility, the so-called atomic visibility. Even if the
local server is not modified, another user could see an inconsistent
result while a resolver process commits the prepared foreign
transactions one by one. Providing globally consistent snapshots to
transactions involving foreign servers is one of the solutions.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Tatsuo Ishii
Date:
>> Won't it create an inconsistency in viewing the data from the
>> different servers?  Say, such a transaction inserts one row into a
>> local server and another into the foreign server.  Now, if we follow
>> the above protocol, the user will be able to see the row from the
>> local server but not from the foreign server.
> 
> Yes, you're right. This atomic commit feature doesn't guarantee such
> consistent visibility so-called atomic visibility. Even the local
> server is not modified, since a resolver process commits prepared
> foreign transactions one by one another user could see an inconsistent
> result. Providing globally consistent snapshots to transactions
> involving foreign servers is one of the solutions.

Another approach to the atomic visibility problem is to control
snapshot acquisition timing and commit timing (plus using REPEATABLE
READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
assigns a snapshot at the time when the first command is executed in a
transaction. If we could prevent any commit while any transaction is
acquiring a snapshot, and prevent any snapshot acquisition while
committing, the visibility inconsistency which Amit explained can be
avoided.

This approach was proposed in an academic paper [1].

A good point of this approach is that we don't need to modify
PostgreSQL at all.

The downside of the approach is that we need someone to control the
timings (in [1], a middleware called "Pangea" was proposed). Also, we
need to limit the transaction isolation level to REPEATABLE READ.

[1] http://www.vldb.org/pvldb/vol2/vldb09-694.pdf
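The mutual exclusion described above can be sketched as a small
coordinator (a toy illustration of the Pangea-style idea, not anything
from [1] or the patch; the class and method names are invented):

```python
import threading

class SnapshotCommitGate:
    """Snapshot acquisition and commit exclude each other, while
    operations of the same kind may proceed concurrently."""
    def __init__(self):
        self._cv = threading.Condition()
        self._snapshots = 0   # transactions currently taking snapshots
        self._commits = 0     # transactions currently committing

    def begin_snapshot(self):
        with self._cv:
            while self._commits > 0:      # no snapshot while committing
                self._cv.wait()
            self._snapshots += 1

    def end_snapshot(self):
        with self._cv:
            self._snapshots -= 1
            self._cv.notify_all()

    def begin_commit(self):
        with self._cv:
            while self._snapshots > 0:    # no commit while snapshotting
                self._cv.wait()
            self._commits += 1

    def end_commit(self):
        with self._cv:
            self._commits -= 1
            self._cv.notify_all()
```

A middleware would wrap the first command of each REPEATABLE READ
transaction (on all nodes) in begin/end_snapshot and the COMMIT
PREPARED phase (on all nodes) in begin/end_commit, so no snapshot can
observe a half-committed distributed transaction.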

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> >> Won't it create an inconsistency in viewing the data from the
> >> different servers?  Say, such a transaction inserts one row into a
> >> local server and another into the foreign server.  Now, if we follow
> >> the above protocol, the user will be able to see the row from the
> >> local server but not from the foreign server.
> >
> > Yes, you're right. This atomic commit feature doesn't guarantee such
> > consistent visibility so-called atomic visibility.

Okay, I understand that the purpose of this feature is to provide
atomic commit, which means the transactions on all involved servers
will either commit or roll back.  However, I think we should at least
see at a high level how visibility will work because it might influence
the implementation of this feature.

> > Even the local
> > server is not modified, since a resolver process commits prepared
> > foreign transactions one by one another user could see an inconsistent
> > result. Providing globally consistent snapshots to transactions
> > involving foreign servers is one of the solutions.

How would it be able to do that?  Say, when it decides to take a
snapshot the transaction on the foreign server appears to be committed
but the transaction on the local server won't appear to be committed,
so the consistent data visibility problem as mentioned above could
still arise.

>
> Another approach to the atomic visibility problem is to control
> snapshot acquisition timing and commit timing (plus using REPEATABLE
> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
> assigns a snapshot at the time when the first command is executed in a
> transaction. If we could prevent any commit while any transaction is
> acquiring snapshot, and we could prevent any snapshot acquisition while
> committing, visibility inconsistency which Amit explained can be
> avoided.
>

I think the problem mentioned above can occur with this as well or if
I am missing something then can you explain in further detail how it
won't create problem in the scenario I have used above?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Tatsuo Ishii
Date:
>> Another approach to the atomic visibility problem is to control
>> snapshot acquisition timing and commit timing (plus using REPEATABLE
>> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
>> assigns a snapshot at the time when the first command is executed in a
>> transaction. If we could prevent any commit while any transaction is
>> acquiring snapshot, and we could prevent any snapshot acquisition while
>> committing, visibility inconsistency which Amit explained can be
>> avoided.
>>
> 
> I think the problem mentioned above can occur with this as well or if
> I am missing something then can you explain in further detail how it
> won't create problem in the scenario I have used above?

So the problem you mentioned above is like this? (S1/S2 denote
transactions (sessions), N1/N2 are the PostgreSQL servers.)  Since S1
has already committed on N1, S2 sees the row on N1.  However, S2 does
not see the row on N2 since S1 has not committed on N2 yet.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N2: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S2/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: SELECT * FROM t1; -- see the row
 i 
---
 1
(1 row)

S2/N2: SELECT * FROM t1; -- doesn't see the row
 i 
---
(0 rows)

S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Mon, Jun 15, 2020 at 12:30 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> >> Another approach to the atomic visibility problem is to control
> >> snapshot acquisition timing and commit timing (plus using REPEATABLE
> >> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL
> >> assigns a snapshot at the time when the first command is executed in a
> >> transaction. If we could prevent any commit while any transaction is
> >> acquiring snapshot, and we could prevent any snapshot acquisition while
> >> committing, visibility inconsistency which Amit explained can be
> >> avoided.
> >>
> >
> > I think the problem mentioned above can occur with this as well or if
> > I am missing something then can you explain in further detail how it
> > won't create problem in the scenario I have used above?
>
> So the problem you mentioned above is like this? (S1/S2 denotes
> transactions (sessions), N1/N2 is the postgreSQL servers).  Since S1
> already committed on N1, S2 sees the row on N1.  However S2 does not
> see the row on N2 since S1 has not committed on N2 yet.
>

Yeah, something along these lines, but S2 can execute the query on N1
directly, which should fetch the data from both N1 and N2.  Even if
there is a solution using the REPEATABLE READ isolation level, we might
not prefer to use that as the only level for distributed transactions;
it might be too costly.  But let us first see how it solves the
problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> >
> > >> Won't it create an inconsistency in viewing the data from the
> > >> different servers?  Say, such a transaction inserts one row into a
> > >> local server and another into the foreign server.  Now, if we follow
> > >> the above protocol, the user will be able to see the row from the
> > >> local server but not from the foreign server.
> > >
> > > Yes, you're right. This atomic commit feature doesn't guarantee such
> > > consistent visibility so-called atomic visibility.
>
> Okay, I understand that the purpose of this feature is to provide
> atomic commit which means the transaction on all servers involved will
> either commit or rollback.  However, I think we should at least see at
> a high level how the visibility will work because it might influence
> the implementation of this feature.
>
> > > Even the local
> > > server is not modified, since a resolver process commits prepared
> > > foreign transactions one by one another user could see an inconsistent
> > > result. Providing globally consistent snapshots to transactions
> > > involving foreign servers is one of the solutions.
>
> How would it be able to do that?  Say, when it decides to take a
> snapshot the transaction on the foreign server appears to be committed
> but the transaction on the local server won't appear to be committed,
> so the consistent data visibility problem as mentioned above could
> still arise.

There are many solutions. For instance, in Postgres-XC/X2 (and maybe
XL), there is a GTM node that is responsible for providing global
transaction IDs (GXIDs) and globally consistent snapshots. All
transactions need to access the GTM when checking distributed
transaction status as well as when starting and ending transactions.
IIUC, if a global transaction accesses a tuple whose GXID is included
in its global snapshot, it waits for that transaction to be committed
or rolled back.
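Under one simple reading of that rule, the visibility check could look
roughly like this (a hypothetical sketch, not Postgres-XC's actual
code; the function and parameter names are invented):

```python
def tuple_visible(tuple_gxid, snap_xmin, snap_xmax, in_progress,
                  wait_for_outcome):
    """Decide whether a tuple created by tuple_gxid is visible to a
    global snapshot [snap_xmin, snap_xmax) with in-progress set."""
    if tuple_gxid < snap_xmin:
        return True                # resolved before the snapshot window
    if tuple_gxid >= snap_xmax:
        return False               # started after the snapshot was taken
    if tuple_gxid in in_progress:
        # The global transaction was running when the snapshot was
        # taken: block until the GTM reports COMMIT or ABORT, then
        # treat the tuple as invisible to this snapshot either way.
        wait_for_outcome(tuple_gxid)
        return False
    return True                    # committed within the window
```

The key point is that waiting for the in-progress transaction's outcome
prevents a reader from racing ahead of a commit that is still being
resolved one node at a time.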

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > > > Even the local
> > > > server is not modified, since a resolver process commits prepared
> > > > foreign transactions one by one another user could see an inconsistent
> > > > result. Providing globally consistent snapshots to transactions
> > > > involving foreign servers is one of the solutions.
> >
> > How would it be able to do that?  Say, when it decides to take a
> > snapshot the transaction on the foreign server appears to be committed
> > but the transaction on the local server won't appear to be committed,
> > so the consistent data visibility problem as mentioned above could
> > still arise.
>
> There are many solutions. For instance, in Postgres-XC/X2 (and maybe
> XL), there is a GTM node that is responsible for providing global
> transaction IDs (GXID) and globally consistent snapshots. All
> transactions need to access GTM when checking the distributed
> transaction status as well as starting transactions and ending
> transactions. IIUC if a global transaction accesses a tuple whose GXID
> is included in its global snapshot it waits for that transaction to be
> committed or rolled back.
>

Is there some mapping between the GXID and the XIDs allocated on each
node, or will each node use the GXID as the XID to modify the data?
Are we fine with parking the work for global snapshots and atomic
visibility to a separate patch and just proceeding with the design
proposed by this patch?  I am asking because I thought there might be
some impact on the design of this patch based on what we decide for
that work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > > > Even the local
> > > > > server is not modified, since a resolver process commits prepared
> > > > > foreign transactions one by one another user could see an inconsistent
> > > > > result. Providing globally consistent snapshots to transactions
> > > > > involving foreign servers is one of the solutions.
> > >
> > > How would it be able to do that?  Say, when it decides to take a
> > > snapshot the transaction on the foreign server appears to be committed
> > > but the transaction on the local server won't appear to be committed,
> > > so the consistent data visibility problem as mentioned above could
> > > still arise.
> >
> > There are many solutions. For instance, in Postgres-XC/X2 (and maybe
> > XL), there is a GTM node that is responsible for providing global
> > transaction IDs (GXID) and globally consistent snapshots. All
> > transactions need to access GTM when checking the distributed
> > transaction status as well as starting transactions and ending
> > transactions. IIUC if a global transaction accesses a tuple whose GXID
> > is included in its global snapshot it waits for that transaction to be
> > committed or rolled back.
> >
>
> Is there some mapping between GXID and XIDs allocated for each node or
> will each node use the GXID as XID to modify the data?   Are we fine
> with parking the work for global snapshots and atomic visibility to a
> separate patch and just proceed with the design proposed by this
> patch?

A distributed transaction involves atomic commit, atomic visibility,
and global consistency. 2PC is the only practical solution for atomic
commit. There are some improvements over 2PC, but those are add-ons to
the basic 2PC, which is what this patch provides. Atomic visibility
and global consistency, however, have alternative solutions, but all
of those solutions require 2PC to be supported. Each of those is a
large piece of work, and trying to get everything in may not work.
Once we have basic 2PC in place, there will be ground to experiment
with solutions for global consistency and atomic visibility. If we
manage to do it right, we could make it pluggable as well. So, I think
we should concentrate on supporting the basic 2PC work now.

> I am asking because I thought there might be some impact on
> the design of this patch based on what we decide for that work.
>

Since 2PC is at the heart of any distributed transaction system, the
impact will be low. Figuring all of that out without having basic 2PC
will be very hard.

-- 
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thank you for your reviews of the 0003 patch. I've incorporated your
comments. I'll submit the latest version of the patch later, as the
design or scope might change as a result of the discussion.

>
> Few more comments on v22-0003-Documentation-update
> --------------------------------------------------------------------------------------
> 1.
> +          When <literal>disabled</literal> there can be risk of database
> +          consistency among all servers that involved in the distributed
> +          transaction when some foreign server crashes during committing the
> +          distributed transaction.
>
> Will it read better if rephrase above to something like: "When
> <literal>disabled</literal> there can be a risk of database
> consistency if one or more foreign servers crashes while committing
> the distributed transaction."?

Fixed.

>
> 2.
> +      <varlistentry
> id="guc-foreign-transaction-resolution-rety-interval"
> xreflabel="foreign_transaction_resolution_retry_interval">
> +       <term><varname>foreign_transaction_resolution_retry_interval</varname>
> (<type>integer</type>)
> +        <indexterm>
> +         <primary><varname>foreign_transaction_resolution_interval</varname>
> configuration parameter</primary>
> +        </indexterm>
> +       </term>
> +       <listitem>
> +        <para>
> +         Specify how long the foreign transaction resolver should
> wait when the last resolution
> +         fails before retrying to resolve foreign transaction. This
> parameter can only be set in the
> +         <filename>postgresql.conf</filename> file or on the server
> command line.
> +        </para>
> +        <para>
> +         The default value is 10 seconds.
> +        </para>
> +       </listitem>
> +      </varlistentry>
>
> Typo.  <varlistentry
> id="guc-foreign-transaction-resolution-rety-interval", spelling of
> retry is wrong.  Do we really need such a guc parameter?  I think we
> can come up with some simple algorithm to retry after a few seconds
> and then increase that interval of retry if we fail again or something
> like that.  I don't know how users can come up with some non-default
> value for this variable.

For example, in a low-reliability network environment, setting a lower
value would help minimize the backend wait time in case the connection
is lost. But I also agree with your point. In terms of implementation,
having backends wait for a fixed time is simpler, but we can do such an
incremental interval by remembering the retry count for each foreign
transaction.
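One possible shape for such an incremental interval is exponential
backoff keyed on the remembered retry count (a sketch only; the
parameter names are illustrative, not the patch's GUCs):

```python
def next_retry_interval(retry_count, base=1.0, cap=60.0):
    """Seconds to wait before the next resolution attempt for a foreign
    transaction that has already failed retry_count times; doubles each
    time, capped so a long outage doesn't stretch the interval forever."""
    return min(base * (2 ** retry_count), cap)
```

With base=1s and cap=60s this yields 1, 2, 4, 8, ... seconds, settling
at one attempt per minute, which removes the need for users to pick a
non-default fixed interval themselves.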

An open question regarding retrying foreign transaction resolution is
how we handle the case where an involved foreign server is down for a
very long time. If an online transaction is waiting to be resolved,
there is no way to exit from the wait loop other than the user sending
a cancel request or the crashed server being restored. But if the
foreign server has to be down for a long time, I think it’s not
practical to send a cancel request because the client would need
something like a timeout mechanism. So I think it might be better to
provide a way to cancel the waiting without the user sending a cancel,
for example, a timeout or a limit on the retry count. If an in-doubt
transaction is waiting to be resolved, we keep trying to resolve the
foreign transaction at an interval. But I wonder if the user might
want to disable automatic in-doubt resolution in some cases, for
example, where the user knows the crashed server will not be restored
for a long time. I’m thinking that we can provide a way to disable
automatic foreign transaction resolution entirely, or to disable it
for a particular foreign transaction.

>
> 3
> +      <varlistentry id="guc-foreign-transaction-resolver-timeout"
> xreflabel="foreign_transaction_resolver_timeout">
> +       <term><varname>foreign_transaction_resolver_timeout</varname>
> (<type>integer</type>)
> +        <indexterm>
> +         <primary><varname>foreign_transaction_resolver_timeout</varname>
> configuration parameter</primary>
> +        </indexterm>
> +       </term>
> +       <listitem>
> +        <para>
> +         Terminate foreign transaction resolver processes that don't
> have any foreign
> +         transactions to resolve longer than the specified number of
> milliseconds.
> +         A value of zero disables the timeout mechanism, meaning it
> connects to one
> +         database until stopping manually.
>
> Can we mention the function name using which one can stop the resolver process?

Fixed.

>
> 4.
> +   Using the <productname>PostgreSQL</productname>'s atomic commit ensures that
> +   all changes on foreign servers end in either commit or rollback using the
> +   transaction callback routines
>
> Can we slightly rephase this "Using the PostgreSQL's atomic commit
> ensures that all the changes on foreign servers are either committed
> or rolled back using the transaction callback routines"?

Fixed.

>
> 5.
> +       Prepare all transactions on foreign servers.
> +       <productname>PostgreSQL</productname> distributed transaction manager
> +       prepares all transaction on the foreign servers if two-phase commit is
> +       required. Two-phase commit is required when the transaction modifies
> +       data on two or more servers including the local server itself and
> +       <xref linkend="guc-foreign-twophase-commit"/> is
> +       <literal>required</literal>.
>
> /PostgreSQL/PostgreSQL's.

Fixed.

>
>  If all preparations on foreign servers got
> +       successful go to the next step.
>
> How about "If the prepare on all foreign servers is successful then go
> to the next step"?

Fixed.

>
>  Any failure happens in this step,
> +       the server changes to rollback, then rollback all transactions on both
> +       local and foreign servers.
>
> Can we rephrase this line to something like: "If there is any failure
> in the prepare phase, the server will rollback all the transactions on
> both local and foreign servers."?

Fixed.

>
> What if the issued Rollback also failed, say due to network breakdown
> between local and one of foreign servers?  Shouldn't such a
> transaction be 'in-doubt' state?

The rollback API that rolls back a transaction in one phase can be
called recursively, so FDWs have to tolerate recursive calls.

In the current patch, all transaction operations are performed
synchronously. That is, a foreign transaction never becomes in-doubt
without an explicit cancel by the user or a crash of the local node.
That way, subsequent transactions can assume that precedent
distributed transactions are already resolved unless the user
canceled.

Let me explain the details:

If the transaction turns to rollback due to a failure before the local
commit, we attempt both ROLLBACK and ROLLBACK PREPARED against foreign
transactions whose status is PREPARING. That is, we end the foreign
transactions with ROLLBACK, and, since we're not sure the preparation
has completed on the foreign server, the backend also asks the
resolver process to do ROLLBACK PREPARED on the foreign servers.
Therefore, FDWs have to tolerate an OBJECT_NOT_FOUND error in the
abort case. Since the backend process returns an acknowledgment to the
client only after rolling back all foreign transactions, these foreign
transactions don't remain in an in-doubt state.

If rolling back fails after the local commit (i.e., the client does
ROLLBACK and the resolver failed to do ROLLBACK PREPARED), a resolver
process will relaunch and retry ROLLBACK PREPARED. The backend process
waits until ROLLBACK PREPARED is successfully done or the user
cancels. So the foreign transactions don't become in-doubt
transactions.
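The tolerance for OBJECT_NOT_FOUND in the abort path can be sketched as
follows (a hypothetical illustration of the requirement described
above; the exception and method names stand in for whatever the FDW
actually reports):

```python
class PreparedTransactionNotFound(Exception):
    """Stands in for the FDW reporting OBJECT_NOT_FOUND."""

def rollback_prepared_tolerant(server, gid):
    """Roll back a possibly-prepared foreign transaction; idempotent."""
    try:
        server.rollback_prepared(gid)
    except PreparedTransactionNotFound:
        # PREPARE never completed on this server (or a prior retry
        # already removed the prepared transaction); per the abort
        # protocol above, this counts as a successful rollback.
        pass
```

Because we cannot know whether PREPARE completed before the failure,
treating "no such prepared transaction" as success is what makes both
the ROLLBACK and ROLLBACK PREPARED attempts safe to issue.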

Synchronousness is also an open question. If we want to support atomic
commit in an asynchronous manner, it might be better to implement that
first in terms of complexity. The backend returns an acknowledgment to
the client immediately after asking the resolver process; this is
known as the early acknowledgment technique. The downside is that a
user who wants to see the result of a precedent transaction needs to
make sure the precedent transaction is committed on all foreign
servers. We will also need to think about how to control it with a GUC
parameter when we have synchronous distributed transaction commit.
Perhaps it’s better to control it independently of synchronous
replication.

>
> 6.
> +      <para>
> +       Commit locally. The server commits transaction locally.  Any
> failure happens
> +       in this step the server changes to rollback, then rollback all
> transactions
> +       on both local and foreign servers.
> +      </para>
> +     </listitem>
> +     <listitem>
> +      <para>
> +       Resolve all prepared transaction on foreign servers. Pprepared
> transactions
> +       are committed or rolled back according to the result of the
> local transaction.
> +       This step is normally performed by a foreign transaction
> resolver process.
> +      </para>
>
> When (in which step) do we commit on foreign servers?  Do Resolver
> processes commit on foreign servers, if so, how can we commit locally
> without committing on foreign servers, what if the commit on one of
> the servers fails? It is not very clear to me from the steps mentioned
> here?

In case 2PC is required, we commit the transactions on the foreign
servers at the final step, by the resolver process. If committing a
prepared transaction on one of the servers fails, a resolver process
relaunches after an interval and retries the commit.

In case 2PC is not required, we commit the transactions on the foreign
servers at the pre-commit phase, by the backend.

> Typo, /Pprepared/Prepared

Fixed.

>
> 7.
> However, foreign transactions
> +    become <firstterm>in-doubt</firstterm> in three cases: where the foreign
> +    server crashed or lost the connectibility to it during preparing foreign
> +    transaction, where the local node crashed during either preparing or
> +    resolving foreign transaction and where user canceled the query.
>
> Here the three cases are not very clear.  You might want to use (a)
> ..., (b) .. ,(c)..

Fixed. I change it to itemizedlist.

> Also, I think the state will be in-doubt even when
> we lost connection to server during commit or rollback.

Let me correct the cases in which foreign transactions remain in an
in-doubt state. There are two cases:

* The local node crashed.
* The user canceled the transaction commit or rollback.

Even when we lose the connection to the server during commit or
rollback of a prepared transaction, the backend doesn’t return an
acknowledgment to the client until either the transaction is
successfully resolved, the user cancels the transaction, or the local
node crashes.

>
> 8.
> +    One foreign transaction resolver is responsible for transaction resolutions
> +    on which one database connecting.
>
> Can we rephrase it to: "One foreign transaction resolver is
> responsible for transaction resolutions on the database to which it is
> connected."?

Fixed.

>
> 9.
> +    Note that other <productname>PostgreSQL</productname> feature
> such as parallel
> +    queries, logical replication, etc., also take worker slots from
> +    <varname>max_worker_processes</varname>.
>
> /feature/features

Fixed.

>
> 10.
> +   <para>
> +    Atomic commit requires several configuration options to be set.
> +    On the local node, <xref
> linkend="guc-max-prepared-foreign-transactions"/> and
> +    <xref linkend="guc-max-foreign-transaction-resolvers"/> must be
> non-zero value.
> +    Additionally the <varname>max_worker_processes</varname> may need
> to be adjusted to
> +    accommodate for foreign transaction resolver workers, at least
> +    (<varname>max_foreign_transaction_resolvers</varname> +
> <literal>1</literal>).
> +    Note that other <productname>PostgreSQL</productname> feature
> such as parallel
> +    queries, logical replication, etc., also take worker slots from
> +    <varname>max_worker_processes</varname>.
> +   </para>
>
> Don't we need to mention foreign_twophase_commit GUC here?

Fixed.
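For reference, a minimal configuration matching the documentation quoted above might look like the following. The values are arbitrary, and the GUC names (including foreign_twophase_commit, which the review asks to be mentioned) are those used by the patch under discussion:

```
# --- local node ---
max_prepared_foreign_transactions = 8   # must be non-zero
max_foreign_transaction_resolvers = 2   # must be non-zero
max_worker_processes = 16               # at least resolvers + 1, plus
                                        # parallel query, logical
                                        # replication, etc.
foreign_twophase_commit = 'required'

# --- each foreign server ---
max_prepared_transactions = 8           # non-zero, or PREPARE TRANSACTION fails
```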

>
> 11.
> +   <sect2 id="fdw-callbacks-transaction-managements">
> +    <title>FDW Routines For Transaction Managements</title>
>
> Managements/Management?

Fixed.

>
> 12.
> +     Transaction management callbacks are used for doing commit, rollback and
> +     prepare the foreign transaction.
>
> Lets write the above sentence as: "Transaction management callbacks
> are used to commit, rollback and prepare the foreign transaction."

Fixed.

>
> 13.
> +    <para>
> +     Transaction management callbacks are used for doing commit, rollback and
> +     prepare the foreign transaction. If an FDW wishes that its foreign
> +     transaction is managed by <productname>PostgreSQL</productname>'s global
> +     transaction manager it must provide both
> +     <function>CommitForeignTransaction</function> and
> +     <function>RollbackForeignTransaction</function>. In addition, if an FDW
> +     wishes to support <firstterm>atomic commit</firstterm> (as described in
> +     <xref linkend="fdw-transaction-managements"/>), it must provide
> +     <function>PrepareForeignTransaction</function> as well and can provide
> +     <function>GetPrepareId</function> callback optionally.
> +    </para>
>
> What exact functionality a FDW can accomplish if it just supports
> CommitForeignTransaction and RollbackForeignTransaction?  It seems it
> doesn't care for 2PC, if so, is there any special functionality we can
> achieve with this which we can't do without these APIs?

There is no special functionality even if an FDW implements
CommitForeignTransaction and RollbackForeignTransaction. Currently,
since there is no transaction API among the FDW APIs, an FDW developer
has to use XactCallback to control transactions, and there is no
documentation for doing so. The idea of allowing an FDW to implement
only CommitForeignTransaction and RollbackForeignTransaction is that
FDW developers can implement transaction management easily. But in the
first patch, we could also disallow it to keep the implementation
simple.
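Under the design described here, the required callback combinations could be checked roughly as follows. This is a self-contained mock: the struct and helper names are invented, and the real argument type (FdwXactRslvState *) is replaced by void * for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock of the FDW routine table, transaction-management callbacks only. */
typedef struct MockFdwRoutine
{
    void (*CommitForeignTransaction) (void *frstate);
    void (*RollbackForeignTransaction) (void *frstate);
    void (*PrepareForeignTransaction) (void *frstate);
    char *(*GetPrepareId) (void *frstate);  /* optional */
} MockFdwRoutine;

/* Placeholder callback for wiring up a mock routine table. */
static void
mock_noop(void *frstate)
{
    (void) frstate;
}

/* Managed by the global transaction manager: needs both the commit
 * and the rollback callback. */
static bool
fdw_is_transaction_managed(const MockFdwRoutine *r)
{
    return r->CommitForeignTransaction != NULL &&
           r->RollbackForeignTransaction != NULL;
}

/* Can participate in atomic commit (2PC): additionally needs the
 * prepare callback; GetPrepareId stays optional. */
static bool
fdw_supports_atomic_commit(const MockFdwRoutine *r)
{
    return fdw_is_transaction_managed(r) &&
           r->PrepareForeignTransaction != NULL;
}
```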

>
> 14.
> +PrepareForeignTransaction(FdwXactRslvState *frstate);
> +</programlisting>
> +    Prepare the transaction on the foreign server. This function is
> called at the
> +    pre-commit phase of the local transactions if foreign twophase commit is
> +    required. This function is used only for distribute transaction management
> +    (see <xref linkend="distributed-transaction"/>).
> +    </para>
>
> /distribute/distributed

Fixed.

>
> 15.
> +   <sect2 id="fdw-transaction-commit-rollback">
> +    <title>Commit And Rollback Single Foreign Transaction</title>
> +    <para>
> +     The FDW callback function <literal>CommitForeignTransaction</literal>
> +     and <literal>RollbackForeignTransaction</literal> can be used to commit
> +     and rollback the foreign transaction. During transaction commit, the core
> +     transaction manager calls
> <literal>CommitForeignTransaction</literal> function
> +     in the pre-commit phase and calls
> +     <literal>RollbackForeignTransaction</literal> function in the
> post-rollback
> +     phase.
> +    </para>
>
> There is no reasoning mentioned as to why CommitForeignTransaction has
> to be called in pre-commit phase and RollbackForeignTransaction in
> post-rollback phase?  Basically why one in pre phase and other in post
> phase?

Good point. This behavior just follows what postgres_fdw does. I'm
not sure of the exact reason why postgres_fdw commits the transaction
in the pre-commit phase, but I guess that since committing a foreign
transaction is more likely to fail than the local commit, it is better
to do it first.
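That guess is really an ordering argument: committing foreign transactions in the pre-commit phase means a foreign failure can still abort the whole transaction locally. A hedged sketch (illustrative C, not postgres_fdw's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Foreign transactions are committed first, in the pre-commit phase,
 * because a failure there can still abort the whole transaction by
 * rolling back locally; the local commit afterwards is the point of
 * no return. */

typedef bool (*commit_fn) (void);

static bool always_ok(void)   { return true; }
static bool always_fail(void) { return false; }

/* Returns true if the whole transaction committed, false if it was
 * aborted because a foreign commit failed. */
static bool
commit_transaction(commit_fn foreign_commits[], int nforeign,
                   commit_fn local_commit)
{
    for (int i = 0; i < nforeign; i++)
    {
        if (!foreign_commits[i]())
            return false;       /* still safe: local transaction aborts */
    }
    return local_commit();      /* point of no return */
}
```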

>
> 16.
> +       <entry>
> +        <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter>
> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>,
> <parameter>userid</parameter> <type>oid</type>)</function></literal>
> +       </entry>
> +       <entry><type>void</type></entry>
> +       <entry>
> +        This function works the same as
> <function>pg_resolve_foreign_xact</function>
> +        except that this removes the foreign transcation entry
> without resolution.
> +       </entry>
>
> Can we write why and when such a function can be used?  Typo,
> /trasnaction/transaction

Fixed.

>
> 17.
> +     <row>
> +      <entry><literal>FdwXactResolutionLock</literal></entry>
> +      <entry>Waiting to read or update information of foreign trasnaction
> +       resolution.</entry>
> +     </row>
>
> /trasnaction/transaction

Fixed.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take2

From
Tatsuo Ishii
Date:
>> > I think the problem mentioned above can occur with this as well or if
>> > I am missing something then can you explain in further detail how it
>> > won't create problem in the scenario I have used above?
>>
>> So the problem you mentioned above is like this? (S1/S2 denotes
>> transactions (sessions), N1/N2 is the postgreSQL servers).  Since S1
>> already committed on N1, S2 sees the row on N1.  However S2 does not
>> see the row on N2 since S1 has not committed on N2 yet.
>>
> 
> Yeah, something on these lines but S2 can execute the query on N1
> directly which should fetch the data from both N1 and N2.

The algorithm assumes that any client accesses the database through
middleware. Such direct access is prohibited.

> Even if
> there is a solution using REPEATABLE READ isolation level we might not
> prefer to use that as the only level for distributed transactions, it
> might be too costly but let us first see how does it solve the
> problem?

The paper extends Snapshot Isolation (SI, which is the same as our
REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI).
I think GSI will solve the problem (atomic visibility) we are
discussing.

Unlike READ COMMITTED, REPEATABLE READ acquires a snapshot at the time
the first command is executed in a transaction (READ COMMITTED
acquires a snapshot at each command in a transaction). Pangea controls
the timing of snapshot acquisition on each pair of transactions
(S1/N1,N2 or S2/N1,N2) so that each pair acquires the same
snapshot. To achieve this, while some transactions are trying to
acquire a snapshot, any commit operation must be postponed. Likewise,
any snapshot acquisition must wait until any in-progress commit
operations are finished (see Algorithms I to III in the paper for more
details). With this rule, the previous example now looks like this;
you can see that the SELECTs on S2/N1 and S2/N2 give the same result.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S2/N1: BEGIN;
BEGIN
S1/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N2: SET transaction_isolation TO 'repeatable read';
SET
S2/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
(0 rows)

S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
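The mutual-wait rule summarized above (commits wait for in-progress snapshot acquisitions, and snapshot acquisitions wait for in-progress commits) can be sketched as a toy model. This is single-threaded and performs no real blocking; it only reports whether an operation may proceed, and all names are illustrative rather than taken from the paper:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of Pangea's rule: no snapshot may be taken while a commit
 * is in progress, and no commit may start while a snapshot acquisition
 * is in progress.  A real implementation would block until the counts
 * drop to zero. */
typedef struct PangeaState
{
    int snapshots_in_progress;
    int commits_in_progress;
} PangeaState;

static bool
may_acquire_snapshot(const PangeaState *s)
{
    return s->commits_in_progress == 0;
}

static bool
may_begin_commit(const PangeaState *s)
{
    return s->snapshots_in_progress == 0;
}
```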

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Bruce Momjian
Date:
On Tue, Jun 16, 2020 at 06:42:52PM +0530, Ashutosh Bapat wrote:
> > Is there some mapping between GXID and XIDs allocated for each node or
> > will each node use the GXID as XID to modify the data?   Are we fine
> > with parking the work for global snapshots and atomic visibility to a
> > separate patch and just proceed with the design proposed by this
> > patch?
> 
> Distributed transaction involves, atomic commit,  atomic visibility
> and global consistency. 2PC is the only practical solution for atomic
> commit. There are some improvements over 2PC but those are add ons to
> the basic 2PC, which is what this patch provides. Atomic visibility
> and global consistency however have alternative solutions but all of
> those solutions require 2PC to be supported. Each of those are large
> pieces of work and trying to get everything in may not work. Once we
> have basic 2PC in place, there will be a ground to experiment with
> solutions for global consistency and atomic visibility. If we manage
> to do it right, we could make it pluggable as well. So, I think we
> should concentrate on supporting basic 2PC work now.

Very good summary, thank you.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
> I've attached the new version patch set. 0006 is a separate patch
> which introduces 'prefer' mode to foreign_twophase_commit.

I hope we can use this feature. Thank you for the patches and the
discussion.
I'm currently working through the logic and found some minor points to
be fixed.

I'm sorry if my understanding is wrong.

* The v22 patches need rebase as they can't apply to the current master.

* FdwXactAtomicCommitParticipants, mentioned in
src/backend/access/fdwxact/README,
   is not implemented. Is FdwXactParticipants the right name?

* The following comment says that this code is for "One-phase",
   but the second argument of FdwXactParticipantEndTransaction()
   indicates that this code is not "onephase".

AtEOXact_FdwXact() in fdwxact.c
    /* One-phase rollback foreign transaction */
    FdwXactParticipantEndTransaction(fdw_part, false, false);

static void
FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool 
onephase,
    bool for_commit)

* The "two_phase_commit" option is mentioned in postgres-fdw.sgml,
    but I can't find the related code.

* A comment in resolver.c contains a sentence with a doubled space
   ("Emergency  Termination").

* There are some inconsistencies with the PostgreSQL wiki.
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

   I understand it's difficult to keep them consistent; I think it's ok
   to fix this later, when these patches are close to being committed.

   - I can't find the "two_phase_commit" option in the source code.
     But 2PC works if the remote server's "max_prepared_transactions"
     is set to a non-zero value. Is that the intended behavior?

   - Some parameters are renamed or added in the latest patches:
     max_prepared_foreign_transaction, max_prepared_transactions and so on.

   - typo: froeign_transaction_resolver_timeout

Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Tue, Jun 16, 2020 at 8:06 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> >> > I think the problem mentioned above can occur with this as well or if
> >> > I am missing something then can you explain in further detail how it
> >> > won't create problem in the scenario I have used above?
> >>
> >> So the problem you mentioned above is like this? (S1/S2 denotes
> >> transactions (sessions), N1/N2 is the postgreSQL servers).  Since S1
> >> already committed on N1, S2 sees the row on N1.  However S2 does not
> >> see the row on N2 since S1 has not committed on N2 yet.
> >>
> >
> > Yeah, something on these lines but S2 can execute the query on N1
> > directly which should fetch the data from both N1 and N2.
>
> The algorithm assumes that any client accesses the database through
> middleware. Such direct access is prohibited.
>

Okay, so it seems we need a few things which the middleware (Pangea)
expects if we are to follow the design of the paper.

> > Even if
> > there is a solution using REPEATABLE READ isolation level we might not
> > prefer to use that as the only level for distributed transactions, it
> > might be too costly but let us first see how does it solve the
> > problem?
>
> The paper extends Snapshot Isolation (SI, which is the same as our
> REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI).
> I think GSI will solve the problem (atomic visibility) we are
> discussing.
>
> Unlike READ COMMITTED, REPEATABLE READ acquires snapshot at the time
> when the first command is executed in a transaction (READ COMMITTED
> acquires a snapshot at each command in a transaction). Pangea controls
> the timing of the snapshot acquisition on pair of transactions
> (S1/N1,N2 or S2/N1,N2) so that each pair acquires the same
> snapshot. To achieve this, while some transactions are trying to
> acquire snapshot, any commit operation should be postponed. Likewise
> any snapshot acquisition should wait until any in progress commit
> operations are finished (see Algorithm I to III in the paper for more
> details).
>

I haven't read the paper completely but it sounds quite restrictive
(both commits and snapshots need to wait).  Another point: do we want
some middleware involved in the solution?  The main thing I was
looking into at this stage is whether the current implementation
proposed by the patch for 2PC is generic enough that we would later be
able to integrate the solution for atomic visibility.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 17 Jun 2020 at 09:01, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> > I've attached the new version patch set. 0006 is a separate patch
> > which introduces 'prefer' mode to foreign_twophase_commit.
>
> I hope we can use this feature. Thank you for making patches and
> discussions.
> I'm currently understanding the logic and found some minor points to be
> fixed.
>
> I'm sorry if my understanding is wrong.
>
> * The v22 patches need rebase as they can't apply to the current master.
>
> * FdwXactAtomicCommitParticipants said in
> src/backend/access/fdwxact/README
>    is not implemented. Is FdwXactParticipants right?

Right.

>
> * A following comment says that this code is for "One-phase",
>    but second argument of FdwXactParticipantEndTransaction() describes
>    this code is not "onephase".
>
> AtEOXact_FdwXact() in fdwxact.c
>         /* One-phase rollback foreign transaction */
>         FdwXactParticipantEndTransaction(fdw_part, false, false);
>
> static void
> FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool
> onephase,
>         bool for_commit)
>
> * "two_phase_commit" option is mentioned in postgres-fdw.sgml,
>     but I can't find related code.
>
> * resolver.c comments have the sentence
>    containing two blanks.(Emergency  Termination)
>
> * There are some inconsistency with PostgreSQL wiki.
> https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
>
>    I understand it's difficult to keep consistency, I think it's ok to
> fix later
>    when these patches almost be able to be committed.
>
>    - I can't find "two_phase_commit" option in the source code.
>      But 2PC is work if the remote server's "max_prepared_transactions"
> is set
>      to non zero value. It is correct work, isn't it?

Yes. I removed the two_phase_commit option from postgres_fdw.
Currently, postgres_fdw uses 2PC when 2PC is required. Therefore,
max_prepared_transactions needs to be set to a non-zero value, as you
mentioned.

>
>    - some parameters are renamed or added in latest patches.
>      max_prepared_foreign_transaction, max_prepared_transactions and so
> on.
>
>    - typo: froeign_transaction_resolver_timeout
>

Thank you for your review! I've incorporated your comments on my
local branch. I'll share the latest version of the patch.

Also, I've updated the wiki page. I'll try to keep the wiki page up-to-date.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Tue, Jun 16, 2020 at 6:43 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Is there some mapping between GXID and XIDs allocated for each node or
> > will each node use the GXID as XID to modify the data?   Are we fine
> > with parking the work for global snapshots and atomic visibility to a
> > separate patch and just proceed with the design proposed by this
> > patch?
>
> Distributed transaction involves, atomic commit,  atomic visibility
> and global consistency. 2PC is the only practical solution for atomic
> commit. There are some improvements over 2PC but those are add ons to
> the basic 2PC, which is what this patch provides. Atomic visibility
> and global consistency however have alternative solutions but all of
> those solutions require 2PC to be supported. Each of those are large
> pieces of work and trying to get everything in may not work. Once we
> have basic 2PC in place, there will be a ground to experiment with
> solutions for global consistency and atomic visibility. If we manage
> to do it right, we could make it pluggable as well.
>

I think it is easier said than done. If you want to make it pluggable,
or want alternative solutions to adopt the 2PC support we provide, we
should have some idea of what those alternative solutions look like.  I
am not saying we have to figure out each and every detail of those
solutions, but without paying any attention to the high-level picture
we might end up doing something for 2PC here which either needs a lot
of modifications or needs a design change, which would be bad.
Basically, if we later decide to use something like a Global Xid to
achieve other features, then what we are doing here might not work.

I think it is a good idea to complete the work in pieces where each
piece is useful on its own but without having clarity on the overall
solution that could be a recipe for disaster.  It is possible that you
have some idea in your mind where you can see clearly how this piece
of work can fit in the bigger picture but it is not very apparent to
others or doesn't seem to be documented anywhere.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take2

From
Tatsuo Ishii
Date:
> okay, so it seems we need few things which middleware (Pangea) expects
> if we have to follow the design of paper.

Yes.

> I haven't read the paper completely but it sounds quite restrictive
> (like both commits and snapshots need to wait).

Maybe. There is a performance evaluation in the paper. You might want
to take a look at it.

> Another point is that
> do we want some middleware involved in the solution?   The main thing
> I was looking into at this stage is do we think that the current
> implementation proposed by the patch for 2PC is generic enough that we
> would be later able to integrate the solution for atomic visibility?

My concern is that FDW+2PC without atomic visibility could lead to
data inconsistency among servers in some cases. If my understanding is
correct, FDW+2PC (without atomic visibility) cannot prevent data
inconsistency in the case below. Initially table t1 has only one row
with i = 0 on both N1 and N2. After S1 and S2 execute concurrently, t1
ends up with different values of i, 0 and 1, on the two nodes.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1;    -- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
 1
(1 row)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 18 Jun 2020 at 08:31, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> > okay, so it seems we need few things which middleware (Pangea) expects
> > if we have to follow the design of paper.
>
> Yes.
>
> > I haven't read the paper completely but it sounds quite restrictive
> > (like both commits and snapshots need to wait).
>
> Maybe. There is a performance evaluation in the paper. You might want
> to take a look at it.
>
> > Another point is that
> > do we want some middleware involved in the solution?   The main thing
> > I was looking into at this stage is do we think that the current
> > implementation proposed by the patch for 2PC is generic enough that we
> > would be later able to integrate the solution for atomic visibility?
>
> My concern is, FDW+2PC without atomic visibility could lead to data
> inconsistency among servers in some cases. If my understanding is
> correct, FDW+2PC (without atomic visibility) cannot prevent data
> inconsistency in the case below. Initially table t1 has only one row
> with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
> now has different value of i, 0 and 1.

IIUC the following sequence won't happen because COMMIT PREPARED
's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'. But as
you mentioned, we cannot prevent data inconsistency even with FDW+2PC,
e.g., when S2 starts a transaction between S1's COMMIT PREPARED on N1
and its COMMIT PREPARED on N2. The point is that this data
inconsistency is caused by an inconsistent read, not by inconsistent
commit results. There are several possible causes of data
inconsistency, and atomic commit and atomic visibility each eliminate
different ones. We can eliminate all possibilities of data
inconsistency only once we support both 2PC and global MVCC.

>
> S1/N1: DROP TABLE t1;
> DROP TABLE
> S1/N1: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N1: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N2: DROP TABLE t1;
> DROP TABLE
> S1/N2: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N2: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N1: BEGIN;
> BEGIN
> S1/N2: BEGIN;
> BEGIN
> S1/N1: UPDATE t1 SET i = i + 1; -- i = 1
> UPDATE 1
> S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
> UPDATE 1
> S1/N1: PREPARE TRANSACTION 's1n1';
> PREPARE TRANSACTION
> S1/N1: COMMIT PREPARED 's1n1';
> COMMIT PREPARED
> S2/N1: BEGIN;
> BEGIN
> S2/N2: BEGIN;
> BEGIN
> S2/N2: DELETE FROM t1 WHERE i = 1;
> DELETE 0
> S2/N1: DELETE FROM t1 WHERE i = 1;
> DELETE 1
> S1/N2: PREPARE TRANSACTION 's1n2';
> PREPARE TRANSACTION
> S2/N1: PREPARE TRANSACTION 's2n1';
> PREPARE TRANSACTION
> S2/N2: PREPARE TRANSACTION 's2n2';
> PREPARE TRANSACTION
> S1/N2: COMMIT PREPARED 's1n2';
> COMMIT PREPARED
> S2/N1: COMMIT PREPARED 's2n1';
> COMMIT PREPARED
> S2/N2: COMMIT PREPARED 's2n2';
> COMMIT PREPARED
> S2/N1: SELECT * FROM t1;
>  i
> ---
> (0 rows)
>
> S2/N2: SELECT * FROM t1;
>  i
> ---
>  1
> (1 row)
>

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take2

From
Tatsuo Ishii
Date:
>> My concern is, FDW+2PC without atomic visibility could lead to data
>> inconsistency among servers in some cases. If my understanding is
>> correct, FDW+2PC (without atomic visibility) cannot prevent data
>> inconsistency in the case below. Initially table t1 has only one row
>> with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
>> now has different value of i, 0 and 1.
> 
> IIUC the following sequence won't happen because COMMIT PREPARED
> 's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'.

You are right.

> But as
> you mentioned, we cannot prevent data inconsistency even with FDW+2PC
> e.g., when S2 starts a transaction between COMMIT PREPARED on N1 and
> COMMIT PREPARED on N2 by S1.

Ok, example updated.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1;    -- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1; -- i = 1
UPDATE 1
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i 
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i 
---
 1
(1 row)

> The point is this data inconsistency is
> lead by an inconsistent read but not by an inconsistent commit
> results. I think there are kinds of possibilities causing data
> inconsistency but atomic commit and atomic visibility eliminate
> different possibilities. We can eliminate all possibilities of data
> inconsistency only after we support 2PC and globally MVCC.

IMO any permanent data inconsistency is a serious problem for users no
matter what the technical reasons are.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Thu, Jun 18, 2020 at 5:01 AM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> > Another point is that
> > do we want some middleware involved in the solution?   The main thing
> > I was looking into at this stage is do we think that the current
> > implementation proposed by the patch for 2PC is generic enough that we
> > would be later able to integrate the solution for atomic visibility?
>
> My concern is, FDW+2PC without atomic visibility could lead to data
> inconsistency among servers in some cases. If my understanding is
> correct, FDW+2PC (without atomic visibility) cannot prevent data
> inconsistency in the case below.
>

You are right and we are not going to claim that after this feature is
committed.  This feature has independent use cases; for example, once
we have parallel copy, it can allow parallel copy even when foreign
tables are involved, and surely there will be more.  I think it is clear that we need
atomic visibility (some way to ensure global consistency) to avoid the
data inconsistency problems you and I are worried about and we can do
that as a separate patch but at this stage, it would be good if we can
have some high-level design of that as well so that if we need some
adjustments in the design/implementation of this patch then we can do
it now.  I think there is some discussion on the other threads (like
[1]) about the kind of stuff we are worried about which I need to
follow up on to study the impact.

Having said that, I don't think that is a reason to stop reviewing or
working on this patch.

[1] - https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Bruce Momjian
Date:
On Thu, Jun 18, 2020 at 04:09:56PM +0530, Amit Kapila wrote:
> You are right and we are not going to claim that after this feature is
> committed.  This feature has independent use cases like it can allow
> parallel copy when foreign tables are involved once we have parallel
> copy and surely there will be more.  I think it is clear that we need
> atomic visibility (some way to ensure global consistency) to avoid the
> data inconsistency problems you and I are worried about and we can do
> that as a separate patch but at this stage, it would be good if we can
> have some high-level design of that as well so that if we need some
> adjustments in the design/implementation of this patch then we can do
> it now.  I think there is some discussion on the other threads (like
> [1]) about the kind of stuff we are worried about which I need to
> follow up on to study the impact.
> 
> Having said that, I don't think that is a reason to stop reviewing or
> working on this patch.

I think our first step is to allow sharding to work on read-only
databases, e.g. data warehousing.  Read/write will require global
snapshots.  It is true that 2PC is of limited usefulness without
global snapshots because, by definition, systems using 2PC are
read-write systems.  However, I can see cases where you are loading
data into a data warehouse but want 2PC so the systems remain
consistent even if there is a crash during loading.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Thu, Jun 18, 2020 at 6:49 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Jun 18, 2020 at 04:09:56PM +0530, Amit Kapila wrote:
> > You are right and we are not going to claim that after this feature is
> > committed.  This feature has independent use cases like it can allow
> > parallel copy when foreign tables are involved once we have parallel
> > copy and surely there will be more.  I think it is clear that we need
> > atomic visibility (some way to ensure global consistency) to avoid the
> > data inconsistency problems you and I are worried about and we can do
> > that as a separate patch but at this stage, it would be good if we can
> > have some high-level design of that as well so that if we need some
> > adjustments in the design/implementation of this patch then we can do
> > it now.  I think there is some discussion on the other threads (like
> > [1]) about the kind of stuff we are worried about which I need to
> > follow up on to study the impact.
> >
> > Having said that, I don't think that is a reason to stop reviewing or
> > working on this patch.
>
> I think our first step is to allow sharding to work on read-only
> databases, e.g. data warehousing.  Read/write will require global
> snapshots.  It is true that 2PC is limited usefulness without global
> snapshots, because, by definition, systems using 2PC are read-write
> systems.   However, I can see cases where you are loading data into a
> data warehouse but want 2PC so the systems remain consistent even if
> there is a crash during loading.
>

For sharding, just implementing 2PC without global consistency
provides limited functionality. But for general purpose federated
databases 2PC serves an important functionality - atomic visibility.
When PostgreSQL is used as one of the coordinators in a heterogeneous
federated database system, it's not expected to have global
consistency or even atomic visibility. But it needs a guarantee that
once a transaction commits, all its legs are committed. 2PC provides
that guarantee as long as the other databases keep their promise that
prepared transactions will always get committed when so requested.
Implicit in this is an HA requirement on these databases as well. So the
functionality provided by this patch is important outside the sharding
case as well.

As you said, even for a data warehousing application, there are some
writes in the form of loading/merging data. If such a write happens
across multiple servers, we need atomic commit to be guaranteed. Some
of these applications can work even if global consistency and atomic
visibility are only guaranteed eventually.

-- 
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 17 Jun 2020 at 14:07, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 17 Jun 2020 at 09:01, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> >
> > > I've attached the new version patch set. 0006 is a separate patch
> > > which introduces 'prefer' mode to foreign_twophase_commit.
> >
> > I hope we can use this feature. Thank you for making patches and
> > discussions.
> > I'm currently understanding the logic and found some minor points to be
> > fixed.
> >
> > I'm sorry if my understanding is wrong.
> >
> > * The v22 patches need rebase as they can't apply to the current master.
> >
> > * FdwXactAtomicCommitParticipants said in
> > src/backend/access/fdwxact/README
> >    is not implemented. Is FdwXactParticipants right?
>
> Right.
>
> >
> > * A following comment says that this code is for "One-phase",
> >    but second argument of FdwXactParticipantEndTransaction() describes
> >    this code is not "onephase".
> >
> > AtEOXact_FdwXact() in fdwxact.c
> >         /* One-phase rollback foreign transaction */
> >         FdwXactParticipantEndTransaction(fdw_part, false, false);
> >
> > static void
> > FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool
> > onephase,
> >         bool for_commit)
> >
> > * "two_phase_commit" option is mentioned in postgres-fdw.sgml,
> >     but I can't find related code.
> >
> > * resolver.c comments have the sentence
> >    containing two blanks.(Emergency  Termination)
> >
> > * There are some inconsistency with PostgreSQL wiki.
> > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
> >
> >    I understand it's difficult to keep consistency, I think it's ok to
> > fix later
> >    when these patches almost be able to be committed.
> >
> >    - I can't find "two_phase_commit" option in the source code.
> >      But 2PC works if the remote server's "max_prepared_transactions"
> >      is set to a non-zero value. That is correct behavior, isn't it?
>
> Yes. I had removed two_phase_commit option from postgres_fdw.
> Currently, postgres_fdw uses 2pc when 2pc is required. Therefore,
> max_prepared_transactions needs to be set to a non-zero value, as you
> mentioned.
>
> >
> >    - some parameters are renamed or added in latest patches.
> >      max_prepared_foreign_transaction, max_prepared_transactions and so
> > on.
> >
> >    - typo: froeign_transaction_resolver_timeout
> >
>
> Thank you for your review! I've incorporated your comments on the
> local branch. I'll share the latest version patch.
>
> Also, I've updated the wiki page. I'll try to keep the wiki page up-to-date.
>

I've attached the latest version patches. I've incorporated the review
comments I got so far and improved locking strategy.

Please review it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
>
> I've attached the latest version patches. I've incorporated the review
> comments I got so far and improved locking strategy.
>

Thanks for updating the patch.

> Please review it.
>

I think at this stage it is important that we do some study of the various
approaches to achieving this work and come up with a comparison of the
pros and cons of each approach: (a) what this patch provides, (b) what
is implemented in the Global Snapshots patch [1], (c) if possible, what is
implemented in Postgres-XL.  I fear that if we go too far in spending
effort on this and later discover that it can be better done via
some other available patch/work (maybe for reasons like that approach
being easily extendable to provide atomic visibility, or its design
being more robust, etc.), then it can lead to a lot of rework.

[1] - https://www.postgresql.org/message-id/20200622150636.GB28999%40momjian.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 23 Jun 2020 at 13:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> >
> > I've attached the latest version patches. I've incorporated the review
> > comments I got so far and improved locking strategy.
> >
>
> Thanks for updating the patch.
>
> > Please review it.
> >
>
> I think at this stage it is important that we do some study of various
> approaches to achieve this work and come up with a comparison of the
> pros and cons of each approach (a) what this patch provides, (b) what
> is implemented in Global Snapshots patch [1], (c) if possible, what is
> implemented in Postgres-XL.  I fear that if go too far in spending
> effort on this and later discovered that it can be better done via
> some other available patch/work (maybe due to a reasons like that
> approach can easily extended to provide atomic visibility or the
> design is more robust, etc.) then it can lead to a lot of rework.

Yeah, I have no objection to that plan, but I think we also need to
keep in mind that (b), (c), and whatever we are considering for global
consistency deal only with PostgreSQL (and postgres_fdw). On the other
hand, this patch needs to implement a feature that can resolve the
atomic commit problem more generically, because the foreign server
might be accessed via oracle_fdw, mysql_fdw, or other FDWs connecting
to database systems that support 2PC.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jun 26, 2020 at 10:50 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 23 Jun 2020 at 13:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think at this stage it is important that we do some study of various
> > approaches to achieve this work and come up with a comparison of the
> > pros and cons of each approach (a) what this patch provides, (b) what
> > is implemented in Global Snapshots patch [1], (c) if possible, what is
> > implemented in Postgres-XL.  I fear that if go too far in spending
> > effort on this and later discovered that it can be better done via
> > some other available patch/work (maybe due to a reasons like that
> > approach can easily extended to provide atomic visibility or the
> > design is more robust, etc.) then it can lead to a lot of rework.
>
> Yeah, I have no objection to that plan but I think we also need to
> keep in mind that (b), (c), and whatever we are thinking about global
> consistency are talking about only PostgreSQL (and postgres_fdw).
>

I think we should explore whether those approaches could be extended
for FDWs, and if not, then that could be considered a disadvantage of
that approach.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Tatsuo Ishii
Date:
>> The point is this data inconsistency is
>> lead by an inconsistent read but not by an inconsistent commit
>> results. I think there are kinds of possibilities causing data
>> inconsistency but atomic commit and atomic visibility eliminate
>> different possibilities. We can eliminate all possibilities of data
>> inconsistency only after we support 2PC and globally MVCC.
> 
> IMO any permanent data inconsistency is a serious problem for users no
> matter what the technical reasons are.

I have incorporated the "Pangea" algorithm into Pgpool-II to implement
atomic visibility. In the test below I have two PostgreSQL servers
(stock v12), server0 (port 11002) and server1 (port 11003).
default_transaction_isolation was set to 'repeatable read' on both
PostgreSQL servers; this is required by Pangea. Pgpool-II replicates
write queries and sends them to both server0 and server1. There are two
tables, "t1" and "log", each having only one integer column "i". I ran
the following script (inconsistency1.sql) via pgbench:

BEGIN;
UPDATE t1 SET i = i + 1;
END;

like: pgbench -n -c 1 -T 30 -f inconsistency1.sql

At the same time I concurrently ran another session from pgbench:

BEGIN;
INSERT INTO log SELECT * FROM t1;
END;

pgbench -n -c 1 -T 30 -f inconsistency2.sql

After finishing those two pgbench runs, I ran the following COPY
commands to see if the contents of table "log" are identical on server0
and server1:
psql -p 11002 -c "\copy log to '11002.txt'"
psql -p 11003 -c "\copy log to '11003.txt'"
cmp 11002.txt 11003.txt

The new Pgpool-II incorporating Pangea showed that 11002.txt and
11003.txt are identical, as expected. This indicates that atomic
visibility is preserved.

On the other hand, a Pgpool-II build that does not implement Pangea
showed differences between those files.
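The failure mode this test detects can be sketched with a toy model (all names invented; this simulates the behavior, not Pgpool-II's code). A writer replicates "i = i + 1" to both servers; a reader copies t1 into log on each server. Without atomic visibility the reader can run between the two per-server commits, so the log tables diverge:

```python
# Toy model of the atomic-visibility test above.  Server names and the
# simulation itself are invented for illustration.

def run(atomic_visibility):
    t1 = {"server0": 0, "server1": 0}       # replicated counter table
    log = {"server0": [], "server1": []}    # per-server "log" table
    for _ in range(3):
        t1["server1"] += 1                  # writer's commit lands on server1
        if not atomic_visibility:
            # Reader slips in between the two per-server commits and
            # records a different value on each server.
            for s in ("server0", "server1"):
                log[s].append(t1[s])
        t1["server0"] += 1                  # ... and then on server0
        if atomic_visibility:
            # Reader sees either both commits or neither, per server.
            for s in ("server0", "server1"):
                log[s].append(t1[s])
    return log
```

Comparing the two per-server logs plays the role of the `cmp 11002.txt 11003.txt` check: they differ without atomic visibility and match with it.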

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
> I've attached the latest version patches. I've incorporated the review
> comments I got so far and improved locking strategy.

Thanks for updating the patch!
I have three questions about the v23 patches.


1. messages related to user canceling

In my understanding, there are two messages
which can be output when a user cancels the COMMIT command.

A. When the prepare fails, the output shows that the transaction
    committed locally but an error occurred.

```
postgres=*# COMMIT;
^CCancel request sent
WARNING:  canceling wait for resolving foreign transaction due to user 
request
DETAIL:  The transaction has already committed locally, but might not 
have been committed on the foreign server.
ERROR:  server closed the connection unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
CONTEXT:  remote SQL command: PREPARE TRANSACTION 
'fx_1020791818_519_16399_10'
```

B. When the prepare succeeds,
    the output shows that the transaction committed locally.

```
postgres=*# COMMIT;
^CCancel request sent
WARNING:  canceling wait for resolving foreign transaction due to user 
request
DETAIL:  The transaction has already committed locally, but might not 
have been committed on the foreign server.
COMMIT
```

In case A, I think the "committed locally" message can confuse users,
because although the message says committed, the transaction is
actually "ABORTED".

I think the "committed" message means that the "ABORT" is committed locally.
But is there a possibility of misunderstanding?

In case A, it would be better to change the message to be more
user-friendly, wouldn't it?


2. typo

Is "trasnactions" in fdwxact.c a typo?


3. FdwXactGetWaiter in fdwxact.c return unused value

FdwXactGetWaiter is called in the FXRslvLoop function.
It returns *waitXid_p, but FXRslvLoop doesn't seem to
use *waitXid_p. Do we need to return it?


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/07/14 9:08, Masahiro Ikeda wrote:
>> I've attached the latest version patches. I've incorporated the review
>> comments I got so far and improved locking strategy.
> 
> Thanks for updating the patch!

+1
I'm interested in these patches and now studying them. While checking
the behaviors of the patched PostgreSQL, I got three comments.

1. We can access foreign tables even during recovery in HEAD.
But in the patched version, when I did that, I got the following error.
Is this intentional?

ERROR:  cannot assign TransactionIds during recovery

2. With the patch, when INSERT/UPDATE/DELETE are executed on both the
local and remote servers, 2PC is executed at the commit phase. But
when a write SQL command other than INSERT/UPDATE/DELETE (e.g.,
TRUNCATE) is executed locally and INSERT/UPDATE/DELETE are executed
remotely, 2PC is NOT executed. Is this safe?

3. XACT_FLAGS_WROTENONTEMPREL is set when INSERT/UPDATE/DELETE
are executed. But it's not reset even when those queries are canceled by
ROLLBACK TO SAVEPOINT. This may cause unnecessary 2PC at the commit phase.
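Point 3 can be illustrated with a toy model (all names and the flag handling are invented for illustration; this is not the patch's implementation): if the "wrote a non-temp relation" flag is sticky across ROLLBACK TO SAVEPOINT, the commit path still chooses 2PC even though every local write was undone, whereas per-subtransaction tracking would avoid that.

```python
# Toy model of flag tracking across savepoints.  events is a sequence
# of "write", "savepoint", "rollback_to_savepoint".

def needs_2pc(events, reset_on_savepoint_rollback):
    stack = [False]                 # wrote-flag per subtransaction level
    for ev in events:
        if ev == "write":
            stack[-1] = True
        elif ev == "savepoint":
            stack.append(False)
        elif ev == "rollback_to_savepoint":
            undone = stack.pop()
            if not reset_on_savepoint_rollback:
                # Sticky flag, like the reported behavior: the rolled
                # back write still counts at commit time.
                stack[-1] = stack[-1] or undone
    return any(stack)
```

With `["savepoint", "write", "rollback_to_savepoint"]`, the sticky variant still requests 2PC at commit even though the write was rolled back; the resetting variant does not.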

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
> I've attached the latest version patches. I've incorporated the review
> comments I got so far and improved locking strategy.

I want to ask a question about streaming replication with 2PC.
Are you going to support 2PC with streaming replication?

I tried streaming replication using the v23 patches.
I confirmed that 2PC works with streaming replication,
where there are primary and standby coordinators.

But, in my understanding, the WAL for "PREPARE" and
"COMMIT/ABORT PREPARED" can't be replicated to the standby server
synchronously.

If this is right, unresolved transactions can occur.

For example,

1. PREPARE is done
2. the primary crashes before the WAL related to PREPARE is
    replicated to the standby server
3. the standby server is promoted // but it can't execute "ABORT PREPARED"

In the above case, the remote server is left with an unresolved transaction.
Can we solve this problem in order to support synchronous replication?

But, I think some users use async replication for performance.
Do we need to document the limitation or provide another solution?
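The failover scenario above can be sketched with a toy model (names invented, including the transaction id): whether the promoted standby can resolve the remote prepared transaction depends on whether the PREPARE WAL record reached it before the crash.

```python
# Toy model of failover with an in-doubt foreign transaction.

def promote_after_crash(prepare_wal_replicated):
    remote_prepared = {"fx_1"}      # remote server holds a prepared tx
    # The standby only knows about the prepared tx if the PREPARE
    # record made it into its WAL before the old primary crashed.
    standby_wal = {"fx_1"} if prepare_wal_replicated else set()
    # After promotion, the new primary can only resolve transactions it
    # has records for; the rest stay unresolved ("in doubt").
    unresolved = remote_prepared - standby_wal
    return unresolved
```

With asynchronous replication the PREPARE record can be lost with the old primary, leaving `fx_1` unresolved on the remote server; if the record had been replicated, the promoted standby could issue COMMIT/ABORT PREPARED for it.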

Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> > I've attached the latest version patches. I've incorporated the review
> > comments I got so far and improved locking strategy.
>
> Thanks for updating the patch!
> I have three questions about the v23 patches.
>
>
> 1. messages related to user canceling
>
> In my understanding, there are two messages
> which can be output when a user cancels the COMMIT command.
>
> A. When prepare is failed, the output shows that
>     committed locally but some error is occurred.
>
> ```
> postgres=*# COMMIT;
> ^CCancel request sent
> WARNING:  canceling wait for resolving foreign transaction due to user
> request
> DETAIL:  The transaction has already committed locally, but might not
> have been committed on the foreign server.
> ERROR:  server closed the connection unexpectedly
>          This probably means the server terminated abnormally
>          before or while processing the request.
> CONTEXT:  remote SQL command: PREPARE TRANSACTION
> 'fx_1020791818_519_16399_10'
> ```
>
> B. When prepare is succeeded,
>     the output show that committed locally.
>
> ```
> postgres=*# COMMIT;
> ^CCancel request sent
> WARNING:  canceling wait for resolving foreign transaction due to user
> request
> DETAIL:  The transaction has already committed locally, but might not
> have been committed on the foreign server.
> COMMIT
> ```
>
> In case of A, I think that "committed locally" message can confuse user.
> Because although messages show committed but the transaction is
> "ABORTED".
>
> I think "committed" message means that "ABORT" is committed locally.
> But is there a possibility of misunderstanding?

No, you're right. I'll fix it in the next version patch.

I think synchronous replication also has the same problem. It says
"the transaction has already committed" but it's not true when
executing ROLLBACK PREPARED.

BTW how did you test the case (A)? It says canceling wait for foreign
transaction resolution but the remote SQL command is PREPARE
TRANSACTION.

>
> In case of A, it's better to change message for user friendly, isn't it?
>
>
> 2. typo
>
> Is trasnactions in fdwxact.c typo?
>

Fixed.

>
> 3. FdwXactGetWaiter in fdwxact.c return unused value
>
> FdwXactGetWaiter is called in FXRslvLoop function.
> It returns *waitXid_p, but FXRslvloop doesn't seem to
> use *waitXid_p. Do we need to return it?

Removed.

I've incorporated the above your comments in the local branch. I'll
post the latest version patch after incorporating other comments soon.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/07/15 15:06, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>>
>>> I've attached the latest version patches. I've incorporated the review
>>> comments I got so far and improved locking strategy.
>>
>> Thanks for updating the patch!
>> I have three questions about the v23 patches.
>>
>>
>> 1. messages related to user canceling
>>
>> In my understanding, there are two messages
>> which can be output when a user cancels the COMMIT command.
>>
>> A. When prepare is failed, the output shows that
>>      committed locally but some error is occurred.
>>
>> ```
>> postgres=*# COMMIT;
>> ^CCancel request sent
>> WARNING:  canceling wait for resolving foreign transaction due to user
>> request
>> DETAIL:  The transaction has already committed locally, but might not
>> have been committed on the foreign server.
>> ERROR:  server closed the connection unexpectedly
>>           This probably means the server terminated abnormally
>>           before or while processing the request.
>> CONTEXT:  remote SQL command: PREPARE TRANSACTION
>> 'fx_1020791818_519_16399_10'
>> ```
>>
>> B. When prepare is succeeded,
>>      the output show that committed locally.
>>
>> ```
>> postgres=*# COMMIT;
>> ^CCancel request sent
>> WARNING:  canceling wait for resolving foreign transaction due to user
>> request
>> DETAIL:  The transaction has already committed locally, but might not
>> have been committed on the foreign server.
>> COMMIT
>> ```
>>
>> In case of A, I think that "committed locally" message can confuse user.
>> Because although messages show committed but the transaction is
>> "ABORTED".
>>
>> I think "committed" message means that "ABORT" is committed locally.
>> But is there a possibility of misunderstanding?
> 
> No, you're right. I'll fix it in the next version patch.
> 
> I think synchronous replication also has the same problem. It says
> "the transaction has already committed" but it's not true when
> executing ROLLBACK PREPARED.

Yes. Also the same message is logged when executing PREPARE TRANSACTION.
Maybe it should be changed to "the transaction has already prepared".

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
On 2020-07-15 15:06, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> 
> wrote:
>> 
>> > I've attached the latest version patches. I've incorporated the review
>> > comments I got so far and improved locking strategy.
>> 
>> Thanks for updating the patch!
>> I have three questions about the v23 patches.
>> 
>> 
>> 1. messages related to user canceling
>> 
>> In my understanding, there are two messages
>> which can be output when a user cancels the COMMIT command.
>> 
>> A. When prepare is failed, the output shows that
>>     committed locally but some error is occurred.
>> 
>> ```
>> postgres=*# COMMIT;
>> ^CCancel request sent
>> WARNING:  canceling wait for resolving foreign transaction due to user
>> request
>> DETAIL:  The transaction has already committed locally, but might not
>> have been committed on the foreign server.
>> ERROR:  server closed the connection unexpectedly
>>          This probably means the server terminated abnormally
>>          before or while processing the request.
>> CONTEXT:  remote SQL command: PREPARE TRANSACTION
>> 'fx_1020791818_519_16399_10'
>> ```
>> 
>> B. When prepare is succeeded,
>>     the output show that committed locally.
>> 
>> ```
>> postgres=*# COMMIT;
>> ^CCancel request sent
>> WARNING:  canceling wait for resolving foreign transaction due to user
>> request
>> DETAIL:  The transaction has already committed locally, but might not
>> have been committed on the foreign server.
>> COMMIT
>> ```
>> 
>> In case of A, I think that "committed locally" message can confuse 
>> user.
>> Because although messages show committed but the transaction is
>> "ABORTED".
>> 
>> I think "committed" message means that "ABORT" is committed locally.
>> But is there a possibility of misunderstanding?
> 
> No, you're right. I'll fix it in the next version patch.
> 
> I think synchronous replication also has the same problem. It says
> "the transaction has already committed" but it's not true when
> executing ROLLBACK PREPARED.

Thanks for replying and sharing the synchronous replication problem.

> BTW how did you test the case (A)? It says canceling wait for foreign
> transaction resolution but the remote SQL command is PREPARE
> TRANSACTION.

I think the timing of failures is important for 2PC testing.
Since I don't have any good way to simulate those flexibly,
I used the GDB debugger.

The message of case (A) is produced
after performing the following operations.

1. Attach the debugger to a backend process.
2. Set a breakpoint at PreCommit_FdwXact() in CommitTransaction().
    // Before PREPARE.
3. Execute "BEGIN" and insert data into two remote foreign tables.
4. Issue a "COMMIT" command.
5. The backend process stops at the breakpoint.
6. Stop a remote foreign server.
7. Detach the debugger.
   // The backend continues and the prepare fails. The resolver tries
   // to abort all remote transactions.
   // It's unnecessary to resolve remote transactions whose prepare
   // failed, isn't it?
8. Send a cancel request.


BTW, I am concerned about how to test the 2PC patches.
There are many failure patterns, such as failure timing,
server/network failures (and unexpected recoveries), and combinations
of those...

Though it would be best to test those failure patterns automatically,
I have no idea how for now, so I manually check some patterns.


> I've incorporated the above your comments in the local branch. I'll
> post the latest version patch after incorporating other comments soon.

OK, Thanks.


Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> > I've attached the latest version patches. I've incorporated the review
> > comments I got so far and improved locking strategy.
>
> I want to ask a question about streaming replication with 2PC.
> Are you going to support 2PC with streaming replication?
>
> I tried streaming replication using v23 patches.
> I confirm that 2PC works with streaming replication,
> which there are primary/standby coordinator.
>
> But, in my understanding, the WAL of "PREPARE" and
> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in
> sync.
>
> If this is right, the unresolved transaction can be occurred.
>
> For example,
>
> 1. PREPARE is done
> 2. crash primary before the WAL related to PREPARE is
>     replicated to the standby server
> 3. promote standby server // but can't execute "ABORT PREPARED"
>
> In above case, the remote server has the unresolved transaction.
> Can we solve this problem to support in-sync replication?
>
> But, I think some users use async replication for performance.
> Do we need to document the limitation or make another solution?
>

IIUC, with synchronous replication we can guarantee that WAL records
are written on both the primary and the replicas once the client gets
an acknowledgment of commit. We don't replicate each WAL record
generated during a transaction one by one in sync. In the case you
described, the client will get an error due to the server crash.
Therefore I think the user cannot expect the WAL records generated so
far to have been replicated. The same issue could also happen when the
user executes PREPARE TRANSACTION and the server crashes. To prevent
this issue, I think we would need to send each WAL record in sync, but
I'm not sure that's reasonable behavior, and as long as we write WAL
locally and then send it to replicas we would need a smart mechanism
to prevent this situation.

Related to Ikeda-san's point, I realized that with the current patch
the backend waits for synchronous replication and then waits for
foreign transaction resolution, but it should be the reverse.
Otherwise, it could lead to data loss even when the client got an
acknowledgment of commit. Also, when the user is using both atomic
commit and synchronous replication and wants to cancel waiting, he/she
will need to press ctrl-c twice with the current patch, which also
should be fixed.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
Hi Sawada san,


I'm reviewing this patch series, and let me give some initial comments and questions.  I'm looking at this with a hope
that this will be useful purely as an FDW enhancement for our new use cases, regardless of whether the FDW will be used
for Postgres scale-out.

I don't think it's necessarily required to combine 2PC with global visibility.  The X/Open XA specification only
handles the atomic commit.  The only part in the XA specification that refers to global visibility is the following:


[Quote from XA specification]
--------------------------------------------------
2.3.2 Protocol Optimisations 
・ Read-only 
An RM can respond to the TM’s prepare request by asserting that the RM was not 
asked to update shared resources in this transaction branch. This response 
concludes the RM’s involvement in the transaction; the Phase 2 dialogue between 
the TM and this RM does not occur. The TM need not stably record, in its list of 
participating RMs, an RM that asserts a read-only role in the global transaction. 

However, if the RM returns the read-only optimisation before all work on the global 
transaction is prepared, global serialisability1 cannot be guaranteed. This is because 
the RM may release transaction context, such as read locks, before all application 
activity for that global transaction is finished. 

1. 
Serialisability is a property of a set of concurrent transactions. For a serialisable set of transactions, at least one

serial sequence of the transactions exists that produces identical results, with respect to shared resources, as does 
concurrent execution of the transaction. 
--------------------------------------------------


(1)
Do other popular DBMSs (Oracle, MySQL, etc.) provide concrete functions that can be used for the new FDW
commit/rollback/prepare API?  I'm asking this to confirm that we really need to provide these functions, not just as
the transaction callbacks for postgres_fdw.


(2)
How are data modifications tracked in local and remote transactions?  0001 seems to handle local INSERT/DELETE/UPDATE.
In particular:

* COPY FROM to local/remote tables/views.

* User-defined function calls that modify data, e.g. SELECT func1() WHERE col = func2()


(3)
Does the 2PC processing always go through the background worker?
Is group commit effective on the remote server?  That is, are PREPARE and COMMIT PREPARED issued from multiple remote
sessions written to WAL in a batch?


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/07/14 9:08, Masahiro Ikeda wrote:
> >> I've attached the latest version patches. I've incorporated the review
> >> comments I got so far and improved locking strategy.
> >
> > Thanks for updating the patch!
>
> +1
> I'm interested in these patches and now studying them. While checking
> the behaviors of the patched PostgreSQL, I got three comments.

Thank you for testing this patch!

>
> 1. We can access to the foreign table even during recovery in the HEAD.
> But in the patched version, when I did that, I got the following error.
> Is this intentional?
>
> ERROR:  cannot assign TransactionIds during recovery

No, it should be fixed. I'm going to fix this by not collecting
participants for atomic commit during recovery.

>
> 2. With the patch, when INSERT/UPDATE/DELETE are executed both in
> local and remote servers, 2PC is executed at the commit phase. But
> when write SQL (e.g., TRUNCATE) except INSERT/UPDATE/DELETE are
> executed in local and INSERT/UPDATE/DELETE are executed in remote,
> 2PC is NOT executed. Is this safe?

Hmm, you're right. I think atomic commit must be used also when the
user executes other write SQLs such as TRUNCATE, COPY, CLUSTER, and
CREATE TABLE on the local node.

>
> 3. XACT_FLAGS_WROTENONTEMPREL is set when INSERT/UPDATE/DELETE
> are executed. But it's not reset even when those queries are canceled by
> ROLLBACK TO SAVEPOINT. This may cause unnecessary 2PC at the commit phase.

Will fix.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
On 2020-07-16 13:16, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> 
> wrote:
>> 
>> > I've attached the latest version patches. I've incorporated the review
>> > comments I got so far and improved locking strategy.
>> 
>> I want to ask a question about streaming replication with 2PC.
>> Are you going to support 2PC with streaming replication?
>> 
>> I tried streaming replication using v23 patches.
>> I confirmed that 2PC works with streaming replication,
>> where there are primary/standby coordinators.
>> 
>> But, in my understanding, the WAL of "PREPARE" and
>> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in
>> sync.
>> 
>> If this is right, unresolved transactions can occur.
>> 
>> For example,
>> 
>> 1. PREPARE is done
>> 2. crash primary before the WAL related to PREPARE is
>>     replicated to the standby server
>> 3. promote standby server // but can't execute "ABORT PREPARED"
>> 
>> In the above case, the remote server has an unresolved transaction.
>> Can we solve this problem to support in-sync replication?
>> 
>> But, I think some users use async replication for performance.
>> Do we need to document the limitation or make another solution?
>> 
> 
> IIUC with synchronous replication, we can guarantee that WAL records
> are written on both primary and replicas when the client gets an
> acknowledgment of commit. We don't replicate each WAL record
> generated during the transaction one by one in sync. In the case you
> described, the client will get an error due to the server crash.
> Therefore I think the user cannot expect that WAL records generated
> so far have been replicated. The same issue could happen also when
> the user executes PREPARE TRANSACTION and the server crashes.

Thanks! I hadn't noticed that the behavior when a user executes PREPARE
TRANSACTION is the same.

IIUC with 2PC, there is a difference between (1) PREPARE TRANSACTION
and (2) 2PC: whether the client can know, when the server crashes,
which transactions are in doubt and their global transaction IDs.

If (1) PREPARE TRANSACTION fails, it's OK for the client to execute the
same command again, because if the remote server has already prepared
the transaction, the command will be ignored.

But if (2) 2PC fails because of a coordinator crash, the client can't
know what operations should be done.

If the old coordinator already executed PREPARE, there are some
transactions which should be ABORT PREPARED. But if the PREPARE WAL
was not sent to the standby, the new coordinator can't execute ABORT
PREPARED. And the client can't know which remote servers have prepared
transactions that should be aborted, either.

Even if the client could know that, only the old coordinator knows the
global transaction IDs. Only the database administrator can analyze the
old coordinator's log and then execute the appropriate commands
manually, right?


> To prevent this
> issue, I think we would need to send each WAL records in sync but I'm
> not sure it's reasonable behavior, and as long as we write WAL in the
> local and then send it to replicas we would need a smart mechanism to
> prevent this situation.

I agree. Sending each 2PC WAL record in sync would have a large
performance impact.
At least, we need to document the limitation and how to handle this
situation.


> Related to the pointing out by Ikeda-san, I realized that with the
> current patch the backend waits for synchronous replication and then
> waits for foreign transaction resolution. But it should be reversed.
> Otherwise, it could lead to data loss even when the client got an
> acknowledgment of commit. Also, when the user is using both atomic
> commit and synchronous replication and wants to cancel waiting, he/she
> will need to press ctl-c twice with the current patch, which also
> should be fixed.

I'm sorry, but I couldn't understand that.

In my understanding, if the COMMIT WAL record is replicated to the
standby in sync, the standby server can resolve the transaction after
crash recovery in the promoted phase.

If the order is reversed, there are situations in which atomic commit
can't be guaranteed. If some foreign transaction resolutions succeed
but others fail (and the COMMIT WAL record is not replicated), the
standby must ABORT PREPARED because the COMMIT WAL record was not
replicated. This means that some foreign transactions get COMMIT
PREPARED executed by the primary coordinator, while other foreign
transactions get ABORT PREPARED executed by the secondary coordinator.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 16 Jul 2020 at 13:53, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> Hi Sawada san,
>
>
> I'm reviewing this patch series, and let me give some initial comments and questions.  I'm looking at this with a
hope that this will be useful purely as an FDW enhancement for our new use cases, regardless of whether the FDW will be
used for Postgres scale-out. 

Thank you for reviewing this patch!

Yes, this patch is trying to resolve the generic atomic commit problem
w.r.t. FDW, and will be useful also for Postgres scale-out.

>
> I don't think it's necessarily required to combine 2PC with the global visibility.  The X/Open XA specification only
handles the atomic commit.  The only part in the XA specification that refers to global visibility is the following: 
>
>
> [Quote from XA specification]
> --------------------------------------------------
> 2.3.2 Protocol Optimisations
> ・ Read-only
> An RM can respond to the TM’s prepare request by asserting that the RM was not
> asked to update shared resources in this transaction branch. This response
> concludes the RM’s involvement in the transaction; the Phase 2 dialogue between
> the TM and this RM does not occur. The TM need not stably record, in its list of
> participating RMs, an RM that asserts a read-only role in the global transaction.
>
> However, if the RM returns the read-only optimisation before all work on the global
> transaction is prepared, global serialisability1 cannot be guaranteed. This is because
> the RM may release transaction context, such as read locks, before all application
> activity for that global transaction is finished.
>
> 1.
> Serialisability is a property of a set of concurrent transactions. For a serialisable set of transactions, at least
one
> serial sequence of the transactions exists that produces identical results, with respect to shared resources, as does
> concurrent execution of the transaction.
> --------------------------------------------------
>

Agreed.

>
> (1)
> Do other popular DBMSs (Oracle, MySQL, etc.)  provide concrete functions that can be used for the new FDW
commit/rollback/prepare API?  I'm asking this to confirm that we really need to provide these functions, not as the
transaction callbacks for postgres_fdw. 
>

I have briefly checked only oracle_fdw, but in general I think that
if an existing FDW supports transaction begin, commit, and rollback,
these can be ported to the new FDW transaction APIs easily.

Regarding the comparison between FDW transaction APIs and transaction
callbacks, I think one of the benefits of providing FDW transaction
APIs is that the core is able to manage the status of foreign
transactions. We need to track the status of individual foreign
transactions to support atomic commit. If we use transaction callbacks
(XactCallback), which many FDWs are using, I think we will end up
calling the transaction callback and leaving the transaction work to
FDWs, meaning the core is not able to know the return value of
PREPARE TRANSACTION, for example. We could add more arguments to the
transaction callbacks to get the return value from FDWs, but I don't
think that's a good idea, as transaction callbacks are used not only
by FDWs but also by other external modules.

>
> (2)
> How are data modifications tracked in local and remote transactions?  0001 seems to handle local
INSERT/DELETE/UPDATE. Especially: 
>
> * COPY FROM to local/remote tables/views.
>
> * User-defined function calls that modify data, e.g. SELECT func1() WHERE col = func2()
>

With the current version patch (v23), it supports only
INSERT/DELETE/UPDATE. But I'm going to change the patch so that it
supports other write SQL commands, as Fujii-san also pointed out.

>
> (3)
> Does the 2PC processing always go through the background worker?
> Is the group commit effective on the remote server? That is, are PREPARE and COMMIT PREPARED issued from multiple remote
sessions written to WAL in batch? 

No, in the current design, the backend that received a query from the
client does PREPARE, and then the transaction resolver process, a
background worker, does COMMIT PREPARED.
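To make the division of labor concrete, here is a rough Python sketch of the flow described above. The class and function names are purely illustrative (they are not the patch's actual API); the point is that the backend only runs phase 1 and hands phase 2 to the resolver:

```python
# Hypothetical model of the commit flow: the backend issues PREPARE
# TRANSACTION on every participant, and a separate resolver process
# later finishes each transaction with COMMIT PREPARED.

class ForeignServer:
    def __init__(self, name):
        self.name = name
        self.prepared = {}      # fdwxact id -> state
        self.committed = set()

    def prepare(self, xid):
        self.prepared[xid] = "prepared"
        return True

    def commit_prepared(self, xid):
        # Tolerate an "undefined prepared transaction", as the thread
        # requires of FDWs for retries after a failover.
        if xid in self.prepared:
            del self.prepared[xid]
            self.committed.add(xid)

def backend_commit(servers, xid, resolver_queue):
    # Phase 1: run by the backend that received the query.
    for s in servers:
        if not s.prepare(xid):
            raise RuntimeError("prepare failed; must roll back all")
    # Hand off phase 2 to the transaction resolver worker.
    resolver_queue.append(xid)

def resolver_run(servers, resolver_queue):
    # Phase 2: run asynchronously by the resolver process.
    while resolver_queue:
        xid = resolver_queue.pop(0)
        for s in servers:
            s.commit_prepared(xid)
```

The client's commit returns once phase 1 succeeds and the handoff is durable; the resolver drains the queue independently.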

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I have briefly checked only oracle_fdw, but in general I think that
> if an existing FDW supports transaction begin, commit, and rollback,
> these can be ported to new FDW transaction APIs easily.

Does oracle_fdw support begin, commit and rollback?

And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction?  In other
words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC
processing?  I think we need to confirm that there are concrete examples.
 

What I'm worried about is that if only postgres_fdw can implement the prepare function, it's a sign that the FDW interface will
be riddled with functions only for Postgres.  That is, the FDW interface is getting away from its original purpose,
"access external data as a relation", and becoming complex.  Tomas Vondra showed this concern as follows:
 

Horizontal scalability/sharding 

https://www.postgresql.org/message-id/flat/CANP8%2BjK%3D%2B3zVYDFY0oMAQKQVJ%2BqReDHr1UPdyFEELO82yVfb9A%40mail.gmail.com#2c45f0ee97855449f1f7fedcef1d5e11


[Tomas Vondra's remarks]
--------------------------------------------------
> This strikes me as a bit of a conflict of interest with FDW which
> seems to want to hide the fact that it's foreign; the FDW
> implementation makes its own optimization decisions which might
> make sense for single table queries but breaks down in the face of
> joins.

+1 to these concerns

In my mind, FDW is a wonderful tool to integrate PostgreSQL with 
external data sources, and it's nicely shaped for this purpose, which 
implies the abstractions and assumptions in the code.

The truth however is that many current uses of the FDW API are actually 
using it for different purposes because there's no other way to do that, 
not because FDWs are the "right way". And this includes the attempts to 
build sharding on FDW, I think.

Situations like this result in "improvements" of the API that seem to 
improve the API for the second group, but make the life harder for the 
original FDW API audience by making the API needlessly complex. And I 
say "seem to improve" because the second group eventually runs into the 
fundamental abstractions and assumptions the API is based on anyway.

And based on the discussions at pgcon, I think this is the main reason 
why people cringe when they hear "FDW" and "sharding" in the same sentence.

...
My other worry is that we'll eventually mess the FDW infrastructure, 
making it harder to use for the original purpose. Granted, most of the 
improvements proposed so far look sane and useful for FDWs in general, 
but sooner or later that ceases to be the case - there will be changes 
needed merely for the sharding. Those will be tough decisions.
--------------------------------------------------


> Regarding the comparison between FDW transaction APIs and transaction
> callbacks, I think one of the benefits of providing FDW transaction
> APIs is that the core is able to manage the status of foreign
> transactions. We need to track the status of individual foreign
> transactions to support atomic commit. If we use transaction callbacks
> (XactCallback) that many FDWs are using, I think we will end up
> calling the transaction callback and leave the transaction work to
> FDWs, leading that the core is not able to know the return values of
> PREPARE TRANSACTION for example. We can add more arguments passed to
> transaction callbacks to get the return value from FDWs but I don’t
> think it’s a good idea as transaction callbacks are used not only by
> FDW but also other external modules.

To track the foreign transaction status, we can add GetTransactionStatus() to the FDW interface as an alternative,
can't we?
 


> With the current version patch (v23), it supports only
> INSERT/DELETE/UPDATE. But I'm going to change the patch so that it
> supports other writes SQLs as Fujii-san also pointed out.

OK.  I've just read that Fujii san already pointed out a similar thing.  But I wonder if we can know that the UDF
executed on the foreign server has updated data.  Maybe we can know or guess it by calling txid_current_if_any() or
checking the transaction status in the FE/BE protocol, but can we deal with FDWs other than postgres_fdw?
 


> No, in the current design, the backend who received a query from the
> client does PREPARE, and then the transaction resolver process, a
> background worker, does COMMIT PREPARED.

This "No" means the current implementation cannot group commits from multiple transactions?
Does the transaction resolver send COMMIT PREPARED and waits for its response for each transaction one by one?  For
example,

[local server]
Transaction T1 and T2 performs 2PC at the same time.
Transaction resolver sends COMMIT PREPARED for T1 and then waits for the response.
T1 writes COMMIT PREPARED record locally and sync the WAL.
Transaction resolver sends COMMIT PREPARED for T2 and then waits for the response.
T2 writes COMMIT PREPARED record locally and sync the WAL.

[foreign server]
T1 writes COMMIT PREPARED record locally and sync the WAL.
T2 writes COMMIT PREPARED record locally and sync the WAL.

If the WAL records of multiple concurrent transactions are written and synced separately, i.e. group commit doesn't
take effect, then the OLTP transaction performance will be unacceptable.
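The cost difference being described can be sketched in a few lines of Python. This is only an illustration of the WAL-flush counting, not the server's actual group commit machinery:

```python
# Why grouping matters: resolving transactions one by one pays one WAL
# flush (fsync) per transaction, while a batched resolver can flush
# once per batch of ready transactions.

class WAL:
    def __init__(self):
        self.records = []
        self.flushes = 0

    def write(self, rec):
        self.records.append(rec)

    def flush(self):
        # Stands in for an fsync of the WAL up to the current insert point.
        self.flushes += 1

def resolve_one_by_one(wal, xids):
    for xid in xids:
        wal.write(("COMMIT PREPARED", xid))
        wal.flush()                # one sync per transaction

def resolve_grouped(wal, xids):
    for xid in xids:
        wal.write(("COMMIT PREPARED", xid))
    wal.flush()                    # single sync covers the whole batch
```

Both strategies write the same records; the batched one amortizes the sync cost across all concurrently resolving transactions.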
 


Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

From
Laurenz Albe
Date:
On Fri, 2020-07-17 at 05:21 +0000, tsunakawa.takay@fujitsu.com wrote:
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > I have briefly checked only oracle_fdw, but in general I think that
> > if an existing FDW supports transaction begin, commit, and rollback,
> > these can be ported to new FDW transaction APIs easily.
> 
> Does oracle_fdw support begin, commit and rollback?

Yes.

> And most importantly, do other major DBMSs, including Oracle, provide the API for
> preparing a transaction?  In other words, will the FDWs other than postgres_fdw
> really be able to take advantage of the new FDW functions to join the 2PC processing?
> I think we need to confirm that there are concrete examples.

I bet they do.  There is even a standard for that.

I am not looking forward to adapting oracle_fdw, and I didn't read the patch.

But using distributed transactions is certainly a good thing if it is done right.

The trade off is the need for a transaction manager, and implementing that
correctly is a high price to pay.

Yours,
Laurenz Albe




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> On 2020-07-16 13:16, Masahiko Sawada wrote:
> > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com>
> > wrote:
> >>
> >> > I've attached the latest version patches. I've incorporated the review
> >> > comments I got so far and improved locking strategy.
> >>
> >> I want to ask a question about streaming replication with 2PC.
> >> Are you going to support 2PC with streaming replication?
> >>
> >> I tried streaming replication using v23 patches.
> >> I confirmed that 2PC works with streaming replication,
> >> where there are primary/standby coordinators.
> >>
> >> But, in my understanding, the WAL of "PREPARE" and
> >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in
> >> sync.
> >>
> >> If this is right, unresolved transactions can occur.
> >>
> >> For example,
> >>
> >> 1. PREPARE is done
> >> 2. crash primary before the WAL related to PREPARE is
> >>     replicated to the standby server
> >> 3. promote standby server // but can't execute "ABORT PREPARED"
> >>
> >> In the above case, the remote server has an unresolved transaction.
> >> Can we solve this problem to support in-sync replication?
> >>
> >> But, I think some users use async replication for performance.
> >> Do we need to document the limitation or make another solution?
> >>
> >
> > IIUC with synchronous replication, we can guarantee that WAL records
> > are written on both primary and replicas when the client gets an
> > acknowledgment of commit. We don't replicate each WAL record
> > generated during the transaction one by one in sync. In the case you
> > described, the client will get an error due to the server crash.
> > Therefore I think the user cannot expect that WAL records generated
> > so far have been replicated. The same issue could happen also when
> > the user executes PREPARE TRANSACTION and the server crashes.
>
> Thanks! I hadn't noticed that the behavior when a user executes PREPARE
> TRANSACTION is the same.
>
> IIUC with 2PC, there is a difference between (1) PREPARE TRANSACTION
> and (2) 2PC: whether the client can know, when the server crashes,
> which transactions are in doubt and their global transaction IDs.
>
> If (1) PREPARE TRANSACTION fails, it's OK for the client to execute
> the same command again, because if the remote server has already
> prepared the transaction, the command will be ignored.
>
> But if (2) 2PC fails because of a coordinator crash, the client can't
> know what operations should be done.
>
> If the old coordinator already executed PREPARE, there are some
> transactions which should be ABORT PREPARED. But if the PREPARE WAL
> was not sent to the standby, the new coordinator can't execute ABORT
> PREPARED. And the client can't know which remote servers have
> prepared transactions that should be aborted, either.
>
> Even if the client could know that, only the old coordinator knows
> the global transaction IDs. Only the database administrator can
> analyze the old coordinator's log and then execute the appropriate
> commands manually, right?

I think that's right. In the case of a coordinator crash, the user
can look for orphaned foreign prepared transactions by checking the
'identifier' column of pg_foreign_xacts on the new standby server and
the prepared transactions on the remote servers.

>
>
> > To prevent this
> > issue, I think we would need to send each WAL records in sync but I'm
> > not sure it's reasonable behavior, and as long as we write WAL in the
> > local and then send it to replicas we would need a smart mechanism to
> > prevent this situation.
>
> I agree. Sending each 2PC WAL record in sync would have a large
> performance impact.
> At least, we need to document the limitation and how to handle this
> situation.

Ok. I'll add it.

>
>
> > Related to the pointing out by Ikeda-san, I realized that with the
> > current patch the backend waits for synchronous replication and then
> > waits for foreign transaction resolution. But it should be reversed.
> > Otherwise, it could lead to data loss even when the client got an
> > acknowledgment of commit. Also, when the user is using both atomic
> > commit and synchronous replication and wants to cancel waiting, he/she
> > will need to press ctl-c twice with the current patch, which also
> > should be fixed.
>
> I'm sorry, but I couldn't understand that.
>
> In my understanding, if the COMMIT WAL record is replicated to the
> standby in sync, the standby server can resolve the transaction after
> crash recovery in the promoted phase.
>
> If the order is reversed, there are situations in which atomic commit
> can't be guaranteed. If some foreign transaction resolutions succeed
> but others fail (and the COMMIT WAL record is not replicated), the
> standby must ABORT PREPARED because the COMMIT WAL record was not
> replicated. This means that some foreign transactions get COMMIT
> PREPARED executed by the primary coordinator, while other foreign
> transactions get ABORT PREPARED executed by the secondary
> coordinator.

You're right. Thank you for pointing out!

If the coordinator crashes after the client gets acknowledgment of the
successful commit of the transaction but before sending the
XLOG_FDWXACT_REMOVE record to the replicas, the FdwXact entries are
left on the replicas even after failover. But since we require FDWs to
tolerate the error of undefined prepared transactions in
COMMIT/ROLLBACK PREPARED, it won't be a critical problem.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Laurenz Albe <laurenz.albe@cybertec.at>
> On Fri, 2020-07-17 at 05:21 +0000, tsunakawa.takay@fujitsu.com wrote:
> > And most importantly, do other major DBMSs, including Oracle, provide the
> API for
> > preparing a transaction?  In other words, will the FDWs other than
> postgres_fdw
> > really be able to take advantage of the new FDW functions to join the 2PC
> processing?
> > I think we need to confirm that there are concrete examples.
> 
> I bet they do.  There is even a standard for that.

If you're thinking of xa_prepare() defined in the X/Open XA specification, we need to be sure that other FDWs can
really utilize this new 2PC mechanism.  What I'm especially wondering is when the FDW can call xa_start().
 


Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > I have briefly checked only oracle_fdw, but in general I think that
> > if an existing FDW supports transaction begin, commit, and rollback,
> > these can be ported to new FDW transaction APIs easily.
>
> Does oracle_fdw support begin, commit and rollback?
>
> And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction?  In other
words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC
processing?  I think we need to confirm that there are concrete examples. 

I also believe they do. But I'm concerned that some FDWs need to start
a transaction differently when using 2PC. For instance, IIUC MySQL
also supports 2PC, but the transaction needs to be started with "XA
START id" when the transaction needs to be prepared. The transaction
started with XA START can be closed by XA END followed by XA PREPARE
or XA COMMIT ONE PHASE. It means that when starting a new transaction,
the FDW needs to prepare the transaction identifier and to know that
2PC might be used. It's quite different from PostgreSQL. In
PostgreSQL, we can start a transaction with BEGIN and end it with
PREPARE TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier
is required only at PREPARE TRANSACTION.

With MySQL, I guess the FDW needs a way to tell that the (next)
transaction needs to be started with XA START so it can be prepared.
It could be a custom GUC or an SQL function. Then, when starting a new
transaction on the MySQL server, the FDW can generate and store a
transaction identifier somewhere alongside the connection. At the
prepare phase, it passes the transaction identifier to the core via
the GetPrepareId() API.

I haven't tested the above yet and it's just a desk plan. It's
definitely a good idea to try integrating this 2PC feature into FDWs
other than postgres_fdw to see whether the design and interfaces are
sound.
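The protocol difference described above can be modeled as a tiny state machine. This is a desk-level Python sketch following the MySQL XA statements loosely (it is not mysql_fdw code, and the state names are simplified from the XA spec):

```python
# MySQL requires the xid up front at XA START, before any work runs,
# whereas PostgreSQL needs an identifier only at PREPARE TRANSACTION.
# Model the MySQL side as state -> {command: next_state}.

MYSQL_XA = {
    "NOTX":     {"XA START": "ACTIVE"},
    "ACTIVE":   {"XA END": "IDLE"},
    "IDLE":     {"XA PREPARE": "PREPARED",
                 "XA COMMIT ONE PHASE": "NOTX"},
    "PREPARED": {"XA COMMIT": "NOTX",
                 "XA ROLLBACK": "NOTX"},
}

def run_sequence(machine, commands, start="NOTX"):
    # Walk the state machine, rejecting any command that is not
    # legal in the current state (e.g. XA PREPARE without XA END).
    state = start
    for cmd in commands:
        if cmd not in machine[state]:
            raise ValueError("%s not allowed in state %s" % (cmd, state))
        state = machine[state][cmd]
    return state
```

Running the full two-phase sequence (XA START, XA END, XA PREPARE, XA COMMIT) returns to the no-transaction state, while attempting XA PREPARE straight from the active state is rejected, which is exactly why the FDW must know before XA START that 2PC might be used.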

>
> What I'm worried about is that if only postgres_fdw can implement the prepare function, it's a sign that the FDW interface will
be riddled with functions only for Postgres.  That is, the FDW interface is getting away from its original purpose,
"access external data as a relation", and becoming complex.  Tomas Vondra showed this concern as follows: 
>
> Horizontal scalability/sharding
>
https://www.postgresql.org/message-id/flat/CANP8%2BjK%3D%2B3zVYDFY0oMAQKQVJ%2BqReDHr1UPdyFEELO82yVfb9A%40mail.gmail.com#2c45f0ee97855449f1f7fedcef1d5e11
>
>
> [Tomas Vondra's remarks]
> --------------------------------------------------
> > This strikes me as a bit of a conflict of interest with FDW which
> > seems to want to hide the fact that it's foreign; the FDW
> > implementation makes its own optimization decisions which might
> > make sense for single table queries but breaks down in the face of
> > joins.
>
> +1 to these concerns
>
> In my mind, FDW is a wonderful tool to integrate PostgreSQL with
> external data sources, and it's nicely shaped for this purpose, which
> implies the abstractions and assumptions in the code.
>
> The truth however is that many current uses of the FDW API are actually
> using it for different purposes because there's no other way to do that,
> not because FDWs are the "right way". And this includes the attempts to
> build sharding on FDW, I think.
>
> Situations like this result in "improvements" of the API that seem to
> improve the API for the second group, but make the life harder for the
> original FDW API audience by making the API needlessly complex. And I
> say "seem to improve" because the second group eventually runs into the
> fundamental abstractions and assumptions the API is based on anyway.
>
> And based on the discussions at pgcon, I think this is the main reason
> why people cringe when they hear "FDW" and "sharding" in the same sentence.
>
> ...
> My other worry is that we'll eventually mess the FDW infrastructure,
> making it harder to use for the original purpose. Granted, most of the
> improvements proposed so far look sane and useful for FDWs in general,
> but sooner or later that ceases to be the case - there will be changes
> needed merely for the sharding. Those will be tough decisions.
> --------------------------------------------------
>
>
> > Regarding the comparison between FDW transaction APIs and transaction
> > callbacks, I think one of the benefits of providing FDW transaction
> > APIs is that the core is able to manage the status of foreign
> > transactions. We need to track the status of individual foreign
> > transactions to support atomic commit. If we use transaction callbacks
> > (XactCallback) that many FDWs are using, I think we will end up
> > calling the transaction callback and leave the transaction work to
> > FDWs, leading that the core is not able to know the return values of
> > PREPARE TRANSACTION for example. We can add more arguments passed to
> > transaction callbacks to get the return value from FDWs but I don’t
> > think it’s a good idea as transaction callbacks are used not only by
> > FDW but also other external modules.
>
> To track the foreign transaction status, we can add GetTransactionStatus() to the FDW interface as an alternative,
can't we? 

I haven't thought about such an interface, but it sounds like the
transaction status would be managed by both the core and FDWs. Could
you elaborate on that?

>
>
> > With the current version patch (v23), it supports only
> > INSERT/DELETE/UPDATE. But I'm going to change the patch so that it
> > supports other writes SQLs as Fujii-san also pointed out.
>
> OK.  I've just read that Fujii san already pointed out a similar thing.  But I wonder if we can know that the UDF
executed on the foreign server has updated data.  Maybe we can know or guess it by calling txid_current_if_any() or
checking the transaction status in the FE/BE protocol, but can we deal with FDWs other than postgres_fdw? 

Ah, my answer was not enough. It was only about tracking local writes.

Regarding tracking of writes on the foreign server, I think there are
restrictions. Currently, the executor registers a foreign server as a
participant of 2PC before calling BeginForeignInsert(),
BeginForeignModify(), BeginForeignScan(), etc., with a flag indicating
whether writes are going to happen on the foreign server. So even if a
UDF in a SELECT statement that could update data were to be pushed
down to the foreign server, the foreign server would be marked as
*not* modified. I've not tested it yet, but I guess that since an FDW
is also allowed to register the foreign server along with that flag
anytime before commit, the FDW is able to forcibly change that flag if
it knows the SELECT query is going to modify data on the remote
server.
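The registration behavior described above can be sketched as follows. The function name and the dictionary are made up for illustration (the patch's actual registration API may differ); the point is that re-registration may set the modified flag but never clear it:

```python
# Sketch: the executor (or the FDW itself, any time before commit)
# registers a participant with a "modified" flag. A later registration
# can only widen the flag from False to True, mirroring the idea that
# an FDW can forcibly mark a server as modified when it knows a pushed
# down SELECT will write data.

participants = {}

def register_fdwxact(server, modified):
    # OR the new flag into the existing one; False never overwrites True.
    participants[server] = participants.get(server, False) or modified
```

So a server first registered as read-only by the executor can still be upgraded to "modified" by the FDW before commit, and a redundant read-only registration cannot downgrade it.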

>
>
> > No, in the current design, the backend who received a query from the
> > client does PREPARE, and then the transaction resolver process, a
> > background worker, does COMMIT PREPARED.
>
> This "No" means the current implementation cannot group commits from multiple transactions?

Yes.

> Does the transaction resolver send COMMIT PREPARED and waits for its response for each transaction one by one?  For
example,
>
> [local server]
> Transaction T1 and T2 performs 2PC at the same time.
> Transaction resolver sends COMMIT PREPARED for T1 and then waits for the response.
> T1 writes COMMIT PREPARED record locally and sync the WAL.
> Transaction resolver sends COMMIT PREPARED for T2 and then waits for the response.
> T2 writes COMMIT PREPARED record locally and sync the WAL.
>
> [foreign server]
> T1 writes COMMIT PREPARED record locally and sync the WAL.
> T2 writes COMMIT PREPARED record locally and sync the WAL.

Just to be clear, the transaction resolver writes FDWXACT_REMOVE
records instead of COMMIT PREPARED records to remove the foreign
transaction entries. But, yes, the transaction resolver works like you
explained above.

> If the WAL records of multiple concurrent transactions are written and synced separately, i.e. group commit doesn't
take effect, then the OLTP transaction performance will be unacceptable. 

I agree that it would be a large performance penalty. I'd like to have
it, but I'm not sure we should have it in the first version from the
perspective of complexity. Since 2PC is inherently expensive, in my
opinion users should avoid it as much as possible for performance
reasons. Especially in OLTP, its cost will directly affect latency.
I'd suggest designing the database schema so that a transaction
touches only one foreign server. But do you have a concrete OLTP use
case that normally requires 2PC, and how many servers are involved in
a distributed transaction?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/07/17 20:04, Masahiko Sawada wrote:
> On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
>>
>> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
>>> I have briefly checked only oracle_fdw, but in general I think that
>>> if an existing FDW supports transaction begin, commit, and rollback,
>>> these can be ported to new FDW transaction APIs easily.
>>
>> Does oracle_fdw support begin, commit and rollback?
>>
>> And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction?  In other
words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC
processing?  I think we need to confirm that there are concrete examples.
 
> 
> I also believe they do. But I'm concerned that some FDW needs to start
> a transaction differently when using 2PC. For instance, IIUC MySQL
> also supports 2PC but the transaction needs to be started with "XA
> START id” when the transaction needs to be prepared. The transaction
> started with XA START can be closed by XA END followed by XA PREPARE
> or XA COMMIT ONE PHASE.

This means that the FDW should also provide an API for xa_end()?
Maybe we need to reconsider which APIs we should provide in FDW,
based on the XA specification?


> It means that when starts a new transaction
> the transaction needs to prepare the transaction identifier and to
> know that 2PC might be used. It’s quite different from PostgreSQL. In
> PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE
> TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is
> required when PREPARE TRANSACTION.
> 
> With MySQL, I guess FDW needs a way to tell the (next) transaction
> needs to be started with XA START so it can be prepared. It could be a
> custom GUC or an SQL function. Then when starts a new transaction on
> MySQL server, FDW can generate and store a transaction identifier into
> somewhere alongside the connection. At the prepare phase, it passes
> the transaction identifier via GetPrepareId() API to the core.
> 
> I haven’t tested the above yet and it’s just a desk plan. it's
> definitely a good idea to try integrating this 2PC feature to FDWs
> other than postgres_fdw to see if design and interfaces are
> implemented sophisticatedly.

With the current patch, we track whether write queries are executed
in each server. Then, if the number of servers that execute write queries
is less than two, 2PC is skipped. This "optimization" is not necessary
(cannot be applied) when using mysql_fdw because the transaction starts
with XA START. Right?

If that's the "optimization" only for postgres_fdw, maybe it's better to
get rid of that "optimization" from the first patch, to make the patch simpler.
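To make the discussed "optimization" concrete, here is a rough sketch in Python (hypothetical code, not the patch itself) of the decision it implies: skip PREPARE and fall back to an ordinary one-phase commit whenever fewer than two participating servers executed write queries.

```python
# Hypothetical sketch of the "skip 2PC" decision discussed above.
# A participant is a server touched by the transaction; only servers
# that executed write queries count toward requiring two-phase commit.

def needs_two_phase(participants):
    """participants: list of dicts like {"server": str, "wrote": bool}."""
    writers = [p for p in participants if p["wrote"]]
    # With zero or one writing server, a plain one-phase COMMIT is
    # already atomic, so the PREPARE round can be skipped.
    return len(writers) >= 2

# Reads on one server plus writes on another: still only one writer.
assert not needs_two_phase([{"server": "s1", "wrote": False},
                            {"server": "s2", "wrote": True}])
# Two writing servers: PREPARE is required on both.
assert needs_two_phase([{"server": "s1", "wrote": True},
                        {"server": "s2", "wrote": True}])
```

As noted above, this shortcut presupposes the transaction was started in a way that still allows a one-phase commit, which is why it may not carry over to an XA START-based FDW.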

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/07/16 14:47, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>> I've attached the latest version patches. I've incorporated the review
>>>> comments I got so far and improved locking strategy.
>>>
>>> Thanks for updating the patch!
>>
>> +1
>> I'm interested in these patches and now studying them. While checking
>> the behaviors of the patched PostgreSQL, I got three comments.
> 
> Thank you for testing this patch!
> 
>>
>> 1. We can access to the foreign table even during recovery in the HEAD.
>> But in the patched version, when I did that, I got the following error.
>> Is this intentional?
>>
>> ERROR:  cannot assign TransactionIds during recovery
> 
> No, it should be fixed. I'm going to fix this by not collecting
> participants for atomic commit during recovery.

Thanks for trying to fix the issues!

I'd like to report one more issue. When I started a new transaction
on the local server, executed an INSERT on the remote server via
postgres_fdw, and then quit psql, I got the following assertion failure.

TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
3   postgres                            0x000000010d313fe3 shmem_exit + 179
4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
5   postgres                            0x000000010d313da3 proc_exit + 19
6   postgres                            0x000000010d35112f PostgresMain + 3711
7   postgres                            0x000000010d27bb3a BackendRun + 570
8   postgres                            0x000000010d27af6b BackendStartup + 475
9   postgres                            0x000000010d279ed1 ServerLoop + 593
10  postgres                            0x000000010d277940 PostmasterMain + 6016
11  postgres                            0x000000010d1597b9 main + 761
12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
13  ???                                 0x0000000000000003 0x0 + 3

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
On 2020-07-17 15:55, Masahiko Sawada wrote:
> On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda <ikedamsh@oss.nttdata.com> 
> wrote:
>> 
>> On 2020-07-16 13:16, Masahiko Sawada wrote:
>> > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com>
>> > wrote:
>> >>
>> >> > I've attached the latest version patches. I've incorporated the review
>> >> > comments I got so far and improved locking strategy.
>> >>
>> >> I want to ask a question about streaming replication with 2PC.
>> >> Are you going to support 2PC with streaming replication?
>> >>
>> >> I tried streaming replication using v23 patches.
>> >> I confirm that 2PC works with streaming replication,
>> >> which there are primary/standby coordinator.
>> >>
>> >> But, in my understanding, the WAL of "PREPARE" and
>> >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in
>> >> sync.
>> >>
>> >> If this is right, the unresolved transaction can be occurred.
>> >>
>> >> For example,
>> >>
>> >> 1. PREPARE is done
>> >> 2. crash primary before the WAL related to PREPARE is
>> >>     replicated to the standby server
>> >> 3. promote standby server // but can't execute "ABORT PREPARED"
>> >>
>> >> In above case, the remote server has the unresolved transaction.
>> >> Can we solve this problem to support in-sync replication?
>> >>
>> >> But, I think some users use async replication for performance.
>> >> Do we need to document the limitation or make another solution?
>> >>
>> >
>> > IIUC with synchronous replication, we can guarantee that WAL records
>> > are written on both primary and replicas when the client got an
>> > acknowledgment of commit. We don't replicate each WAL records
>> > generated during transaction one by one in sync. In the case you
>> > described, the client will get an error due to the server crash.
>> > Therefore I think the user cannot expect WAL records generated so far
>> > has been replicated. The same issue could happen also when the user
>> > executes PREPARE TRANSACTION and the server crashes.
>> 
>> Thanks! I didn't noticed the behavior when a user executes PREPARE
>> TRANSACTION is same.
>> 
>> IIUC with 2PC, there is a different point between (1)PREPARE 
>> TRANSACTION
>> and (2)2PC.
>> The point is that whether the client can know when the server crashed
>> and it's global tx id.
>> 
>> If (1)PREPARE TRANSACTION is failed, it's ok the client execute same
>> command
>> because if the remote server is already prepared the command will be
>> ignored.
>> 
>> But, if (2)2PC is failed with coordinator crash, the client can't know
>> what operations should be done.
>> 
>> If the old coordinator already executed PREPARED, there are some
>> transaction which should be ABORT PREPARED.
>> But if the PREPARED WAL is not sent to the standby, the new 
>> coordinator
>> can't execute ABORT PREPARED.
>> And the client can't know which remote servers have PREPARED
>> transactions which should be ABORTED either.
>> 
>> Even if the client can know that, only the old coordinator knows its
>> global transaction id.
>> Only the database administrator can analyze the old coordinator's log
>> and then execute the appropriate commands manually, right?
> 
> I think that's right. In the case of the coordinator crash, the user
> can look orphaned foreign prepared transactions by checking the
> 'identifier' column of pg_foreign_xacts on the new standby server and
> the prepared transactions on the remote servers.

I think there is a case where we can't see an orphaned foreign
prepared transaction in the pg_foreign_xacts view on the new standby
server, which would confuse users and database administrators.

If the primary coordinator crashes after preparing a foreign transaction,
but before sending the XLOG_FDWXACT_INSERT record to the standby server,
the standby server can't restore the transaction status, and the
pg_foreign_xacts view doesn't show the prepared foreign transaction.

Sending XLOG_FDWXACT_INSERT records asynchronously leads to this problem.

>> > To prevent this
>> > issue, I think we would need to send each WAL records in sync but I'm
>> > not sure it's reasonable behavior, and as long as we write WAL in the
>> > local and then send it to replicas we would need a smart mechanism to
>> > prevent this situation.
>> 
>> I agree. To send each 2PC WAL records  in sync must be with a large
>> performance impact.
>> At least, we need to document the limitation and how to handle this
>> situation.
> 
> Ok. I'll add it.

Thanks a lot.

>> > Related to the pointing out by Ikeda-san, I realized that with the
>> > current patch the backend waits for synchronous replication and then
>> > waits for foreign transaction resolution. But it should be reversed.
>> > Otherwise, it could lead to data loss even when the client got an
>> > acknowledgment of commit. Also, when the user is using both atomic
>> > commit and synchronous replication and wants to cancel waiting, he/she
>> > will need to press ctl-c twice with the current patch, which also
>> > should be fixed.
>> 
>> I'm sorry that I can't understood.
>> 
>> In my understanding, if COMMIT WAL is replicated to the standby in 
>> sync,
>> the standby server can resolve the transaction after crash recovery in
>> promoted phase.
>> 
>> If reversed, there are some situation which can't guarantee atomic
>> commit.
>> In case that some foreign transaction resolutions are succeed but 
>> others
>> are failed(and COMMIT WAL is not replicated),
>> the standby must ABORT PREPARED because the COMMIT WAL is not
>> replicated.
>> This means that some  foreign transactions are COMMITE PREPARED 
>> executed
>> by primary coordinator,
>> other foreign transactions can be ABORT PREPARED executed by secondary
>> coordinator.
> 
> You're right. Thank you for pointing out!
> 
> If the coordinator crashes after the client gets acknowledgment of the
> successful commit of the transaction but before sending
> XLOG_FDWXACT_REMOVE record to the replicas, the FdwXact entries are
> left on the replicas even after failover. But since we require FDW to
> tolerate the error of undefined prepared transactions in
> COMMIT/ROLLBACK PREPARED it won’t be a critical problem.

I agree. It's ok that the primary coordinator sends
XLOG_FDWXACT_REMOVE records asynchronously.
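The reason leftover FdwXact entries are harmless can be sketched as follows (hypothetical Python, not patch code): the resolver treats an "undefined prepared transaction" error from the remote server as already-resolved, so replaying a stale entry after failover is idempotent.

```python
# Hypothetical sketch: a resolver walks the surviving FdwXact entries
# and issues COMMIT PREPARED for each; a "prepared transaction does not
# exist" error means it was resolved before the crash, so it is ignored.

class UndefinedPreparedTransaction(Exception):
    pass

def resolve(entries, commit_prepared):
    resolved = []
    for xid in entries:
        try:
            commit_prepared(xid)
        except UndefinedPreparedTransaction:
            pass  # already committed before the crash; nothing to do
        resolved.append(xid)
    return resolved

# Simulated remote server: only fx_1 is still prepared there.
remote = {"fx_1"}
def commit_prepared(xid):
    if xid not in remote:
        raise UndefinedPreparedTransaction(xid)
    remote.discard(xid)

# fx_2's entry survived the crash although it was already committed;
# the resolver still finishes cleanly.
assert resolve(["fx_1", "fx_2"], commit_prepared) == ["fx_1", "fx_2"]
assert remote == set()
```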

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Fri, Jul 17, 2020 at 8:38 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 16 Jul 2020 at 13:53, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > Hi Sawada san,
> >
> >
> > I'm reviewing this patch series, and let me give some initial comments and questions.  I'm looking at this with a
hope that this will be useful purely as a FDW enhancement for our new use cases, regardless of whether the FDW will be
used for Postgres scale-out. 
>
> Thank you for reviewing this patch!
>
> Yes, this patch is trying to resolve the generic atomic commit problem
> w.r.t. FDW, and will be useful also for Postgres scale-out.
>

I think it is important to get a consensus on this point.  If I
understand correctly, Tsunakawa-San doesn't sound convinced that
FDW can be used for postgres scale-out, and we are trying to paint this
feature as a step forward in the scale-out direction.  As per my
understanding, we don't have a very clear vision of whether we will be
able to achieve the other important aspects of the scale-out feature,
like global visibility, if we go in this direction, and that is the
reason I have insisted in this and the other related thread [1] that we
at least have a high-level idea of the same before going too far with
this patch.  It is quite possible that after spending months of effort
to straighten out this patch/feature, we will come to the conclusion
that it needs to be re-designed or requires a lot of re-work to ensure
that it can be extended for global visibility.  It is better to spend
some effort up front to see if the proposed patch is a stepping stone
for achieving what we want w.r.t. postgres scale-out.


[1] - https://www.postgresql.org/message-id/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I also believe they do. But I'm concerned that some FDW needs to start
> a transaction differently when using 2PC. For instance, IIUC MySQL
> also supports 2PC but the transaction needs to be started with "XA
> START id” when the transaction needs to be prepared. The transaction
> started with XA START can be closed by XA END followed by XA PREPARE
> or XA COMMIT ONE PHASE. It means that when starts a new transaction
> the transaction needs to prepare the transaction identifier and to
> know that 2PC might be used. It’s quite different from PostgreSQL. In
> PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE
> TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is
> required when PREPARE TRANSACTION.

I guess Postgres is rather a minority in this regard.  All I know is XA and its Java counterpart (Java Transaction API:
JTA). In XA, the connection needs to be associated with an XID before its transaction work is performed.
 
If some transaction work is already done before associating with XID, xa_start() returns an error like this:

[XA specification]
--------------------------------------------------
[XAER_OUTSIDE] 
The resource manager is doing work outside any global transaction on behalf of 
the application. 
--------------------------------------------------


[Java Transaction API (JTA)]
--------------------------------------------------
void start(Xid xid, int flags) throws XAException 

This method starts work on behalf of a transaction branch. 
...

3.4.7 Local and Global Transactions 
The resource adapter is encouraged to support the usage of both local and global 
transactions within the same transactional connection. Local transactions are 
transactions that are started and coordinated by the resource manager internally. The 
XAResource interface is not used for local transactions. 

When using the same connection to perform both local and global transactions, the 
following rules apply: 

. The local transaction must be committed (or rolled back) before starting a 
global transaction in the connection. 
. The global transaction must be disassociated from the connection before any 
local transaction is started. 
--------------------------------------------------


(FWIW, jdbc_fdw would expect to use JTA for this FDW 2PC?)
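The XAER_OUTSIDE rule quoted above can be illustrated with a small sketch (hypothetical Python state machine, not any real driver API): once a connection has done work outside a global transaction, a subsequent xa_start() must fail until that local work is finished.

```python
# Hypothetical sketch of the XA association rule: local (non-XA) work
# on a connection makes a later xa_start() fail with XAER_OUTSIDE,
# because the RM is already working outside any global transaction.

class XAError(Exception):
    pass

class Connection:
    def __init__(self):
        self.local_work = False  # work done outside a global transaction
        self.xid = None          # currently associated branch, if any

    def execute(self, sql):
        if self.xid is None:
            self.local_work = True  # starts/continues a local transaction

    def xa_start(self, xid):
        if self.local_work:
            raise XAError("XAER_OUTSIDE")
        self.xid = xid

conn = Connection()
conn.execute("INSERT ...")   # local work happens first
try:
    conn.xa_start("gtrid-1")
    raise AssertionError("expected XAER_OUTSIDE")
except XAError as e:
    assert str(e) == "XAER_OUTSIDE"
```

This is the behavior an FDW targeting XA-style servers would have to plan around: it must know *before* the first statement whether the transaction may later be prepared.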



> I haven’t tested the above yet and it’s just a desk plan. it's
> definitely a good idea to try integrating this 2PC feature to FDWs
> other than postgres_fdw to see if design and interfaces are
> implemented sophisticatedly.

Yes, if we address this 2PC feature as an FDW enhancement, we need to make sure that at least some well-known DBMSs
would be able to implement the new interface.  The following part may help devise the interface:
 


[References from XA specification]
--------------------------------------------------
The primary use of xa_start() is to register a new transaction branch with the RM. 
This marks the start of the branch. Subsequently, the AP, using the same thread of 
control, uses the RM’s native interface to do useful work. All requests for service 
made by the same thread are part of the same branch until the thread dissociates 
from the branch (see below). 

3.3.1 Registration of Resource Managers 
Normally, a TM involves all associated RMs in a transaction branch. (The TM’s set of 
RM switches, described in Section 4.3 on page 21 tells the TM which RMs are 
associated with it.) The TM calls all these RMs with xa_start(), xa_end(), and 
xa_prepare (), although an RM that is not active in a branch need not participate further 
(see Section 2.3.2 on page 8). A technique to reduce overhead for infrequently-used 
RMs is discussed below. 

Dynamic Registration 

Certain RMs, especially those involved in relatively few global transactions, may ask 
the TM to assume they are not involved in a transaction. These RMs must register with 
the TM before they do application work, to see whether the work is part of a global 
transaction. The TM never calls these RMs with any form of xa_start(). An RM 
declares dynamic registration in its switch (see Section 4.3 on page 21). An RM can 
make this declaration only on its own behalf, and doing so does not change the TM’s 
behaviour with respect to other RMs. 

When an AP requests work from such an RM, before doing any work, the RM contacts 
the TM by calling ax_reg(). The RM must call ax_reg() from the same thread of control 
that the AP would use if it called ax_reg() directly. The TM returns to the RM the 
appropriate XID if the AP is in a global transaction. 

The implications of dynamically registering are as follows: when a thread of control 
begins working on behalf of a transaction branch, the transaction manager calls 
xa_start() for all resource managers known to the thread except those having 
TMREGISTER set in their xa_switch_t structure. Thus, those resource managers with 
this flag set must explicitly join a branch with ax_reg(). Secondly, when a thread of 
control is working on behalf of a branch, a transaction manager calls xa_end() for all 
resource managers known to the thread that either do not have TMREGISTER set in 
their xa_switch_t structure or have dynamically registered with ax_reg(). 


int 
xa_start(XID *xid, int rmid, long flags) 

DESCRIPTION 
A transaction manager calls xa_start() to inform a resource manager that an application 
may do work on behalf of a transaction branch.
...
A transaction manager calls xa_start() only for those resource managers that do not 
have TMREGISTER set in the flags element of their xa_switch_t structure. Resource 
managers with TMREGISTER set must use ax_reg() to join a transaction branch (see 
ax_reg() for details). 
--------------------------------------------------


> > To track the foreign transaction status, we can add GetTransactionStatus() to
> the FDW interface as an alternative, can't we?
> 
> I haven't thought such an interface but it sounds like the transaction
> status is managed on both the core and FDWs. Could you elaborate on
> that?

I don't have such a deep analysis.  I just thought that the core could keep track of the local transaction status, and
ask each participant FDW about its transaction status to determine an action.
 


> > If the WAL records of multiple concurrent transactions are written and
> synced separately, i.e. group commit doesn't take effect, then the OLTP
> transaction performance will be unacceptable.
> 
> I agree that it'll be a large performance penalty. I'd like to have it
> but I’m not sure we should have it in the first version from the
> perspective of complexity.

I think at least we should have a rough image of how we can reach the goal.  Otherwise, the current
design/implementation may have to be overhauled with great effort in the near future.  Apart from that, I feel it's
unnatural that the commit processing is serialized at the transaction resolver while the DML processing of multiple
foreign transactions can be performed in parallel.
 


> Since the procedure of 2PC is originally
> high cost, in my opinion, the user should not use as much as possible
> in terms of performance. Especially in OLTP, its cost will directly
> affect the latency. I’d suggest designing database schema so
> transaction touches only one foreign server but do you have concrete
> OLTP usecase where normally requires 2PC, and how many servers
> involved within a distributed transaction?

I can't share the details, but some of our customers show interest in Postgres scale-out or FDW 2PC for the following
use cases:
 

* Multitenant OLTP where the data specific to one tenant is stored on one database server.  On the other hand, some
data are shared among all tenants, and they are stored on a separate server.  The shared data and the tenant-specific
data are updated in the same transaction (I don't know the frequency of such transactions.)
 

* An IoT use case where each edge database server monitors and tracks the movement of objects in one area.  Those edge
database servers store the records of objects they manage.  When an object gets out of one area and moves to another,
the record for the object is moved between the two edge database servers using an atomic distributed transaction.
 

(I wonder if TPC-C or TPC-E needs distributed transaction...)


Regards
Takayuki Tsunakawa





Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, 18 Jul 2020 at 01:45, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/07/17 20:04, Masahiko Sawada wrote:
> > On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> >>
> >> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> >> I have briefly checked the only oracle_fdw but in general I think that
> >>> if an existing FDW supports transaction begin, commit, and rollback,
> >>> these can be ported to new FDW transaction APIs easily.
> >>
> >> Does oracle_fdw support begin, commit and rollback?
> >>
> >> And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction?  In
other words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the
2PC processing?  I think we need to confirm that there are concrete examples. 
> >
> > I also believe they do. But I'm concerned that some FDW needs to start
> > a transaction differently when using 2PC. For instance, IIUC MySQL
> > also supports 2PC but the transaction needs to be started with "XA
> > START id” when the transaction needs to be prepared. The transaction
> > started with XA START can be closed by XA END followed by XA PREPARE
> > or XA COMMIT ONE PHASE.
>
> This means that FDW should provide also the API for xa_end()?
> Maybe we need to consider again which API we should provide in FDW,
> based on XA specification?

Not sure that we really need the API for xa_end(). It's not necessary,
at least in the MySQL case. mysql_fdw can execute either XA END and XA
PREPARE when the FDW prepare API is called, or XA END and XA COMMIT ONE
PHASE when the FDW commit API is called with FDWXACT_FLAG_ONEPHASE.

>
>
> > It means that when starts a new transaction
> > the transaction needs to prepare the transaction identifier and to
> > know that 2PC might be used. It’s quite different from PostgreSQL. In
> > PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE
> > TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is
> > required when PREPARE TRANSACTION.
> >
> > With MySQL, I guess FDW needs a way to tell the (next) transaction
> > needs to be started with XA START so it can be prepared. It could be a
> > custom GUC or an SQL function. Then when starts a new transaction on
> > MySQL server, FDW can generate and store a transaction identifier into
> > somewhere alongside the connection. At the prepare phase, it passes
> > the transaction identifier via GetPrepareId() API to the core.
> >
> > I haven’t tested the above yet and it’s just a desk plan. it's
> > definitely a good idea to try integrating this 2PC feature to FDWs
> > other than postgres_fdw to see if design and interfaces are
> > implemented sophisticatedly.
>
> With the current patch, we track whether write queries are executed
> in each server. Then, if the number of servers that execute write queries
> is less than two, 2PC is skipped. This "optimization" is not necessary
> (cannot be applied) when using mysql_fdw because the transaction starts
> with XA START. Right?

I think we can use XA COMMIT ONE PHASE in MySQL, which both prepares
and commits the transaction. If the number of servers that executed
write queries is less than two, the core transaction manager calls the
CommitForeignTransaction API with the flag FDWXACT_FLAG_ONEPHASE. That
way, mysql_fdw can execute XA COMMIT ONE PHASE instead of XA PREPARE,
following XA END. On the other hand, when the number of such servers
is greater than or equal to two, the core transaction manager calls the
PrepareForeignTransaction API and then the CommitForeignTransaction API
without that flag. In this case, mysql_fdw can execute XA END and XA
PREPARE in the PrepareForeignTransaction API call, and then XA COMMIT in
the CommitForeignTransaction API call.
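The two call paths above can be sketched as a mapping from the FDW callbacks to MySQL XA statements (hypothetical Python; the callback names and FDWXACT_FLAG_ONEPHASE mirror the patch, but the wrapper itself is only an illustration, not mysql_fdw code):

```python
# Hypothetical sketch of a mysql_fdw-like translation of FDW transaction
# callbacks into MySQL XA statements, for the two cases described above.

FDWXACT_FLAG_ONEPHASE = 0x01  # flag name borrowed from the patch

def prepare_foreign_transaction(xid):
    # Two-phase path, first phase: end the branch, then prepare it.
    return [f"XA END '{xid}'", f"XA PREPARE '{xid}'"]

def commit_foreign_transaction(xid, flags=0):
    if flags & FDWXACT_FLAG_ONEPHASE:
        # Fewer than two writers: end and commit in a single round.
        return [f"XA END '{xid}'", f"XA COMMIT '{xid}' ONE PHASE"]
    # Second phase after a successful XA PREPARE.
    return [f"XA COMMIT '{xid}'"]

# One-phase path (fewer than two writing servers):
assert commit_foreign_transaction("fx1", FDWXACT_FLAG_ONEPHASE) == \
    ["XA END 'fx1'", "XA COMMIT 'fx1' ONE PHASE"]
# Two-phase path (two or more writing servers):
assert prepare_foreign_transaction("fx1") + commit_foreign_transaction("fx1") == \
    ["XA END 'fx1'", "XA PREPARE 'fx1'", "XA COMMIT 'fx1'"]
```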

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/07/16 14:47, Masahiko Sawada wrote:
> > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>
> >>
> >> On 2020/07/14 9:08, Masahiro Ikeda wrote:
> >>>> I've attached the latest version patches. I've incorporated the review
> >>>> comments I got so far and improved locking strategy.
> >>>
> >>> Thanks for updating the patch!
> >>
> >> +1
> >> I'm interested in these patches and now studying them. While checking
> >> the behaviors of the patched PostgreSQL, I got three comments.
> >
> > Thank you for testing this patch!
> >
> >>
> >> 1. We can access to the foreign table even during recovery in the HEAD.
> >> But in the patched version, when I did that, I got the following error.
> >> Is this intentional?
> >>
> >> ERROR:  cannot assign TransactionIds during recovery
> >
> > No, it should be fixed. I'm going to fix this by not collecting
> > participants for atomic commit during recovery.
>
> Thanks for trying to fix the issues!
>
> I'd like to report one more issue. When I started new transaction
> in the local server, executed INSERT in the remote server via
> postgres_fdw and then quit psql, I got the following assertion failure.
>
> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
> 5   postgres                            0x000000010d313da3 proc_exit + 19
> 6   postgres                            0x000000010d35112f PostgresMain + 3711
> 7   postgres                            0x000000010d27bb3a BackendRun + 570
> 8   postgres                            0x000000010d27af6b BackendStartup + 475
> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
> 11  postgres                            0x000000010d1597b9 main + 761
> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
> 13  ???                                 0x0000000000000003 0x0 + 3
>

Thank you for reporting the issue!

I've attached the latest version patch that incorporated all comments
I got so far. I've removed the patch adding the 'prefer' mode of
foreign_twophase_commit to keep the patch set simple.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Transactions involving multiple postgres foreign servers, take 2

From
Ahsan Hadi
Date:


On Fri, Jul 17, 2020 at 9:56 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:


On 2020/07/16 14:47, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>> I've attached the latest version patches. I've incorporated the review
>>>> comments I got so far and improved locking strategy.
>>>
>>> Thanks for updating the patch!
>>
>> +1
>> I'm interested in these patches and now studying them. While checking
>> the behaviors of the patched PostgreSQL, I got three comments.
>
> Thank you for testing this patch!
>
>>
>> 1. We can access to the foreign table even during recovery in the HEAD.
>> But in the patched version, when I did that, I got the following error.
>> Is this intentional?
>>
>> ERROR:  cannot assign TransactionIds during recovery
>
> No, it should be fixed. I'm going to fix this by not collecting
> participants for atomic commit during recovery.

Thanks for trying to fix the issues!

I'd like to report one more issue. When I started new transaction
in the local server, executed INSERT in the remote server via
postgres_fdw and then quit psql, I got the following assertion failure.

TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
3   postgres                            0x000000010d313fe3 shmem_exit + 179
4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
5   postgres                            0x000000010d313da3 proc_exit + 19
6   postgres                            0x000000010d35112f PostgresMain + 3711
7   postgres                            0x000000010d27bb3a BackendRun + 570
8   postgres                            0x000000010d27af6b BackendStartup + 475
9   postgres                            0x000000010d279ed1 ServerLoop + 593
10  postgres                            0x000000010d277940 PostmasterMain + 6016
11  postgres                            0x000000010d1597b9 main + 761
12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
13  ???                                 0x0000000000000003 0x0 + 3

I have done a test with the latest set of patches shared by Sawada-san, and I am not able to reproduce this issue. I started a prepared transaction on the local server, did a couple of inserts into a remote table using postgres_fdw, and then quit psql; the assertion failure did not occur.

 

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




--
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: mailto: ahsan.hadi@highgo.ca

Re: Transactions involving multiple postgres foreign servers, take 2

From
Muhammad Usama
Date:


On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/07/16 14:47, Masahiko Sawada wrote:
> > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>
> >>
> >> On 2020/07/14 9:08, Masahiro Ikeda wrote:
> >>>> I've attached the latest version patches. I've incorporated the review
> >>>> comments I got so far and improved locking strategy.
> >>>
> >>> Thanks for updating the patch!
> >>
> >> +1
> >> I'm interested in these patches and now studying them. While checking
> >> the behaviors of the patched PostgreSQL, I got three comments.
> >
> > Thank you for testing this patch!
> >
> >>
> >> 1. We can access to the foreign table even during recovery in the HEAD.
> >> But in the patched version, when I did that, I got the following error.
> >> Is this intentional?
> >>
> >> ERROR:  cannot assign TransactionIds during recovery
> >
> > No, it should be fixed. I'm going to fix this by not collecting
> > participants for atomic commit during recovery.
>
> Thanks for trying to fix the issues!
>
> I'd like to report one more issue. When I started new transaction
> in the local server, executed INSERT in the remote server via
> postgres_fdw and then quit psql, I got the following assertion failure.
>
> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
> 5   postgres                            0x000000010d313da3 proc_exit + 19
> 6   postgres                            0x000000010d35112f PostgresMain + 3711
> 7   postgres                            0x000000010d27bb3a BackendRun + 570
> 8   postgres                            0x000000010d27af6b BackendStartup + 475
> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
> 11  postgres                            0x000000010d1597b9 main + 761
> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
> 13  ???                                 0x0000000000000003 0x0 + 3
>

Thank you for reporting the issue!

I've attached the latest version patch that incorporated all comments
I got so far. I've removed the patch adding the 'prefer' mode of
foreign_twophase_commit to keep the patch set simple.

I have started to review the patchset. Just a quick comment.

Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
contains changes (adding fdwxact includes) to
src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c
and src/backend/executor/execPartition.c that don't seem to be
required with the latest version.


Thanks and best regards,
Muhammad Usama

 

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> >
>> >
>> >
>> > On 2020/07/16 14:47, Masahiko Sawada wrote:
>> > > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> > >>
>> > >>
>> > >>
>> > >> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>> > >>>> I've attached the latest version patches. I've incorporated the review
>> > >>>> comments I got so far and improved locking strategy.
>> > >>>
>> > >>> Thanks for updating the patch!
>> > >>
>> > >> +1
>> > >> I'm interested in these patches and now studying them. While checking
>> > >> the behaviors of the patched PostgreSQL, I got three comments.
>> > >
>> > > Thank you for testing this patch!
>> > >
>> > >>
>> > >> 1. We can access to the foreign table even during recovery in the HEAD.
>> > >> But in the patched version, when I did that, I got the following error.
>> > >> Is this intentional?
>> > >>
>> > >> ERROR:  cannot assign TransactionIds during recovery
>> > >
>> > > No, it should be fixed. I'm going to fix this by not collecting
>> > > participants for atomic commit during recovery.
>> >
>> > Thanks for trying to fix the issues!
>> >
>> > I'd like to report one more issue. When I started new transaction
>> > in the local server, executed INSERT in the remote server via
>> > postgres_fdw and then quit psql, I got the following assertion failure.
>> >
>> > TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>> > 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
>> > 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
>> > 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
>> > 3   postgres                            0x000000010d313fe3 shmem_exit + 179
>> > 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
>> > 5   postgres                            0x000000010d313da3 proc_exit + 19
>> > 6   postgres                            0x000000010d35112f PostgresMain + 3711
>> > 7   postgres                            0x000000010d27bb3a BackendRun + 570
>> > 8   postgres                            0x000000010d27af6b BackendStartup + 475
>> > 9   postgres                            0x000000010d279ed1 ServerLoop + 593
>> > 10  postgres                            0x000000010d277940 PostmasterMain + 6016
>> > 11  postgres                            0x000000010d1597b9 main + 761
>> > 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
>> > 13  ???                                 0x0000000000000003 0x0 + 3
>> >
>>
>> Thank you for reporting the issue!
>>
>> I've attached the latest version patch that incorporated all comments
>> I got so far. I've removed the patch adding the 'prefer' mode of
>> foreign_twophase_commit to keep the patch set simple.
>
>
> I have started to review the patchset. Just a quick comment.
>
> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
> contains changes (adding fdwxact includes) for
> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
> and  src/backend/executor/execPartition.c files that doesn't seem to be
> required with the latest version.

Thanks for your comment.

Right. I've removed these changes on the local branch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/07/27 15:59, Masahiko Sawada wrote:
> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
>>
>>
>>
>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>>
>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/07/16 14:47, Masahiko Sawada wrote:
>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>>>>>> I've attached the latest version patches. I've incorporated the review
>>>>>>>> comments I got so far and improved locking strategy.
>>>>>>>
>>>>>>> Thanks for updating the patch!
>>>>>>
>>>>>> +1
>>>>>> I'm interested in these patches and now studying them. While checking
>>>>>> the behaviors of the patched PostgreSQL, I got three comments.
>>>>>
>>>>> Thank you for testing this patch!
>>>>>
>>>>>>
>>>>>> 1. We can access to the foreign table even during recovery in the HEAD.
>>>>>> But in the patched version, when I did that, I got the following error.
>>>>>> Is this intentional?
>>>>>>
>>>>>> ERROR:  cannot assign TransactionIds during recovery
>>>>>
>>>>> No, it should be fixed. I'm going to fix this by not collecting
>>>>> participants for atomic commit during recovery.
>>>>
>>>> Thanks for trying to fix the issues!
>>>>
>>>> I'd like to report one more issue. When I started new transaction
>>>> in the local server, executed INSERT in the remote server via
>>>> postgres_fdw and then quit psql, I got the following assertion failure.
>>>>
>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>>>> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
>>>> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
>>>> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
>>>> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
>>>> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
>>>> 5   postgres                            0x000000010d313da3 proc_exit + 19
>>>> 6   postgres                            0x000000010d35112f PostgresMain + 3711
>>>> 7   postgres                            0x000000010d27bb3a BackendRun + 570
>>>> 8   postgres                            0x000000010d27af6b BackendStartup + 475
>>>> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
>>>> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
>>>> 11  postgres                            0x000000010d1597b9 main + 761
>>>> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
>>>> 13  ???                                 0x0000000000000003 0x0 + 3
>>>>
>>>
>>> Thank you for reporting the issue!
>>>
>>> I've attached the latest version patch that incorporated all comments
>>> I got so far. I've removed the patch adding the 'prefer' mode of
>>> foreign_twophase_commit to keep the patch set simple.
>>
>>
>> I have started to review the patchset. Just a quick comment.
>>
>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
>> contains changes (adding fdwxact includes) for
>> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
>> and  src/backend/executor/execPartition.c files that doesn't seem to be
>> required with the latest version.
> 
> Thanks for your comment.
> 
> Right. I've removed these changes on the local branch.

The latest patches failed to be applied to the master branch. Could you rebase the patches?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/07/27 15:59, Masahiko Sawada wrote:
> > On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
> >>
> >>
> >>
> >> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>>
> >>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/07/16 14:47, Masahiko Sawada wrote:
> >>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
> >>>>>>>> I've attached the latest version patches. I've incorporated the review
> >>>>>>>> comments I got so far and improved locking strategy.
> >>>>>>>
> >>>>>>> Thanks for updating the patch!
> >>>>>>
> >>>>>> +1
> >>>>>> I'm interested in these patches and now studying them. While checking
> >>>>>> the behaviors of the patched PostgreSQL, I got three comments.
> >>>>>
> >>>>> Thank you for testing this patch!
> >>>>>
> >>>>>>
> >>>>>> 1. We can access to the foreign table even during recovery in the HEAD.
> >>>>>> But in the patched version, when I did that, I got the following error.
> >>>>>> Is this intentional?
> >>>>>>
> >>>>>> ERROR:  cannot assign TransactionIds during recovery
> >>>>>
> >>>>> No, it should be fixed. I'm going to fix this by not collecting
> >>>>> participants for atomic commit during recovery.
> >>>>
> >>>> Thanks for trying to fix the issues!
> >>>>
> >>>> I'd like to report one more issue. When I started new transaction
> >>>> in the local server, executed INSERT in the remote server via
> >>>> postgres_fdw and then quit psql, I got the following assertion failure.
> >>>>
> >>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
> >>>> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
> >>>> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
> >>>> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
> >>>> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
> >>>> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
> >>>> 5   postgres                            0x000000010d313da3 proc_exit + 19
> >>>> 6   postgres                            0x000000010d35112f PostgresMain + 3711
> >>>> 7   postgres                            0x000000010d27bb3a BackendRun + 570
> >>>> 8   postgres                            0x000000010d27af6b BackendStartup + 475
> >>>> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
> >>>> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
> >>>> 11  postgres                            0x000000010d1597b9 main + 761
> >>>> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
> >>>> 13  ???                                 0x0000000000000003 0x0 + 3
> >>>>
> >>>
> >>> Thank you for reporting the issue!
> >>>
> >>> I've attached the latest version patch that incorporated all comments
> >>> I got so far. I've removed the patch adding the 'prefer' mode of
> >>> foreign_twophase_commit to keep the patch set simple.
> >>
> >>
> >> I have started to review the patchset. Just a quick comment.
> >>
> >> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
> >> contains changes (adding fdwxact includes) for
> >> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
> >> and  src/backend/executor/execPartition.c files that doesn't seem to be
> >> required with the latest version.
> >
> > Thanks for your comment.
> >
> > Right. I've removed these changes on the local branch.
>
> The latest patches failed to be applied to the master branch. Could you rebase the patches?
>

Thank you for letting me know. I've attached the latest version patch set.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
> On 2020-07-17 15:55, Masahiko Sawada wrote:
>> On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda 
>> <ikedamsh(at)oss(dot)nttdata(dot)com>
>> wrote:
>>> 
>>> On 2020-07-16 13:16, Masahiko Sawada wrote:
>>>> On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda 
>>>> <ikedamsh(at)oss(dot)nttdata(dot)com>
>>>> wrote:
>>>>> 
>>>>>> I've attached the latest version patches. I've incorporated the 
>>>>>> review
>>>>>> comments I got so far and improved locking strategy.
>>>>> 
>>>>> I want to ask a question about streaming replication with 2PC.
>>>>> Are you going to support 2PC with streaming replication?
>>>>> 
>>>>> I tried streaming replication using v23 patches.
>>>>> I confirm that 2PC works with streaming replication,
>>>>> which there are primary/standby coordinator.
>>>>> 
>>>>> But, in my understanding, the WAL of "PREPARE" and
>>>>> "COMMIT/ABORT PREPARED" can't be replicated to the standby server 
>>>>> in
>>>>> sync.
>>>>> 
>>>>> If this is right, the unresolved transaction can be occurred.
>>>>> 
>>>>> For example,
>>>>> 
>>>>> 1. PREPARE is done
>>>>> 2. crash primary before the WAL related to PREPARE is
>>>>>     replicated to the standby server
>>>>> 3. promote standby server // but can't execute "ABORT PREPARED"
>>>>> 
>>>>> In above case, the remote server has the unresolved transaction.
>>>>> Can we solve this problem to support in-sync replication?
>>>>> 
>>>>> But, I think some users use async replication for performance.
>>>>> Do we need to document the limitation or make another solution?
>>>>> 
>>>> 
>>>> IIUC with synchronous replication, we can guarantee that WAL records
>>>> are written on both primary and replicas when the client got an
>>>> acknowledgment of commit. We don't replicate each WAL records
>>>> generated during transaction one by one in sync. In the case you
>>>> described, the client will get an error due to the server crash.
>>>> Therefore I think the user cannot expect WAL records generated so 
>>>> far
>>>> has been replicated. The same issue could happen also when the user
>>>> executes PREPARE TRANSACTION and the server crashes.
>>> 
>>> Thanks! I didn't noticed the behavior when a user executes PREPARE
>>> TRANSACTION is same.
>>> 
>>> IIUC with 2PC, there is a different point between (1)PREPARE
>>> TRANSACTION
>>> and (2)2PC.
>>> The point is that whether the client can know when the server crashed
>>> and it's global tx id.
>>> 
>>> If (1)PREPARE TRANSACTION is failed, it's ok the client execute same
>>> command
>>> because if the remote server is already prepared the command will be
>>> ignored.
>>> 
>>> But, if (2)2PC is failed with coordinator crash, the client can't 
>>> know
>>> what operations should be done.
>>> 
>>> If the old coordinator already executed PREPARED, there are some
>>> transaction which should be ABORT PREPARED.
>>> But if the PREPARED WAL is not sent to the standby, the new
>>> coordinator
>>> can't execute ABORT PREPARED.
>>> And the client can't know which remote servers have PREPARED
>>> transactions which should be ABORTED either.
>>> 
>>> Even if the client can know that, only the old coordinator knows its
>>> global transaction id.
>>> Only the database administrator can analyze the old coordinator's log
>>> and then execute the appropriate commands manually, right?
>> 
>> I think that's right. In the case of the coordinator crash, the user
>> can look orphaned foreign prepared transactions by checking the
>> 'identifier' column of pg_foreign_xacts on the new standby server and
>> the prepared transactions on the remote servers.
>> 
> I think there is a case we can't check orphaned foreign
> prepared transaction in pg_foreign_xacts view on the new standby 
> server.
> It confuses users and database administrators.
> 
> If the primary coordinator crashes after preparing foreign transaction,
> but before sending XLOG_FDWXACT_INSERT records to the standby server,
> the standby server can't restore their transaction status and
> pg_foreign_xacts view doesn't show the prepared foreign transactions.
> 
> To send XLOG_FDWXACT_INSERT records asynchronously leads this problem.

If the primary replicates XLOG_FDWXACT_INSERT to the standby
asynchronously, some prepared transactions may remain unresolved forever.

Since resolving this inconsistency manually is a hard operation,
I think we need to support synchronous replication of XLOG_FDWXACT_INSERT.

I understand that this has a large performance impact, but users can
control the consistency/durability vs. performance trade-off with the
synchronous_commit parameter.

What do you think?
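For example, on the coordinator this trade-off could be controlled with the usual synchronous replication settings (a sketch; the standby name is a placeholder, and whether the fdwxact WAL records would honor these settings depends on the patch):

```
# postgresql.conf on the primary coordinator (names are placeholders)
synchronous_standby_names = 'standby1'  # wait for this standby at commit
synchronous_commit = on                 # or remote_write/remote_apply/off,
                                        # trading durability for performance
```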


> Thank you for letting me know. I've attached the latest version patch 
> set.

Thanks for updating.
However, the latest patches fail to apply to the master branch.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 28 Aug 2020 at 17:50, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> > I think there is a case we can't check orphaned foreign
> > prepared transaction in pg_foreign_xacts view on the new standby
> > server.
> > It confuses users and database administrators.
> >
> > If the primary coordinator crashes after preparing foreign transaction,
> > but before sending XLOG_FDWXACT_INSERT records to the standby server,
> > the standby server can't restore their transaction status and
> > pg_foreign_xacts view doesn't show the prepared foreign transactions.
> >
> > To send XLOG_FDWXACT_INSERT records asynchronously leads this problem.
>
> If the primary replicates XLOG_FDWXACT_INSERT to the standby
> asynchronously,
> some prepared transaction may be unsolved forever.
>
> Since I think to solve this inconsistency manually is hard operation,
> we need to support synchronous XLOG_FDWXACT_INSERT replication.
>
> I understood that there are a lot of impact to the performance,
> but users can control the consistency/durability vs performance
> with synchronous_commit parameter.
>
> What do you think?

I think the user can check such prepared transactions by looking for
transactions that exist in the foreign server's pg_prepared_xacts but
not in the coordinator server's pg_foreign_xacts, no? To make checking
for such prepared transactions easier, perhaps we could include a
timestamp in the prepared transaction id. But I'm concerned about
duplicate transaction ids due to clock skew.

If there is a way to identify such unresolved foreign transactions and
it's not cumbersome, then given that the problem you're concerned about
is unlikely to happen, I guess a certain number of users would be able
to accept it as a restriction. So I'd recommend not dealing with this
problem in the first version of the patch; we can improve the feature
to deal with it later as an additional feature. Thoughts?
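As a sketch of that cross-check (pg_foreign_xacts is the view added by this patch set, so treating its shape here is an assumption; pg_prepared_xacts is the existing system view):

```sql
-- On each foreign server: transactions prepared but not yet resolved.
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- On the (patched) coordinator: foreign transactions it still tracks.
SELECT * FROM pg_foreign_xacts;

-- Any gid found on a foreign server but missing from the coordinator's
-- view is a candidate orphan to resolve manually with
-- COMMIT PREPARED 'gid' or ROLLBACK PREPARED 'gid'.
```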

> > Thank you for letting me know. I've attached the latest version patch
> > set.
>
> Thanks for updating.
> But, the latest patches failed to be applied to the master branch.

I'll submit the updated version patch.

Regards,
--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
On 2020-09-03 23:08, Masahiko Sawada wrote:
> On Fri, 28 Aug 2020 at 17:50, Masahiro Ikeda <ikedamsh@oss.nttdata.com> 
> wrote:
>> 
>> > I think there is a case we can't check orphaned foreign
>> > prepared transaction in pg_foreign_xacts view on the new standby
>> > server.
>> > It confuses users and database administrators.
>> >
>> > If the primary coordinator crashes after preparing foreign transaction,
>> > but before sending XLOG_FDWXACT_INSERT records to the standby server,
>> > the standby server can't restore their transaction status and
>> > pg_foreign_xacts view doesn't show the prepared foreign transactions.
>> >
>> > To send XLOG_FDWXACT_INSERT records asynchronously leads this problem.
>> 
>> If the primary replicates XLOG_FDWXACT_INSERT to the standby
>> asynchronously,
>> some prepared transaction may be unsolved forever.
>> 
>> Since I think to solve this inconsistency manually is hard operation,
>> we need to support synchronous XLOG_FDWXACT_INSERT replication.
>> 
>> I understood that there are a lot of impact to the performance,
>> but users can control the consistency/durability vs performance
>> with synchronous_commit parameter.
>> 
>> What do you think?
> 
> I think the user can check such prepared transactions by seeing
> transactions that exist on the foreign server's pg_prepared_xact but
> not on the coordinator server's pg_foreign_xacts, no? To make checking
> such prepared transactions easy, perhaps we could contain the
> timestamp to prepared transaction id. But I’m concerned the
> duplication of transaction id due to clock skew.

Thanks for letting me know.
I agree that we can cross-check pg_prepared_xacts and pg_foreign_xacts.

We then have to manually abort any transaction that exists in
pg_prepared_xacts but not in pg_foreign_xacts, don't we?
So users have to use a foreign database that supports showing
prepared transaction status, as pg_foreign_xacts does.

When can duplicate transaction ids occur?
I'm sorry, but I couldn't follow the point about clock skew.

IIUC, since the prepared transaction id may contain the coordinator's
xid, there is no clock skew and we can determine the transaction id
uniquely. If an FDW implements the GetPrepareId_function API and
generates a transaction id without the coordinator's xid, your concern
would apply. But I can't think of a case where a transaction id would
be generated without the coordinator's xid.

> If there is a way to identify such unresolved foreign transactions and
> it's not cumbersome, given that the likelihood of problem you're
> concerned is unlikely high I guess a certain number of would be able
> to accept it as a restriction. So I’d recommend not dealing with this
> problem in the first version patch and we will be able to improve this
> feature to deal with this problem as an additional feature. Thoughts?

I agree. Thanks for your comments.

>> > Thank you for letting me know. I've attached the latest version patch
>> > set.
>> 
>> Thanks for updating.
>> But, the latest patches failed to be applied to the master branch.
> 
> I'll submit the updated version patch.

Thanks.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Michael Paquier
Date:
On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> Thank you for letting me know. I've attached the latest version patch set.

This needs a rebase.  Patch 0002 is conflicting with some of the
recent changes done in syncrep.c and procarray.c, at least.
--
Michael

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/08/21 15:25, Masahiko Sawada wrote:
> On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2020/07/27 15:59, Masahiko Sawada wrote:
>>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>>>>
>>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote:
>>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>>>>>>>> I've attached the latest version patches. I've incorporated the review
>>>>>>>>>> comments I got so far and improved locking strategy.
>>>>>>>>>
>>>>>>>>> Thanks for updating the patch!
>>>>>>>>
>>>>>>>> +1
>>>>>>>> I'm interested in these patches and now studying them. While checking
>>>>>>>> the behaviors of the patched PostgreSQL, I got three comments.
>>>>>>>
>>>>>>> Thank you for testing this patch!
>>>>>>>
>>>>>>>>
>>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD.
>>>>>>>> But in the patched version, when I did that, I got the following error.
>>>>>>>> Is this intentional?
>>>>>>>>
>>>>>>>> ERROR:  cannot assign TransactionIds during recovery
>>>>>>>
>>>>>>> No, it should be fixed. I'm going to fix this by not collecting
>>>>>>> participants for atomic commit during recovery.
>>>>>>
>>>>>> Thanks for trying to fix the issues!
>>>>>>
>>>>>> I'd like to report one more issue. When I started new transaction
>>>>>> in the local server, executed INSERT in the remote server via
>>>>>> postgres_fdw and then quit psql, I got the following assertion failure.
>>>>>>
>>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>>>>>> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
>>>>>> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
>>>>>> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
>>>>>> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
>>>>>> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
>>>>>> 5   postgres                            0x000000010d313da3 proc_exit + 19
>>>>>> 6   postgres                            0x000000010d35112f PostgresMain + 3711
>>>>>> 7   postgres                            0x000000010d27bb3a BackendRun + 570
>>>>>> 8   postgres                            0x000000010d27af6b BackendStartup + 475
>>>>>> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
>>>>>> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
>>>>>> 11  postgres                            0x000000010d1597b9 main + 761
>>>>>> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
>>>>>> 13  ???                                 0x0000000000000003 0x0 + 3
>>>>>>
>>>>>
>>>>> Thank you for reporting the issue!
>>>>>
>>>>> I've attached the latest version patch that incorporated all comments
>>>>> I got so far. I've removed the patch adding the 'prefer' mode of
>>>>> foreign_twophase_commit to keep the patch set simple.
>>>>
>>>>
>>>> I have started to review the patchset. Just a quick comment.
>>>>
>>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
>>>> contains changes (adding fdwxact includes) for
>>>> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
>>>> and  src/backend/executor/execPartition.c files that doesn't seem to be
>>>> required with the latest version.
>>>
>>> Thanks for your comment.
>>>
>>> Right. I've removed these changes on the local branch.
>>
>> The latest patches failed to be applied to the master branch. Could you rebase the patches?
>>
> 
> Thank you for letting me know. I've attached the latest version patch set.

Thanks for updating the patch!

IMO it's not easy to commit this 2PC patch at once because it's still large
and complicated. So I'm thinking it's better to separate the feature into
several parts and commit them gradually. What about separating
the feature into the following parts?

#1
Originally the server just executed the xact callback that each FDW
registered when the transaction was committed. The patch changes this so
that the server manages the FDW participants in the transaction and
triggers them to execute COMMIT or ROLLBACK. IMO this change can be
applied without the 2PC feature. Thoughts?

Even if we commit this patch and add a new interface for FDWs, we would
need to keep the old interface for FDWs that provide only the old one.

#2
Originally, when there was FDW access in the transaction,
PREPARE TRANSACTION on that transaction failed with an error. The patch
allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
even when FDW access occurs in the transaction. IMO this change can be
applied without the *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED being automatically executed for each FDW
inside the "top" COMMIT command). Thoughts?

I'm not sure yet whether automatic resolution of "unresolved" prepared
transactions by the resolver process is necessary for this change or not.
If it's not necessary, it would be better to exclude the resolver process
from this change at this stage, to make the patch simpler.
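To illustrate what #2 would allow, consider an explicit two-phase session (table and gid names are placeholders; on HEAD the PREPARE TRANSACTION step currently fails once an FDW has been accessed in the transaction):

```sql
BEGIN;
INSERT INTO local_tbl VALUES (1);    -- local write
INSERT INTO remote_tbl VALUES (1);   -- write through postgres_fdw
PREPARE TRANSACTION 'gx1';           -- rejected today when FDWs are involved
-- after deciding the global outcome:
COMMIT PREPARED 'gx1';               -- or ROLLBACK PREPARED 'gx1'
```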


#3
Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW,
based on the #1 and #2 patches.


What's your opinion about this?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/09/07 17:59, Fujii Masao wrote:
> 
> 
> On 2020/08/21 15:25, Masahiko Sawada wrote:
>> On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>
>>>
>>>
>>> On 2020/07/27 15:59, Masahiko Sawada wrote:
>>>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>>>>>
>>>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote:
>>>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>>>>>>>>> I've attached the latest version patches. I've incorporated the review
>>>>>>>>>>> comments I got so far and improved locking strategy.
>>>>>>>>>>
>>>>>>>>>> Thanks for updating the patch!
>>>>>>>>>
>>>>>>>>> +1
>>>>>>>>> I'm interested in these patches and now studying them. While checking
>>>>>>>>> the behaviors of the patched PostgreSQL, I got three comments.
>>>>>>>>
>>>>>>>> Thank you for testing this patch!
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD.
>>>>>>>>> But in the patched version, when I did that, I got the following error.
>>>>>>>>> Is this intentional?
>>>>>>>>>
>>>>>>>>> ERROR:  cannot assign TransactionIds during recovery
>>>>>>>>
>>>>>>>> No, it should be fixed. I'm going to fix this by not collecting
>>>>>>>> participants for atomic commit during recovery.
>>>>>>>
>>>>>>> Thanks for trying to fix the issues!
>>>>>>>
>>>>>>> I'd like to report one more issue. When I started new transaction
>>>>>>> in the local server, executed INSERT in the remote server via
>>>>>>> postgres_fdw and then quit psql, I got the following assertion failure.
>>>>>>>
>>>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>>>>>>> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
>>>>>>> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
>>>>>>> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
>>>>>>> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
>>>>>>> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
>>>>>>> 5   postgres                            0x000000010d313da3 proc_exit + 19
>>>>>>> 6   postgres                            0x000000010d35112f PostgresMain + 3711
>>>>>>> 7   postgres                            0x000000010d27bb3a BackendRun + 570
>>>>>>> 8   postgres                            0x000000010d27af6b BackendStartup + 475
>>>>>>> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
>>>>>>> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
>>>>>>> 11  postgres                            0x000000010d1597b9 main + 761
>>>>>>> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
>>>>>>> 13  ???                                 0x0000000000000003 0x0 + 3
>>>>>>>
>>>>>>
>>>>>> Thank you for reporting the issue!
>>>>>>
>>>>>> I've attached the latest version patch that incorporated all comments
>>>>>> I got so far. I've removed the patch adding the 'prefer' mode of
>>>>>> foreign_twophase_commit to keep the patch set simple.
>>>>>
>>>>>
>>>>> I have started to review the patchset. Just a quick comment.
>>>>>
>>>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
>>>>> contains changes (adding fdwxact includes) for
>>>>> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
>>>>> and  src/backend/executor/execPartition.c files that doesn't seem to be
>>>>> required with the latest version.
>>>>
>>>> Thanks for your comment.
>>>>
>>>> Right. I've removed these changes on the local branch.
>>>
>>> The latest patches failed to be applied to the master branch. Could you rebase the patches?
>>>
>>
>> Thank you for letting me know. I've attached the latest version patch set.
> 
> Thanks for updating the patch!
> 
> IMO it's not easy to commit this 2PC patch at once because it's still large
> and complicated. So I'm thinking it's better to separate the feature into
> several parts and commit them gradually. What about separating
> the feature into the following parts?
> 
> #1
> Originally the server just executed xact callback that each FDW registered
> when the transaction was committed. The patch changes this so that
> the server manages the participants of FDW in the transaction and triggers
> them to execute COMMIT or ROLLBACK. IMO this change can be applied
> without 2PC feature. Thought?
> 
> Even if we commit this patch and add new interface for FDW, we would
> need to keep the old interface, for the FDW providing only old interface.
> 
> 
> #2
> Originally when there was the FDW access in the transaction,
> PREPARE TRANSACTION on that transaction failed with an error. The patch
> allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
> even when FDW access occurs in the transaction. IMO this change can be
> applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
> inside "top" COMMIT command). Thought?
> 
> I'm not sure yet whether automatic resolution of "unresolved" prepared
> transactions by the resolver process is necessary for this change or not.
> If it's not necessary, it's better to exclude the resolver process from this
> change, at this stage, to make the patch simpler.
> 
> 
> #3
> Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW,
> based on the #1 and #2 patches.
> 
> 
> What's your opinion about this?

Also I'd like to report some typos in the patch.

+#define ServerSupportTransactionCallack(fdw_part) \

"Callack" in this macro name should be "Callback"?

+#define SeverSupportTwophaseCommit(fdw_part) \

"Sever" in this macro name should be "Server"?

+  proname => 'pg_stop_foreing_xact_resolver', provolatile => 'v', prorettype => 'bool',

"foreing" should be "foreign"?

+ * FdwXact entry we call get_preparedid callback to get a transaction

"get_preparedid" should be "get_prepareid"?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> IMO it's not easy to commit this 2PC patch at once because it's still large
> and complicated. So I'm thinking it's better to separate the feature into
> several parts and commit them gradually.
>

Hmm, I don't see that we have a consensus on the design and/or
interfaces of this patch, and without that, proceeding to commit
doesn't seem advisable. Here are a few points which I remember offhand
that require more work.
1. There is a competing design proposed and being discussed in another
thread [1] for this purpose. I think both the approaches have pros and
cons but there doesn't seem to be any conclusion yet on which one is
better.
2. In this thread, we have discussed trying to integrate this patch
with some other FDWs (say MySQL, MongoDB, etc.) to ensure that the
APIs we are exposing are general enough that other FDWs can use them
to implement 2PC. I could see some speculation about this, but no
concrete work has been done.
3. In another thread [1], we have seen that the patch being discussed
in this thread might need to be re-designed if we have to use some other
design for global visibility than what is proposed in that thread. I
think it is quite likely that this can happen, considering no one has
been able to come up with a solution to the major design problems
spotted in that patch yet.

It appears to me that even though these points were raised before in
some form, we are just trying to bypass them to commit whatever we have
in the current patch, which I find quite surprising.

[1] - https://www.postgresql.org/message-id/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru

-- 
With Regards,
Amit Kapila.



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/09/08 10:34, Amit Kapila wrote:
> On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> IMO it's not easy to commit this 2PC patch at once because it's still large
>> and complicated. So I'm thinking it's better to separate the feature into
>> several parts and commit them gradually.
>>
> 
> Hmm, I don't see that we have a consensus on the design and or
> interfaces of this patch and without that proceeding for commit
> doesn't seem advisable. Here are a few points which I remember offhand
> that require more work.

Thanks!

> 1. There is a competing design proposed and being discussed in another
> thread [1] for this purpose. I think both the approaches have pros and
> cons but there doesn't seem to be any conclusion yet on which one is
> better.

I was thinking that [1] was discussing the global snapshot feature for
"atomic visibility" rather than a solution like 2PC for "atomic commit".
But if another approach for "atomic commit" was also proposed in [1],
that's good. I will check that.

> 2. In this thread, we have discussed to try integrating this patch
> with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
> APIs we are exposing are general enough that other FDWs can use them
> to implement 2PC. I could see some speculations about the same but no
> concrete work on the same has been done.

Yes, you're right.

> 3. In another thread [1], we have seen that the patch being discussed
> in this thread might need to re-designed if we have to use some other
> design for global-visibility than what is proposed in that thread. I
> think it is quite likely that can happen considering no one is able to
> come up with the solution to major design problems spotted in that
> patch yet.

Do you mean that the global-visibility patch should come first, before the "2PC" patch?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2020/09/08 10:34, Amit Kapila wrote:
> > On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >> IMO it's not easy to commit this 2PC patch at once because it's still large
> >> and complicated. So I'm thinking it's better to separate the feature into
> >> several parts and commit them gradually.
> >>
> >
> > Hmm, I don't see that we have a consensus on the design and or
> > interfaces of this patch and without that proceeding for commit
> > doesn't seem advisable. Here are a few points which I remember offhand
> > that require more work.
>
> Thanks!
>
> > 1. There is a competing design proposed and being discussed in another
> > thread [1] for this purpose. I think both the approaches have pros and
> > cons but there doesn't seem to be any conclusion yet on which one is
> > better.
>
> I was thinking that [1] was discussing global snapshot feature for
> "atomic visibility" rather than the solution like 2PC for "atomic commit".
> But if another approach for "atomic commit" was also proposed at [1],
> that's good. I will check that.
>

Okay, that makes sense.

> > 2. In this thread, we have discussed to try integrating this patch
> > with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
> > APIs we are exposing are general enough that other FDWs can use them
> > to implement 2PC. I could see some speculations about the same but no
> > concrete work on the same has been done.
>
> Yes, you're right.
>
> > 3. In another thread [1], we have seen that the patch being discussed
> > in this thread might need to re-designed if we have to use some other
> > design for global-visibility than what is proposed in that thread. I
> > think it is quite likely that can happen considering no one is able to
> > come up with the solution to major design problems spotted in that
> > patch yet.
>
> You imply that global-visibility patch should be come first before "2PC" patch?
>

I intend to say that the global-visibility work can impact this in a
major way and we have analyzed that to some extent during a discussion
on the other thread. So, I think without having a complete
design/solution that addresses both the 2PC and global-visibility, it
is not apparent what is the right way to proceed. It seems to me that
rather than working on individual (or smaller) parts one needs to come
up with a bigger picture (or overall design) and then once we have
figured that out correctly, it would be easier to decide which parts
can go first.

-- 
With Regards,
Amit Kapila.



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Amit Kapila <amit.kapila16@gmail.com>
> I intend to say that the global-visibility work can impact this in a
> major way and we have analyzed that to some extent during a discussion
> on the other thread. So, I think without having a complete
> design/solution that addresses both the 2PC and global-visibility, it
> is not apparent what is the right way to proceed. It seems to me that
> rather than working on individual (or smaller) parts one needs to come
> up with a bigger picture (or overall design) and then once we have
> figured that out correctly, it would be easier to decide which parts
> can go first.

I'm really sorry I've been getting later and later in publishing the revised scale-out design wiki to discuss
the big picture!  I don't know why I'm taking this long; I feel as if I were captive in a time prison (yes, nobody is
holding me captive; I'm just late.)  Please wait a few days.
 

But to proceed with the development, let me comment on the atomic commit and global visibility.

* We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent, and whether
we can avoid it.
 

* I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB.  That is, 2PC for
multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly
available, and timestamp-based global visibility.  However, the timestamp-based approach forces a database instance to
shut down when the node's clock drifts too far from the other nodes.
 

* Or, maybe we can use the following commitment ordering, which doesn't require timestamps or any other information to
be transferred among the cluster nodes.  However, this seems to require tracking the order of read and write operations
among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.  The MVCO paper
seems to present the information we need, but I haven't understood it well yet (it's difficult.)  Could anybody
kindly interpret it?
 

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues to
be addressed:
 

1. Make the FDW API implementable by FDWs other than postgres_fdw (this is what Amit-san kindly pointed out.)  I think
oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be a good one because it exposes the XA
feature as SQL statements, not as C functions as defined in the XA specification.
 

2. 2PC processing is queued and serialized in one background worker.  That severely limits transaction throughput.
Each backend should perform 2PC itself.
 

3. postgres_fdw cannot detect remote updates when a UDF executed on a remote node updates data.


Regards
Takayuki Tsunakawa




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 7 Sep 2020 at 17:59, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/08/21 15:25, Masahiko Sawada wrote:
> > On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>
> >>
> >> On 2020/07/27 15:59, Masahiko Sawada wrote:
> >>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>>>>
> >>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote:
> >>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
> >>>>>>>>>> I've attached the latest version patches. I've incorporated the review
> >>>>>>>>>> comments I got so far and improved locking strategy.
> >>>>>>>>>
> >>>>>>>>> Thanks for updating the patch!
> >>>>>>>>
> >>>>>>>> +1
> >>>>>>>> I'm interested in these patches and now studying them. While checking
> >>>>>>>> the behaviors of the patched PostgreSQL, I got three comments.
> >>>>>>>
> >>>>>>> Thank you for testing this patch!
> >>>>>>>
> >>>>>>>>
> >>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD.
> >>>>>>>> But in the patched version, when I did that, I got the following error.
> >>>>>>>> Is this intentional?
> >>>>>>>>
> >>>>>>>> ERROR:  cannot assign TransactionIds during recovery
> >>>>>>>
> >>>>>>> No, it should be fixed. I'm going to fix this by not collecting
> >>>>>>> participants for atomic commit during recovery.
> >>>>>>
> >>>>>> Thanks for trying to fix the issues!
> >>>>>>
> >>>>>> I'd like to report one more issue. When I started new transaction
> >>>>>> in the local server, executed INSERT in the remote server via
> >>>>>> postgres_fdw and then quit psql, I got the following assertion failure.
> >>>>>>
> >>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
> >>>>>> 0   postgres                            0x000000010d52f3c0 ExceptionalCondition + 160
> >>>>>> 1   postgres                            0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
> >>>>>> 2   postgres                            0x000000010cefff14 AtProcExit_FdwXact + 20
> >>>>>> 3   postgres                            0x000000010d313fe3 shmem_exit + 179
> >>>>>> 4   postgres                            0x000000010d313e7a proc_exit_prepare + 122
> >>>>>> 5   postgres                            0x000000010d313da3 proc_exit + 19
> >>>>>> 6   postgres                            0x000000010d35112f PostgresMain + 3711
> >>>>>> 7   postgres                            0x000000010d27bb3a BackendRun + 570
> >>>>>> 8   postgres                            0x000000010d27af6b BackendStartup + 475
> >>>>>> 9   postgres                            0x000000010d279ed1 ServerLoop + 593
> >>>>>> 10  postgres                            0x000000010d277940 PostmasterMain + 6016
> >>>>>> 11  postgres                            0x000000010d1597b9 main + 761
> >>>>>> 12  libdyld.dylib                       0x00007fff7161e3d5 start + 1
> >>>>>> 13  ???                                 0x0000000000000003 0x0 + 3
> >>>>>>
> >>>>>
> >>>>> Thank you for reporting the issue!
> >>>>>
> >>>>> I've attached the latest version patch that incorporated all comments
> >>>>> I got so far. I've removed the patch adding the 'prefer' mode of
> >>>>> foreign_twophase_commit to keep the patch set simple.
> >>>>
> >>>>
> >>>> I have started to review the patchset. Just a quick comment.
> >>>>
> >>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
> >>>> contains changes (adding fdwxact includes) for
> >>>> src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
> >>>> and  src/backend/executor/execPartition.c files that doesn't seem to be
> >>>> required with the latest version.
> >>>
> >>> Thanks for your comment.
> >>>
> >>> Right. I've removed these changes on the local branch.
> >>
> >> The latest patches failed to be applied to the master branch. Could you rebase the patches?
> >>
> >
> > Thank you for letting me know. I've attached the latest version patch set.
>
> Thanks for updating the patch!
>
> IMO it's not easy to commit this 2PC patch at once because it's still large
> and complicated. So I'm thinking it's better to separate the feature into
> several parts and commit them gradually. What about separating
> the feature into the following parts?
>
> #1
> Originally the server just executed xact callback that each FDW registered
> when the transaction was committed. The patch changes this so that
> the server manages the participants of FDW in the transaction and triggers
> them to execute COMMIT or ROLLBACK. IMO this change can be applied
> without 2PC feature. Thought?
>
> Even if we commit this patch and add new interface for FDW, we would
> need to keep the old interface, for the FDW providing only old interface.
>
>
> #2
> Originally when there was the FDW access in the transaction,
> PREPARE TRANSACTION on that transaction failed with an error. The patch
> allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
> even when FDW access occurs in the transaction. IMO this change can be
> applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
> inside "top" COMMIT command). Thought?
>
> I'm not sure yet whether automatic resolution of "unresolved" prepared
> transactions by the resolver process is necessary for this change or not.
> If it's not necessary, it's better to exclude the resolver process from this
> change, at this stage, to make the patch simpler.
>
>
> #3
> Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW,
> based on the #1 and #2 patches.
>
>
> What's your opinion about this?

Regardless of which approach to the 2PC implementation is selected,
splitting the patch into small logical patches is a good idea, and the
above suggestion makes sense to me.

Regarding #2, I guess that we would need the resolver and launcher
processes even if we support only the manual PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED commands:

On COMMIT PREPARED command, I think we should commit the local
prepared transaction first and then commit the foreign prepared
transactions. Otherwise, it violates atomic commit principles if the
local node fails to commit a foreign prepared transaction and the user
then switches to ROLLBACK PREPARED. OTOH, once we have committed
locally, we cannot change to rollback. And attempting to commit foreign
prepared transactions could lead to an error due to a connection error,
OOM caused by palloc, etc. Therefore, we discussed using background
processes, resolver and launcher, to take charge of committing foreign
prepared transactions so that the process that executed COMMIT PREPARED
never errors out after the local commit. So I think patch #2 will also
include the resolver and launcher processes. And in patch #3 we will
change the code to support automatic 2PC as you suggested.
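The ordering argument above can be sketched as follows. This is an illustrative model only (the names `commit_prepared`, `resolver_queue`, and `resolver_pass` are invented, not the patch's API): the local commit is the point of no return, and any failure to commit a foreign prepared transaction is queued for the background resolver instead of raising an error back to the user.

```python
# Toy model of "commit local first, never error out afterwards";
# all names are hypothetical, not taken from the patch.

from collections import deque

resolver_queue = deque()          # consumed by a background resolver process

def commit_prepared(local, foreign_xacts):
    local["state"] = "committed"  # point of no return: the decision is COMMIT
    for fx in foreign_xacts:
        try:
            fx["commit"]()        # may fail: connection error, OOM, ...
            fx["state"] = "committed"
        except Exception:
            # Do NOT raise: the local xact is already committed, so the
            # only correct outcome is to retry this commit later.
            resolver_queue.append(fx)

def resolver_pass():
    """One pass of the background resolver over unresolved foreign xacts."""
    for _ in range(len(resolver_queue)):
        fx = resolver_queue.popleft()
        try:
            fx["commit"]()
            fx["state"] = "committed"
        except Exception:
            resolver_queue.append(fx)   # keep retrying on later passes
```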

In addition, the part of the automatic resolution of in-doubt
transactions can also be a separate patch, which will be the #4 patch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> #2
> Originally when there was the FDW access in the transaction,
> PREPARE TRANSACTION on that transaction failed with an error. The patch
> allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
> even when FDW access occurs in the transaction. IMO this change can be
> applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
> inside "top" COMMIT command). Thought?
>
> I'm not sure yet whether automatic resolution of "unresolved" prepared
> transactions by the resolver process is necessary for this change or not.
> If it's not necessary, it's better to exclude the resolver process from this
> change, at this stage, to make the patch simpler.

I agree with this. However, in the case of explicit prepare, if we are
not going to try automatic resolution, it might be better to provide a
way to pass along the information about transactions prepared on the
foreign servers if they cannot be resolved at the time of commit, so
that the user can take over and resolve them him/herself. This was an
idea that Tom had suggested at the very beginning of the first take.

--
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/09/08 12:03, Amit Kapila wrote:
> On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> On 2020/09/08 10:34, Amit Kapila wrote:
>>> On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>
>>>> IMO it's not easy to commit this 2PC patch at once because it's still large
>>>> and complicated. So I'm thinking it's better to separate the feature into
>>>> several parts and commit them gradually.
>>>>
>>>
>>> Hmm, I don't see that we have a consensus on the design and or
>>> interfaces of this patch and without that proceeding for commit
>>> doesn't seem advisable. Here are a few points which I remember offhand
>>> that require more work.
>>
>> Thanks!
>>
>>> 1. There is a competing design proposed and being discussed in another
>>> thread [1] for this purpose. I think both the approaches have pros and
>>> cons but there doesn't seem to be any conclusion yet on which one is
>>> better.
>>
>> I was thinking that [1] was discussing global snapshot feature for
>> "atomic visibility" rather than the solution like 2PC for "atomic commit".
>> But if another approach for "atomic commit" was also proposed at [1],
>> that's good. I will check that.
>>
> 
> Okay, that makes sense.

I read Alexey's 2PC patch (0001-Add-postgres_fdw.use_twophase-GUC-to-use-2PC.patch)
proposed at [1]. As Alexey said in that thread, there are two big differences
between his patch and Sawada-san's: 1) whether there is a resolver process
for foreign transactions, and 2) whether the 2PC logic is implemented only
inside postgres_fdw or in both the FDW and PostgreSQL core.

I think that 2) is the first decision point. Alexey's 2PC patch is very simple
and all the 2PC logic is implemented only inside postgres_fdw. But this
means that 2PC is not usable if multiple types of FDW (e.g., postgres_fdw
and mysql_fdw) participate in the transaction. This may be OK if we implement
the 2PC feature only for PostgreSQL sharding using postgres_fdw. But if we
implement 2PC as an improvement to FDW independently of PostgreSQL
sharding, I think it's necessary to support other FDWs. And this is our
direction, isn't it?

Sawada-san's patch supports that case by implementing some components
for that also in PostgreSQL core. For example, with the patch, all the remote
transactions that participate in the transaction are managed by PostgreSQL
core instead of the postgres_fdw layer.

Therefore, at least regarding difference 2), I think that Sawada-san's
approach is better. Thoughts?

[1]
https://postgr.es/m/3ef7877bfed0582019eab3d462a43275@postgrespro.ru

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
Alexey-san, Sawada-san,
cc: Fujii-san,


From: Fujii Masao <masao.fujii@oss.nttdata.com>
> But if we
> implement 2PC as the improvement on FDW independently from PostgreSQL
> sharding, I think that it's necessary to support other FDW. And this is our
> direction, isn't it?

I understand it the same way as Fujii-san.  A 2PC FDW is itself useful, so I think we should pursue a tidy FDW interface
and good performance within the FDW framework.  "Tidy" means that many other FDWs should be able to implement it.  I
guess XA/JTA is the only material we can use to consider whether the FDW interface is good.
 


> Sawada-san's patch supports that case by implementing some components
> for that also in PostgreSQL core. For example, with the patch, all the remote
> transactions that participate in the transaction are managed by PostgreSQL
> core instead of the postgres_fdw layer.
> 
> Therefore, at least regarding the difference 2), I think that Sawada-san's
> approach is better. Thought?

I think so.  Sawada-san's patch needs to address the design issues I posed before digging into the code for thorough
review, though.
 

BTW, is there something Sawada-san can take from Alexey-san's patch?  I'm concerned about the performance for practical
use.  Do you two have differences in these points, for instance?  The first two items are often cited to evaluate the
algorithm's performance, as you know.
 

* The number of round trips to remote nodes.
* The number of disk I/Os on each node and all nodes in total (WAL, two-phase file, pg_subtrans file, CLOG?).
* Are prepare and commit executed in parallel on remote nodes? (serious DBMSs do so)
* Is there any serialization point in the processing? (Sawada-san's has one)

I'm sorry to repeat myself, but I don't think we can compromise on 2PC performance.  Of course, we recommend that users
design a schema that co-locates the data each transaction accesses to avoid 2PC, but that's not always possible (e.g.,
when secondary indexes are used.)
 

Plus, as the following quote from the TPC-C specification shows, TPC-C requires 15% of (Payment?) transactions to do 2PC.
(I learned this from Microsoft's, CockroachDB's, or Citus Data's site.)
 


--------------------------------------------------
Independent of the mode of selection, the customer resident 
warehouse is the home warehouse 85% of the time and is a randomly selected remote warehouse 15% of the time. 
This can be implemented by generating two random numbers x and y within [1 .. 100]; 

. If x <= 85 a customer is selected from the selected district number (C_D_ID = D_ID) and the home warehouse 
number (C_W_ID = W_ID). The customer is paying through his/her own warehouse. 

. If x > 85 a customer is selected from a random district number (C_D_ID is randomly selected within [1 .. 10]), 
and a random remote warehouse number (C_W_ID is randomly selected within the range of active 
warehouses (see Clause 4.2.2), and C_W_ID ≠ W_ID). The customer is paying through a warehouse and a 
district other than his/her own. 
--------------------------------------------------
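The selection rule quoted above can be transcribed directly. This is a hypothetical sketch (the function name is invented, not code from any TPC-C kit): x in [1..100], x <= 85 selects the home warehouse, otherwise a random remote warehouse, which is exactly the cross-warehouse Payment that needs 2PC.

```python
# Direct transcription of the quoted TPC-C clause; names are illustrative.

import random

def select_customer_warehouse(w_id, active_warehouses, rng=random):
    """Return (C_W_ID, kind) for a Payment transaction issued at warehouse w_id."""
    x = rng.randint(1, 100)
    if x <= 85 or len(active_warehouses) == 1:
        return w_id, "home"                               # C_W_ID = W_ID
    remote = rng.choice([w for w in active_warehouses if w != w_id])
    return remote, "remote"                               # C_W_ID != W_ID
```

With 15% remote selections, roughly one Payment in seven crosses warehouses, which is why the cost of 2PC matters for this workload.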


Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/09/10 10:13, tsunakawa.takay@fujitsu.com wrote:
> Alexey-san, Sawada-san,
> cc: Fujii-san,
> 
> 
> From: Fujii Masao <masao.fujii@oss.nttdata.com>
>> But if we
>> implement 2PC as the improvement on FDW independently from PostgreSQL
>> sharding, I think that it's necessary to support other FDW. And this is our
>> direction, isn't it?
> 
> I understand it the same way as Fujii-san.  A 2PC FDW is itself useful, so I think we should pursue a tidy FDW interface
> and good performance within the FDW framework.  "Tidy" means that many other FDWs should be able to implement it.  I
> guess XA/JTA is the only material we can use to consider whether the FDW interface is good.
 

Originally, start(), commit() and rollback() are supported as FDW interfaces. With his patch, prepare() is also supported.
What other interfaces need to be supported per XA/JTA?
 

As far as I and Sawada-san discussed upthread, to support MySQL, another type of start() would be necessary to
issue the "XA START id" command. end() might also be necessary to issue "XA END id", but that command can be issued via
prepare() together with "XA PREPARE id".
 

I'm not familiar with XA/JTA and the XA transaction interfaces of other major DBMSs. So I'd like to know what other
interfaces are additionally necessary.
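To illustrate the point, the callbacks under discussion might map onto MySQL's XA statements roughly as below. The callback names on the left are hypothetical placeholders, not the patch's actual interface; only the `XA ...` statements themselves are real MySQL syntax. Note how the end() step is folded into the prepare callback, as suggested above.

```python
# Hypothetical mapping from FDW callback slots to the SQL a MySQL FDW would
# send; postgres_fdw would instead send PREPARE TRANSACTION / COMMIT PREPARED.

def mysql_xa_commands(callback, xid):
    templates = {
        "start":             ["XA START '{x}'"],
        # "XA END" is issued via the prepare callback, together with XA PREPARE.
        "prepare":           ["XA END '{x}'", "XA PREPARE '{x}'"],
        "commit_prepared":   ["XA COMMIT '{x}'"],
        "rollback_prepared": ["XA ROLLBACK '{x}'"],
    }
    return [t.format(x=xid) for t in templates[callback]]
```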
 

> 
> 
>> Sawada-san's patch supports that case by implementing some components
>> for that also in PostgreSQL core. For example, with the patch, all the remote
>> transactions that participate in the transaction are managed by PostgreSQL
>> core instead of the postgres_fdw layer.
>>
>> Therefore, at least regarding the difference 2), I think that Sawada-san's
>> approach is better. Thought?
> 
> I think so.  Sawada-san's patch needs to address the design issues I posed before digging into the code for thorough
> review, though.
 
> 
> BTW, is there something Sawada-san can take from Alexey-san's patch?  I'm concerned about the performance for
> practical use.  Do you two have differences in these points, for instance?
 

IMO Sawada-san's version of 2PC is less performant, but that's because
his patch provides more functionality. For example, with his patch,
WAL is written to automatically complete the unresolved foreign transactions
in the case of failure. OTOH, Alexey's patch introduces no new WAL for 2PC.
Of course, generating more WAL would cause more overhead.
But if we need the automatic resolution feature, it's inevitable to introduce
new WAL whichever patch we choose.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > I intend to say that the global-visibility work can impact this in a
> > major way and we have analyzed that to some extent during a discussion
> > on the other thread. So, I think without having a complete
> > design/solution that addresses both the 2PC and global-visibility, it
> > is not apparent what is the right way to proceed. It seems to me that
> > rather than working on individual (or smaller) parts one needs to come
> > up with a bigger picture (or overall design) and then once we have
> > figured that out correctly, it would be easier to decide which parts
> > can go first.
>
> I'm really sorry I've been getting late and late and late x10 to publish the revised scale-out design wiki to discuss
> the big picture!  I don't know why I'm taking this long time; I feel I were captive in a time prison (yes, nobody is
> holding me captive; I'm just late.)  Please wait a few days.
>
> But to proceed with the development, let me comment on the atomic commit and global visibility.
>
> * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and if
> we can avoid it.
>
> * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB.  That is, 2PC for
> multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly
> available, and the timestamp-based global visibility.  However, the timestamp-based approach makes the database instance
> shut down when the node's clock is distant from the other nodes.
>
> * Or, maybe we can use the following Commitment ordering, which doesn't require the timestamp or any other information
> to be transferred among the cluster nodes.  However, this seems to have to track the order of read and write operations
> among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.  The MVCO paper
> seems to present the information we need, but I haven't understood it well yet (it's difficult.)  Could anybody
> kindly interpret this?
>
> Commitment ordering (CO) - yoavraz2
> https://sites.google.com/site/yoavraz2/the_principle_of_co
>
>
> As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues
> to be addressed:
>
> 1. Make FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.)  I think
> oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA
> feature as SQL statements, not C functions as defined in the XA specification.

I agree that we need to verify that the new FDW APIs will be suitable
for other FDWs than postgres_fdw as well.

>
> 2. 2PC processing is queued and serialized in one background worker.  That severely subdues transaction throughput.
> Each backend should perform 2PC.

I'm not sure it's safe for each backend to perform PREPARE and COMMIT
PREPARED, since the current design aims not to lead to an inconsistency
between the actual transaction result and the result the user sees.
But in the future, I think we can have multiple background workers per
database for better performance.

>
> 3. postgres_fdw cannot detect remote updates when the UDF executed on a remote node updates data.

I assume that you mean pushing the UDF down to a foreign server.
If so, I think we can do this by improving postgres_fdw. In the
current patch, registering and unregistering a foreign server to a
group of 2PC and marking a foreign server as updated are the FDW's
responsibility. So perhaps if we had a way to tell postgres_fdw that the
UDF might update the data on the foreign server, postgres_fdw could
mark the foreign server as updated if the UDF is shippable.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2020/09/11 0:37, Masahiko Sawada wrote:
> On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
>>
>> From: Amit Kapila <amit.kapila16@gmail.com>
>>> I intend to say that the global-visibility work can impact this in a
>>> major way and we have analyzed that to some extent during a discussion
>>> on the other thread. So, I think without having a complete
>>> design/solution that addresses both the 2PC and global-visibility, it
>>> is not apparent what is the right way to proceed. It seems to me that
>>> rather than working on individual (or smaller) parts one needs to come
>>> up with a bigger picture (or overall design) and then once we have
>>> figured that out correctly, it would be easier to decide which parts
>>> can go first.
>>
>> I'm really sorry I've been getting late and late and late x10 to publish the revised scale-out design wiki to discuss
>> the big picture!  I don't know why I'm taking this long time; I feel I were captive in a time prison (yes, nobody is
>> holding me captive; I'm just late.)  Please wait a few days.
 
>>
>> But to proceed with the development, let me comment on the atomic commit and global visibility.
>>
>> * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and if
>> we can avoid it.
 
>>
>> * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB.  That is, 2PC for
>> multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly
>> available, and the timestamp-based global visibility.  However, the timestamp-based approach makes the database instance
>> shut down when the node's clock is distant from the other nodes.
 
>>
>> * Or, maybe we can use the following Commitment ordering that doesn't require the timestamp or any other information
>> to be transferred among the cluster nodes.  However, this seems to have to track the order of read and write operations
>> among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.  The MVCO paper
>> seems to present the information we need, but I haven't understood it well yet (it's difficult.)  Could anybody
>> kindly interpret this?
 
>>
>> Commitment ordering (CO) - yoavraz2
>> https://sites.google.com/site/yoavraz2/the_principle_of_co
>>
>>
>> As for the Sawada-san's 2PC patch, which I find interesting purely as FDW enhancement, I raised the following issues
>> to be addressed:
 
>>
>> 1. Make FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.)  I think
>> oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA
>> feature as SQL statements, not C functions as defined in the XA specification.
 
> 
> I agree that we need to verify new FDW APIs will be suitable for other
> FDWs than postgres_fdw as well.
> 
>>
>> 2. 2PC processing is queued and serialized in one background worker.  That severely subdues transaction throughput.
>> Each backend should perform 2PC.
 
> 
> Not sure it's safe that each backend perform PREPARE and COMMIT
> PREPARED since the current design is for not leading an inconsistency
> between the actual transaction result and the result the user sees.

Can I check my understanding about why the resolver process is necessary?

Firstly, you think that issuing the COMMIT PREPARED command to the foreign server can cause an error, for example, because
of a connection error, OOM, etc. On the other hand, only waiting for another process to issue the command is less likely to
cause an error. Right?
 

If an error occurs in the backend process after the commit record is WAL-logged, the error would be reported to the client and
it may misunderstand that the transaction failed even though the commit record was already flushed. So you think that each
backend should not issue the COMMIT PREPARED command itself; to avoid that inconsistency, it's better to make another
process, the resolver, issue the command and just make each backend wait for it to complete. Right?
 

Also, using the resolver process has another merit: when there are unresolved foreign transactions but the corresponding
backend has exited, the resolver can try to resolve them. If something like this automatic resolution is necessary, a
process like the resolver would be necessary. Right?
 

To the contrary, if we don't need such automatic resolution (i.e., unresolved foreign transactions always need to be
resolved manually) and we can prevent the code that issues the COMMIT PREPARED command from causing an error (not sure if
that's possible, though...), probably we don't need the resolver process. Right?
 


> But in the future, I think we can have multiple background workers per
> database for better performance.

Yes, that's an idea.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Fujii Masao <masao.fujii@oss.nttdata.com>
> Originally start(), commit() and rollback() are supported as FDW interfaces.
> As far as I and Sawada-san discussed this upthread, to support MySQL,
> another type of start() would be necessary to issue "XA START id" command.
> end() might be also necessary to issue "XA END id", but that command can be
> issued via prepare() together with "XA PREPARE id".

Yeah, I think we can call xa_end and xa_prepare in the FDW's prepare function.

The issue is when to call xa_start, which requires an XID as an argument.  We don't want to call it in transactions that
access only one node...?
 


> With his patch, prepare() is supported. What other interfaces need to be
> supported per XA/JTA?
> 
> I'm not familiar with XA/JTA and XA transaction interfaces on other major
> DBMS. So I'd like to know what other interfaces are necessary additionally?

I think xa_start, xa_end, xa_prepare, xa_commit, xa_rollback, and xa_recover are sufficient.  The XA specification is
here:

https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf

You can see the function reference in Chapter 5, and the concept in Chapter 3.  Chapter 6 probably shows the
state transition (function call sequence.)
 


> IMO Sawada-san's version of 2PC is less performant, but it's because his
> patch provides more functionality. For example, with his patch, WAL is written
> to automatically complete the unresolve foreign transactions in the case of
> failure. OTOH, Alexey patch introduces no new WAL for 2PC.
> Of course, generating more WAL would cause more overhead.
> But if we need automatic resolution feature, it's inevitable to introduce new
> WAL whichever the patch we choose.

Please do not get me wrong.  I know Sawada-san is trying to ensure durability.  I just wanted to know what each patch
does and at how much cost in terms of disk and network I/Os, and whether one patch can take something from another for
less cost.  I'm simply guessing (without having read the code yet) that each transaction basically does:
 

- two round trips (prepare, commit) to each remote node
- two WAL writes (prepare, commit) on the local node and each remote node
- one write for two-phase state file on each remote node
- one write to record participants on the local node

It felt hard to think about the algorithm efficiency from the source code.  As you may have seen, DBMS textbooks
and/or papers describe disk and network I/Os to evaluate algorithms.  I thought such information would be useful before
going deeper into the source code.  Maybe such things can be written in the following Sawada-san's wiki or README in the
end.

Atomic Commit of Distributed Transactions
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions


Regards
Takayuki Tsunakawa





RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > 2. 2PC processing is queued and serialized in one background worker.  That
> severely subdues transaction throughput.  Each backend should perform
> 2PC.
> 
> Not sure it's safe that each backend perform PREPARE and COMMIT
> PREPARED since the current design is for not leading an inconsistency
> between the actual transaction result and the result the user sees.

As Fujii-san is asking, I also would like to know what situation you think is not safe.  Are you worried that the FDW's
commit function might call ereport(ERROR | FATAL | PANIC)?  If so, can't we stipulate that the FDW implementor should
ensure that the commit function always returns control to the caller?
 


> But in the future, I think we can have multiple background workers per
> database for better performance.

Does the database in "per database" mean the local database (that applications connect to), or the remote database
accessed via FDW?
 

I'm wondering how the FDW and background worker(s) can realize parallel prepare and parallel commit.  That is, the
coordinator transaction performs:
 

1. Issues prepare to all participant nodes, but doesn't wait for each reply.
2. Waits for replies from all participants.
3. Issues commit to all participant nodes, but doesn't wait for each reply.
4. Waits for replies from all participants.

If we just consider PostgreSQL and don't think about FDW, we can use libpq async functions -- PQsendQuery,
PQconsumeInput, and PQgetResult.  pgbench uses them so that one thread can issue SQL statements on multiple connections
in parallel.
 

But when we consider the FDW interface, plus other DBMSs, how can we achieve the parallelism?


> > 3. postgres_fdw cannot detect remote updates when the UDF executed on a
> remote node updates data.
> 
> I assume that you mean the pushing the UDF down to a foreign server.
> If so, I think we can do this by improving postgres_fdw. In the current patch,
> registering and unregistering a foreign server to a group of 2PC and marking a
> foreign server as updated is FDW responsible. So perhaps if we had a way to
> tell postgres_fdw that the UDF might update the data on the foreign server,
> postgres_fdw could mark the foreign server as updated if the UDF is shippable.

Maybe we can consider that VOLATILE functions update data.  That may be an overreaction, though.

Another idea is to add a new value to the ReadyForQuery message in the FE/BE protocol.  Say, 'U' if in a transaction
block that updated data.  Here we consider "updated" as having allocated an XID.
 

52.7. Message Formats
https://www.postgresql.org/docs/devel/protocol-message-formats.html
--------------------------------------------------
ReadyForQuery (B)

Byte1
Current backend transaction status indicator. Possible values are 'I' if idle (not in a transaction block); 'T' if in a
transaction block; or 'E' if in a failed transaction block (queries will be rejected until block is ended).
 
--------------------------------------------------


Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 11 Sep 2020 at 11:58, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/09/11 0:37, Masahiko Sawada wrote:
> > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> >>
> >> From: Amit Kapila <amit.kapila16@gmail.com>
> >>> I intend to say that the global-visibility work can impact this in a
> >>> major way and we have analyzed that to some extent during a discussion
> >>> on the other thread. So, I think without having a complete
> >>> design/solution that addresses both the 2PC and global-visibility, it
> >>> is not apparent what is the right way to proceed. It seems to me that
> >>> rather than working on individual (or smaller) parts one needs to come
> >>> up with a bigger picture (or overall design) and then once we have
> >>> figured that out correctly, it would be easier to decide which parts
> >>> can go first.
> >>
> >> I'm really sorry I've been getting late and late and late x10 to publish the revised scale-out design wiki to
> >> discuss the big picture!  I don't know why I'm taking this long time; I feel I were captive in a time prison (yes,
> >> nobody is holding me captive; I'm just late.)  Please wait a few days.
> >>
> >> But to proceed with the development, let me comment on the atomic commit and global visibility.
> >>
> >> * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and
> >> if we can avoid it.
> >>
> >> * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB.  That is, 2PC for
> >> multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly
> >> available, and the timestamp-based global visibility.  However, the timestamp-based approach makes the database instance
> >> shut down when the node's clock is distant from the other nodes.
> >>
> >> * Or, maybe we can use the following Commitment ordering that doesn't require the timestamp or any other
> >> information to be transferred among the cluster nodes.  However, this seems to have to track the order of read and write
> >> operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.  The
> >> MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult.)  Could
> >> anybody kindly interpret this?
> >>
> >> Commitment ordering (CO) - yoavraz2
> >> https://sites.google.com/site/yoavraz2/the_principle_of_co
> >>
> >>
> >> As for the Sawada-san's 2PC patch, which I find interesting purely as FDW enhancement, I raised the following
> >> issues to be addressed:
> >>
> >> 1. Make FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.)  I think
> >> oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA
> >> feature as SQL statements, not C functions as defined in the XA specification.
> >
> > I agree that we need to verify new FDW APIs will be suitable for other
> > FDWs than postgres_fdw as well.
> >
> >>
> >> 2. 2PC processing is queued and serialized in one background worker.  That severely subdues transaction
> >> throughput.  Each backend should perform 2PC.
> >
> > Not sure it's safe that each backend perform PREPARE and COMMIT
> > PREPARED since the current design is for not leading an inconsistency
> > between the actual transaction result and the result the user sees.
>
> Can I check my understanding about why the resolver process is necessary?
>
> Firstly, you think that issuing COMMIT PREPARED command to the foreign server can cause an error, for example,
> because of connection error, OOM, etc. On the other hand, only waiting for other process to issue the command is less
> likely to cause an error. Right?
>
> If an error occurs in backend process after commit record is WAL-logged, the error would be reported to the client
> and it may misunderstand that the transaction failed even though commit record was already flushed. So you think that
> each backend should not issue COMMIT PREPARED command to avoid that inconsistency. To avoid that, it's better to make
> other process, the resolver, issue the command and just make each backend wait for that to complete. Right?
>
> Also using the resolver process has another merit; when there are unresolved foreign transactions but the
> corresponding backend exits, the resolver can try to resolve them. If something like this automatic resolution is
> necessary, the process like the resolver would be necessary. Right?
>
> To the contrary, if we don't need such automatic resolution (i.e., unresolved foreign transactions always need to be
> resolved manually) and we can prevent the code to issue COMMIT PREPARED command from causing an error (not sure if
> that's possible, though...), probably we don't need the resolver process. Right?

Yes, I'm on the same page about all the above explanations.

The resolver process has two functionalities: resolving foreign
transactions automatically when the user issues COMMIT (the case you
described in the second paragraph), and resolving foreign transactions
when the corresponding backend no longer exists or when the server
crashes in the middle of 2PC (described in the third
paragraph).

Considering the design without the resolver process, I think we can
easily replace the latter with manual resolution. OTOH, it's not
easy for the former. I have no idea about a better design for now,
although, as you described, if we could ensure that the process
doesn't raise an error while resolving foreign transactions after
committing the local transaction, we would not need the resolver
process.

Or the second idea would be that the backend commits only the local
transaction then returns the acknowledgment of COMMIT to the user
without resolving foreign transactions. Then the user manually
resolves the foreign transactions by, for example, using the SQL
function pg_resolve_foreign_xact() within a separate transaction. That
way, even if an error occurred during resolving foreign transactions
(e.g., executing COMMIT PREPARED), it’s okay as the user is already
aware of the local transaction having been committed and can retry
resolving the unresolved foreign transactions. So we won't need the
resolver process while avoiding such inconsistency.

But a drawback would be that the transaction commit doesn't ensure
that all foreign transactions are completed. The subsequent
transactions would need to check if the previous distributed
transaction is completed to see its results. I’m not sure it’s a good
design in terms of usability.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 11 Sep 2020 at 18:24, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > > 2. 2PC processing is queued and serialized in one background worker.  That
> > severely subdues transaction throughput.  Each backend should perform
> > 2PC.
> >
> > Not sure it's safe that each backend perform PREPARE and COMMIT
> > PREPARED since the current design is for not leading an inconsistency
> > between the actual transaction result and the result the user sees.
>
> As Fujii-san is asking, I also would like to know what situation you think is not safe.  Are you worried that the
> FDW's commit function might call ereport(ERROR | FATAL | PANIC)?
 

Yes.

> If so, can't we stipulate that the FDW implementor should ensure that the commit function always returns control to
> the caller?
 

How can the FDW implementor ensure that? Since even palloc could call
ereport(ERROR), I guess it's hard to require that of all FDW
implementors.

>
>
> > But in the future, I think we can have multiple background workers per
> > database for better performance.
>
> Does the database in "per database" mean the local database (that applications connect to), or the remote database
> accessed via FDW?
 

I meant the local database. In the current patch, we launch the
resolver process per local database. My idea is to allow launching
multiple resolver processes for one local database as long as the
number of workers doesn't exceed the limit.

>
> I'm wondering how the FDW and background worker(s) can realize parallel prepare and parallel commit.  That is, the
> coordinator transaction performs:
 
>
> 1. Issue prepare to all participant nodes, but doesn't wait for the reply for each issue.
> 2. Waits for replies from all participants.
> 3. Issue commit to all participant nodes, but doesn't wait for the reply for each issue.
> 4. Waits for replies from all participants.
>
> If we just consider PostgreSQL and don't think about FDW, we can use libpq async functions -- PQsendQuery,
> PQconsumeInput, and PQgetResult.  pgbench uses them so that one thread can issue SQL statements on multiple connections
> in parallel.
 
>
> But when we consider the FDW interface, plus other DBMSs, how can we achieve the parallelism?

It's still a rough idea, but I think we can use the TMASYNC flag and
xa_complete explained in the XA specification. The core transaction
manager calls the prepare, commit, and rollback APIs with the flag,
requiring them to execute the operation asynchronously and to return a
handle (e.g., a socket taken by PQsocket in the postgres_fdw case) to
the transaction manager. Then the transaction manager continues polling
the handle until it becomes readable, testing the completion using
xa_complete() with no wait, until all foreign servers return OK on the
xa_complete check.

>
>
> > > 3. postgres_fdw cannot detect remote updates when the UDF executed on a
> > remote node updates data.
> >
> > I assume that you mean the pushing the UDF down to a foreign server.
> > If so, I think we can do this by improving postgres_fdw. In the current patch,
> > registering and unregistering a foreign server to a group of 2PC and marking a
> > foreign server as updated is FDW responsible. So perhaps if we had a way to
> > tell postgres_fdw that the UDF might update the data on the foreign server,
> > postgres_fdw could mark the foreign server as updated if the UDF is shippable.
>
> Maybe we can consider VOLATILE functions update data.  That may be overreaction, though.

Sorry I don't understand that. The volatile functions are not pushed
down to the foreign servers in the first place, no?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Fri, Sep 11, 2020 at 4:37 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> Considering the design without the resolver process, I think we can
> easily replace the latter with the manual resolution. OTOH, it's not
> easy for the former. I have no idea about better design for now,
> although, as you described, if we could ensure that the process
> doesn't raise an error during resolving foreign transactions after
> committing the local transaction we would not need the resolver
> process.

My initial patch used the same backend to resolve foreign
transactions. But in that case, even though the user receives COMMIT
completed, the backend isn't accepting the next query while it is busy
resolving the foreign transactions. That might be a usability issue
again if attempting to resolve all foreign transactions takes
noticeable time. If we go this route, we should try to resolve as many
foreign transactions as possible, ignoring any errors while doing so,
and somehow let the user know which transactions couldn't be resolved.
The user can then take responsibility for resolving those.

>
> Or the second idea would be that the backend commits only the local
> transaction then returns the acknowledgment of COMMIT to the user
> without resolving foreign transactions. Then the user manually
> resolves the foreign transactions by, for example, using the SQL
> function pg_resolve_foreign_xact() within a separate transaction. That
> way, even if an error occurred during resolving foreign transactions
> (i.g., executing COMMIT PREPARED), it’s okay as the user is already
> aware of the local transaction having been committed and can retry to
> resolve the unresolved foreign transaction. So we won't need the
> resolver process while avoiding such inconsistency.
>
> But a drawback would be that the transaction commit doesn't ensure
> that all foreign transactions are completed. The subsequent
> transactions would need to check if the previous distributed
> transaction is completed to see its results. I’m not sure it’s a good
> design in terms of usability.

I agree, this won't be acceptable.

In either case, I think a solution where the local server takes
responsibility to resolve foreign transactions will be better even in
the first cut.

--
Best Wishes,
Ashutosh Bapat



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > If so, can't we stipulate that the FDW implementor should ensure that the
> commit function always returns control to the caller?
> 
> How can the FDW implementor ensure that? Since even palloc could call
> ereport(ERROR) I guess it's hard to require that to all FDW
> implementors.

I think what the FDW commit routine will do is just call xa_commit(), or PQexec("COMMIT PREPARED") in postgres_fdw.


> It's still a rough idea but I think we can use TMASYNC flag and
> xa_complete explained in the XA specification. The core transaction
> manager call prepare, commit, rollback APIs with the flag, requiring
> to execute the operation asynchronously and to return a handler (e.g.,
> a socket taken by PQsocket in postgres_fdw case) to the transaction
> manager. Then the transaction manager continues polling the handler
> until it becomes readable and testing the completion using by
> xa_complete() with no wait, until all foreign servers return OK on
> xa_complete check.

Unfortunately, even Oracle and Db2 haven't supported XA asynchronous execution for years.  Our DBMS Symfoware doesn't,
either.  I don't expect other DBMSs to support it.
 

Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious scale-out DBMS.  If we define asynchronous
FDW routines for 2PC, postgres_fdw would be able to implement them by using libpq asynchronous functions.  But other
DBMSs can't ...
 


> > Maybe we can consider VOLATILE functions update data.  That may be
> overreaction, though.
> 
> Sorry I don't understand that. The volatile functions are not pushed
> down to the foreign servers in the first place, no?

Ah, you're right.  Then, the choices are twofold: (1) trust users in that their functions don't update data, or trust the
user's claim (specification) about it, and (2) get notification through the FE/BE protocol that the remote transaction may
have updated data.
 


Regards
Takayuki Tsunakawa


RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> The resolver process has two functionalities: resolving foreign
> transactions automatically when the user issues COMMIT (the case you
> described in the second paragraph), and resolving foreign transaction
> when the corresponding backend no longer exist or when the server
> crashes during in the middle of 2PC (described in the third
> paragraph).
> 
> Considering the design without the resolver process, I think we can
> easily replace the latter with the manual resolution. OTOH, it's not
> easy for the former. I have no idea about better design for now,
> although, as you described, if we could ensure that the process
> doesn't raise an error during resolving foreign transactions after
> committing the local transaction we would not need the resolver
> process.

Yeah, the resolver background process -- someone independent of client sessions -- is necessary, because the client
session disappears at some point.  When the server that hosts the 2PC coordinator crashes, there are no client sessions.  Our
DBMS Symfoware also runs background threads that take care of resolution of in-doubt transactions due to a server or
network failure.
 

Then, how does the resolver get involved in 2PC to enable parallel 2PC?  Two ideas quickly come to mind:

(1) Each client backend issues prepare and commit to multiple remote nodes asynchronously.
If the communication fails during commit, the client backend leaves the commit notification task to the resolver.
That is, the resolver lends a hand during failure recovery, and doesn't interfere with the transaction processing
during normal operation.
 

(2) The resolver takes some responsibility in 2PC processing during normal operation.
(send prepare and/or commit to remote nodes and get the results.)
To avoid serial execution per transaction, the resolver bundles multiple requests, sends them in bulk, and waits for
multiple replies at once.
 
This allows the coordinator to do its own prepare processing in parallel with those of participants.
However, in Postgres, this requires context switches between the client backend and the resolver.


Our Symfoware takes (2).  However, it doesn't suffer from the context switch, because the server is multi-threaded and
further implements or uses more lightweight entities than the thread.
 


> Or the second idea would be that the backend commits only the local
> transaction then returns the acknowledgment of COMMIT to the user
> without resolving foreign transactions. Then the user manually
> resolves the foreign transactions by, for example, using the SQL
> function pg_resolve_foreign_xact() within a separate transaction. That
> way, even if an error occurred during resolving foreign transactions
> (i.g., executing COMMIT PREPARED), it’s okay as the user is already
> aware of the local transaction having been committed and can retry to
> resolve the unresolved foreign transaction. So we won't need the
> resolver process while avoiding such inconsistency.
> 
> But a drawback would be that the transaction commit doesn't ensure
> that all foreign transactions are completed. The subsequent
> transactions would need to check if the previous distributed
> transaction is completed to see its results. I’m not sure it’s a good
> design in terms of usability.

I don't think it's a good design either, as you are worried.  I guess that's why Postgres-XL had to create a tool called
pgxc_clean and ask the user to resolve transactions with it.
 

pgxc_clean
https://www.postgres-xl.org/documentation/pgxcclean.html

"pgxc_clean is a Postgres-XL utility to maintain transaction status after a crash. When a Postgres-XL node crashes and
recovers or fails over, the commit status of the node may be inconsistent with other nodes. pgxc_clean checks
transaction commit status and corrects them."
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Michael Paquier
Date:
On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> Thank you for letting me know. I've attached the latest version patch set.

A rebase is needed again as the CF bot is complaining.
--
Michael

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 17 Sep 2020 at 14:25, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> > Thank you for letting me know. I've attached the latest version patch set.
>
> A rebase is needed again as the CF bot is complaining.

Thank you for letting me know. I'm updating the patch and splitting
into small pieces as Fujii-san suggested. I'll submit the latest patch
set early next week.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 16 Sep 2020 at 13:20, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > If so, can't we stipulate that the FDW implementor should ensure that the
> > commit function always returns control to the caller?
> >
> > How can the FDW implementor ensure that? Since even palloc could call
> > ereport(ERROR) I guess it's hard to require that to all FDW
> > implementors.
>
I think what the FDW commit routine will do is just call xa_commit(), or PQexec("COMMIT PREPARED") in
postgres_fdw.

Yes, but it still seems hard to me to require all FDW implementations to commit/rollback prepared transactions
without the possibility of ERROR.

>
>
> > It's still a rough idea but I think we can use TMASYNC flag and
> > xa_complete explained in the XA specification. The core transaction
> > manager call prepare, commit, rollback APIs with the flag, requiring
> > to execute the operation asynchronously and to return a handler (e.g.,
> > a socket taken by PQsocket in postgres_fdw case) to the transaction
> > manager. Then the transaction manager continues polling the handler
> > until it becomes readable and testing the completion using by
> > xa_complete() with no wait, until all foreign servers return OK on
> > xa_complete check.
>
> Unfortunately, even Oracle and Db2 don't support XA asynchronous execution for years.  Our DBMS Symfoware doesn't,
either. I don't expect other DBMSs support it. 
>
> Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious scale-out DBMS.  If we define
asynchronous FDW routines for 2PC, postgres_fdw would be able to implement them by using libpq asynchronous functions.
But other DBMSs can't ... 

I don't think all FDW implementations necessarily need to support
xa_complete(). We can support both synchronous and asynchronous
execution of prepare/commit/rollback.

>
>
> > > Maybe we can consider VOLATILE functions update data.  That may be
> > overreaction, though.
> >
> > Sorry I don't understand that. The volatile functions are not pushed
> > down to the foreign servers in the first place, no?
>
> Ah, you're right.  Then, the choices are twofold: (1) trust users in that their functions don't update data or the
user's claim (specification) about it, and (2) get notification through the FE/BE protocol that the remote transaction may
have updated data. 
>

I'm confused about which UDF case you're concerned about.
If you're concerned that executing a UDF with something like 'SELECT
myfunc();' updates data on a foreign server: since the UDF should know
which foreign server it modifies data on, it should be able to register
the foreign server and mark it as modified. Or are you concerned that a
UDF in a WHERE condition is pushed down and updates data (e.g.,
 ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> Yes, but it still seems hard to me that we require for all FDW
> implementations to commit/rollback prepared transactions without the
> possibility of ERROR.

Of course we can't eliminate the possibility of error, because remote servers require network communication.  What I'm
saying is to just require the FDW to return an error like xa_commit(), not throw control away with ereport(ERROR).  I
don't think it's too strict.
 


> I think it's not necessarily that all FDW implementations need to be
> able to support xa_complete(). We can support both synchronous and
> asynchronous executions of prepare/commit/rollback.

Yes, I think parallel prepare and commit can be an option for FDW.  But I don't think it's an option for a serious
scale-out DBMS.  If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (if not
implemented in the first version) how the parallelism can be realized.  That design is also necessary because it could
affect the FDW API.
 


> If you're concerned that executing a UDF function by like 'SELECT
> myfunc();' updates data on a foreign server, since the UDF should know
> which foreign server it modifies data on it should be able to register
> the foreign server and mark as modified. Or you’re concerned that a
> UDF function in WHERE condition is pushed down and updates data (e.g.,
>  ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?

What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;"  Does the UDF call get pushed down to the
foreign server in this case?  If not now, could it be pushed down in the future?  If it could be, it's worth considering
how to detect the remote update now.
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Tue, Sep 22, 2020 at 6:48 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
>
> > I think it's not necessarily that all FDW implementations need to be
> > able to support xa_complete(). We can support both synchronous and
> > asynchronous executions of prepare/commit/rollback.
>
> Yes, I think parallel prepare and commit can be an option for FDW.  But I don't think it's an option for a serious
scale-out DBMS.  If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (if not
implemented in the first version) how the parallelism can be realized.  That design is also necessary because it could
affect the FDW API. 

Parallelism here has both pros and cons. If one of the servers errors
out while preparing a transaction, there is no point in preparing
the transaction on the other servers. In parallel execution we will
prepare on multiple servers before realising that one of them has
failed to do so. On the other hand, preparing on multiple servers in
parallel provides a speedup.

But this can be an improvement on version 1. The current approach
doesn't render such an improvement impossible. So if that's something
hard to do, we should do that in the next version rather than
complicating this patch.

--
Best Wishes,
Ashutosh Bapat



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> parallelism here has both pros and cons. If one of the servers errors
> out while preparing for a transaction, there is no point in preparing
> the transaction on other servers. In parallel execution we will
> prepare on multiple servers before realising that one of them has
> failed to do so. On the other hand preparing on multiple servers in
> parallel provides a speed up.

And the pros are dominant in practice.  If many transactions are erroring out (during prepare), the system is not
functioning for the user.  Such an application should be corrected before it is put into production.
 


> But this can be an improvement on version 1. The current approach
> doesn't render such an improvement impossible. So if that's something
> hard to do, we should do that in the next version rather than
> complicating this patch.

Could you share your idea on how the current approach could enable parallelism?  This is an important point, because
(1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation
could be considered for non-parallel interaction if we have the realization of parallelism in mind.  I think that kind
of consideration is the design (for the future).
 


Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Wed, Sep 23, 2020 at 2:13 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > parallelism here has both pros and cons. If one of the servers errors
> > out while preparing for a transaction, there is no point in preparing
> > the transaction on other servers. In parallel execution we will
> > prepare on multiple servers before realising that one of them has
> > failed to do so. On the other hand preparing on multiple servers in
> > parallel provides a speed up.
>
> And the pros are dominant in practice.  If many transactions are erroring out (during prepare), the system is not
functioning for the user.  Such an application should be corrected before it is put into production. 
>
>
> > But this can be an improvement on version 1. The current approach
> > doesn't render such an improvement impossible. So if that's something
> > hard to do, we should do that in the next version rather than
> > complicating this patch.
>
> Could you share your idea on how the current approach could enable parallelism?  This is an important point, because
(1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation
could be considered for non-parallel interaction if we have the realization of parallelism in mind.  I think that kind
of consideration is the design (for the future). 
>

The way I am looking at it is to put the parallelism in the resolution
worker and not in the FDW. If we use multiple resolution workers, they
can fire commit/abort on multiple foreign servers at a time.
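The worker-based option can be sketched as a small simulation. This is an illustrative Python model of the dispatch pattern only (the shared-queue design and all names are assumptions, not the patch's actual resolver implementation):

```python
import queue
import threading

def resolver_worker(tasks, results):
    """One resolution worker: drain the shared list of unresolved foreign
    transactions, issuing one (synchronous) COMMIT PREPARED per entry."""
    while True:
        try:
            server, gid = tasks.get_nowait()
        except queue.Empty:
            return
        # Stand-in for the synchronous FDW commit-prepared routine.
        results.append((server, gid, "committed"))

def resolve_in_parallel(fx_list, n_workers=2):
    tasks = queue.Queue()
    for item in fx_list:
        tasks.put(item)
    results = []          # list.append is atomic in CPython
    workers = [threading.Thread(target=resolver_worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(results)

fx = [("fs1", "gid1"), ("fs2", "gid2"), ("fs3", "gid3")]
print(resolve_in_parallel(fx))
```

With n_workers > 1, commits to different foreign servers can be in flight simultaneously even though each worker's FDW call is synchronous, which is exactly the trade-off being discussed.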

But if we want parallelism within a single resolution worker, we will
need separate FDW APIs for firing asynchronous commit/abort of a
prepared txn and fetching their results, respectively. But given the
variety of FDWs, not all of them will support an asynchronous API, so
we have to support a synchronous API anyway, which is what can be
targeted in the first version.

Thinking more about it, the core may support an API which accepts a
list of prepared transactions, their foreign servers and user mappings
and lets the FDW resolve all of those either in parallel or one by one.
So parallelism is the responsibility of the FDW and not the core. But
then we lose parallelism across FDWs, which may not be a common case.

Given the complications around this, I think we should go ahead
supporting synchronous API first and in second version introduce
optional asynchronous API.

--
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 22 Sep 2020 at 10:17, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > Yes, but it still seems hard to me that we require for all FDW
> > implementations to commit/rollback prepared transactions without the
> > possibility of ERROR.
>
> Of course we can't eliminate the possibility of error, because remote servers require network communication.  What
I'm saying is to just require the FDW to return error like xa_commit(), not throwing control away with ereport(ERROR).
I don't think it's too strict. 

So with your idea, I think we require FDW developers not to call
ereport(ERROR) as much as possible. If they need to use a function
such as palloc or lappend that could call ereport(ERROR), they
need to use PG_TRY() and PG_CATCH() and return control along with
the error message to the transaction manager rather than raising an
error. Then the transaction manager will emit the error message at an
error level lower than ERROR (e.g., WARNING), and call the
commit/rollback API again. But normally we do some cleanup on error,
whereas in this case the retried commit/rollback is performed without
any cleanup. Is that right? I’m not sure it’s safe, though.
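The contract under discussion (catch internal errors, return a status instead of throwing, let the manager warn and retry) can be made concrete with a toy simulation. Python's try/except stands in for PG_TRY/PG_CATCH here; every name is hypothetical, and this is a sketch of the control flow, not of any real FDW code:

```python
def commit_foreign(server, fail_times):
    """FDW commit routine under the proposed contract: an internal error
    (palloc, lappend, a lost connection, ...) is caught -- the
    PG_TRY/PG_CATCH part -- and turned into a status return instead of
    propagating as ereport(ERROR) would."""
    try:
        if fail_times[server] > 0:
            fail_times[server] -= 1
            raise RuntimeError("connection lost")
        return (True, None)
    except RuntimeError as e:
        return (False, str(e))

def resolve(server, fail_times, max_retries=5):
    """Transaction manager side: on failure, emit a WARNING-level
    message and call the commit API again, as described above."""
    warnings = []
    for _ in range(max_retries):
        ok, msg = commit_foreign(server, fail_times)
        if ok:
            return warnings
        warnings.append(f"WARNING: commit of {server} failed: {msg}")
    raise RuntimeError(f"giving up on {server}")

print(resolve("fs1", {"fs1": 2}))   # two warnings, then success on the 3rd try
```

What the simulation deliberately leaves out is Sawada-san's open question: the retry happens with no cleanup between attempts, which is exactly the part whose safety is in doubt.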

>
>
> > I think it's not necessarily that all FDW implementations need to be
> > able to support xa_complete(). We can support both synchronous and
> > asynchronous executions of prepare/commit/rollback.
>
> Yes, I think parallel prepare and commit can be an option for FDW.  But I don't think it's an option for a serious
scale-out DBMS.  If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (if not
implemented in the first version) how the parallelism can be realized.  That design is also necessary because it could
affect the FDW API. 
>
>
> > If you're concerned that executing a UDF function by like 'SELECT
> > myfunc();' updates data on a foreign server, since the UDF should know
> > which foreign server it modifies data on it should be able to register
> > the foreign server and mark as modified. Or you’re concerned that a
> > UDF function in WHERE condition is pushed down and updates data (e.g.,
> >  ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?
>
> What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;"  Does the UDF call get pushed down to the
foreign server in this case?  If not now, could it be pushed down in the future?  If it could be, it's worth considering
how to detect the remote update now. 

IIUC aggregate functions can be pushed down to the foreign server,
but I have no idea whether a normal UDF in the select list is pushed
down. I suspect it isn't.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> So with your idea, I think we require FDW developers to not call
> ereport(ERROR) as much as possible. If they need to use a function
> including palloc, lappend etc that could call ereport(ERROR), they
> need to use PG_TRY() and PG_CATCH() and return the control along with
> the error message to the transaction manager rather than raising an
> error. Then the transaction manager will emit the error message at an
> error level lower than ERROR (e.g., WARNING), and call commit/rollback
> API again. But normally we do some cleanup on error but in this case
> the retrying commit/rollback is performed without any cleanup. Is that
> right? I’m not sure it’s safe though.


Yes.  It's legitimate to require the FDW commit routine to return control, because the prepare of 2PC is a promise to
commit successfully.  The second-phase commit should avoid doing anything that could fail.  For example, if some memory is
needed for commit, it should be allocated during prepare or before.
 


> IIUC aggregation functions can be pushed down to the foreign server
> but I have not idea the normal UDF in the select list is pushed down.
> I wonder if it isn't.

Oh, that's the current situation.  Understood.  I thought the UDF call was also pushed down, as I saw Greenplum do so.
(Reading the manual, Greenplum disallows data updates in a UDF when it's executed on the remote segment server.)
 

(Aren't we overlooking something else that updates data on the remote server while the local server is unaware?)


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 18 Sep 2020 at 17:00, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 17 Sep 2020 at 14:25, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> > > Thank you for letting me know. I've attached the latest version patch set.
> >
> > A rebase is needed again as the CF bot is complaining.
>
> Thank you for letting me know. I'm updating the patch and splitting
> into small pieces as Fujii-san suggested. I'll submit the latest patch
> set early next week.
>

I've rebased the patch set and split into small pieces. Here are short
descriptions of each change:

v26-0001-Recreate-RemoveForeignServerById.patch

This commit recreates RemoveForeignServerById that was removed by
b1d32d3e3. This is necessary because we need to check if there is a
foreign transaction involved with the foreign server that is about to
be removed.

v26-0002-Introduce-transaction-manager-for-foreign-transa.patch

This commit adds the basic foreign transaction manager,
CommitForeignTransaction, and RollbackForeignTransaction API. These
APIs support only one-phase. With this change, FDW is able to control
its transaction using the foreign transaction manager, not using
XactCallback.

v26-0003-postgres_fdw-supports-commit-and-rollback-APIs.patch

This commit implements both the CommitForeignTransaction and
RollbackForeignTransaction APIs in postgres_fdw. Note that since
PREPARE TRANSACTION is still not supported, there is nothing new the
user is able to do.

v26-0004-Add-PrepareForeignTransaction-API.patch

This commit adds prepared foreign transaction support including WAL
logging and recovery, and PrepareForeignTransaction API. With this
change, the user is able to do 'PREPARE TRANSACTION' and
'COMMIT/ROLLBACK PREPARED' commands on the transaction that involves
foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the
local transaction. It doesn't do anything for foreign transactions.
Therefore, the user needs to resolve foreign transactions manually by
executing the pg_resolve_foreign_xacts() SQL function which is also
introduced by this commit.

v26-0005-postgres_fdw-supports-prepare-API-and-support-co.patch

This commit implements the PrepareForeignTransaction API and makes
CommitForeignTransaction and RollbackForeignTransaction support
two-phase commit.

v26-0006-Add-GetPrepareID-API.patch

This commit adds GetPrepareID API.

v26-0007-Automatic-foreign-transaciton-resolution-on-COMM.patch

This commit adds the automatic foreign transaction resolution on
COMMIT/ROLLBACK PREPARED by using foreign transaction resolver and
launcher processes. With this change, the user is able to
commit/rollback the distributed transaction by COMMIT/ROLLBACK
PREPARED without manual resolution. The involved foreign transactions
are automatically resolved by a resolver process.

v26-0008-Automatic-foreign-transaciton-resolution-on-comm.patch

This commit adds the automatic foreign transaction resolution on
commit/rollback. With this change, the user is able to commit the
foreign transactions automatically on commit without executing PREPARE
TRANSACTION when foreign_twophase_commit is 'required'. IOW, we can
guarantee that all foreign transactions have been resolved by the time
the user gets an acknowledgment of COMMIT.

v26-0009-postgres_fdw-supports-automatically-resolution.patch

This commit makes postgres_fdw supports the 0008 change.

v26-0010-Documentation-update.patch
v26-0011-Add-regression-tests-for-foreign-twophase-commit.patch

The above commits are documentation update and regression tests.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> The way I am looking at is to put the parallelism in the resolution
> worker and not in the FDW. If we use multiple resolution workers, they
> can fire commit/abort on multiple foreign servers at a time.

From a single session's view, yes.  However, the requests from multiple sessions are processed one at a time within
each resolver, because the resolver has to call the synchronous FDW prepare/commit routines and wait for the response
from the remote server.  That's too limiting.
 


> But if we want parallelism within a single resolution worker, we will
> need a separate FDW APIs for firing asynchronous commit/abort prepared
> txn and fetching their results resp. But given the variety of FDWs,
> not all of them will support asynchronous API, so we have to support
> synchronous API anyway, which is what can be targeted in the first
> version.

I agree that most FDWs will be unlikely to have asynchronous prepare/commit functions, as demonstrated by the fact
that even Oracle and Db2 don't implement the XA asynchronous APIs.  That's one problem of using FDW for Postgres scale-out.
When we enhance FDW, we have to take care of other DBMSs to make the FDW interface practical.  OTOH, we want to make
maximum use of Postgres features, such as the libpq asynchronous API, to make Postgres scale-out as performant as possible.
But the scale-out design is bound by the FDW interface.  I don't feel accepting such a less performant design fits the
attitude of this community, as people here are strict against even a 1 or 2 percent performance drop.
 


> Thinking more about it, the core may support an API which accepts a
> list of prepared transactions, their foreign servers and user mappings
> and let FDW resolve all those either in parallel or one by one. So
> parallelism is responsibility of FDW and not the core. But then we
> loose parallelism across FDWs, which may not be a common case.

Hmm, I understand asynchronous FDW relation scan is being developed now, in the form of cooperation between the FDW and
the executor.  If we make just the FDW responsible for prepare/commit parallelism, the design becomes asymmetric.  As
you say, I'm not sure if parallelism is wanted among different types, say, Postgres and Oracle.  In fact, major
DBMSs don't implement the XA asynchronous API.  But such lack of parallelism may be one cause of the bad reputation that 2PC
(of XA) is slow.
 


> Given the complications around this, I think we should go ahead
> supporting synchronous API first and in second version introduce
> optional asynchronous API.

How about the following?

* Add synchronous and asynchronous versions of the prepare/commit/abort routines and a routine to wait for completion of
asynchronous execution in FdwRoutine.  They are optional.
 
postgres_fdw can implement the asynchronous routines using libpq asynchronous functions.  Other DBMSs can implement the XA
asynchronous API for them in theory.
 

* The client backend uses asynchronous FDW routines if available:

/* Issue asynchronous prepare | commit | rollback to FDWs that support it */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func)
        fdwroutine->{prepare | commit | rollback}_async_func(...);
}

/* Wait for completion of asynchronous prepare | commit | rollback */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func)
        ret = fdwroutine->wait_for_completion(...);
}

/* Issue synchronous prepare | commit | rollback to FDWs that don't support it */
foreach (per each foreign server used in the transaction)
{
    if (fdwroutine->{prepare | commit | rollback}_async_func == NULL)
        ret = fdwroutine->{prepare | commit | rollback}_func(...);
}
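The three loops above can be rendered as a runnable sketch. This is a Python mock-up of the proposed control flow only; FdwRoutine is modeled as a plain object, and none of these names are the actual FdwRoutine struct fields:

```python
class FdwRoutine:
    """Mock of an FDW's routine table: async-capable FDWs fill in
    prepare_async_func and wait_for_completion; others leave the
    async slot as None and provide only the synchronous function."""
    def __init__(self, name, has_async):
        self.name = name
        self.prepare_async_func = self._start_prepare if has_async else None

    def _start_prepare(self, log):
        log.append(f"async prepare -> {self.name}")   # fire, don't wait

    def wait_for_completion(self, log):
        log.append(f"async done   <- {self.name}")
        return True

    def prepare_func(self, log):
        log.append(f"sync prepare  -> {self.name}")   # blocking round trip
        return True

def prepare_transaction(routines):
    log = []
    # Issue asynchronous prepare to FDWs that support it
    for r in routines:
        if r.prepare_async_func:
            r.prepare_async_func(log)
    # Wait for completion of the asynchronous prepares
    for r in routines:
        if r.prepare_async_func:
            r.wait_for_completion(log)
    # Issue synchronous prepare to FDWs that don't support async
    for r in routines:
        if r.prepare_async_func is None:
            r.prepare_func(log)
    return log

routines = [FdwRoutine("postgres_fdw", True), FdwRoutine("other_fdw", False)]
for line in prepare_transaction(routines):
    print(line)
```

The ordering shows the design choice: all async-capable servers overlap their round trips, and only the synchronous stragglers serialize at the end.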

* The client backend asks the resolver to commit or rollback the remote transaction only when the remote transaction
fails (due to the failure of the remote server or network).  That is, the resolver is not involved during normal operation.
 


This will not be complex, and can be included in the first version, if we really want to use FDW for Postgres
scale-out.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 24 Sep 2020 at 17:23, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > So with your idea, I think we require FDW developers to not call
> > ereport(ERROR) as much as possible. If they need to use a function
> > including palloc, lappend etc that could call ereport(ERROR), they
> > need to use PG_TRY() and PG_CATCH() and return the control along with
> > the error message to the transaction manager rather than raising an
> > error. Then the transaction manager will emit the error message at an
> > error level lower than ERROR (e.g., WARNING), and call commit/rollback
> > API again. But normally we do some cleanup on error but in this case
> > the retrying commit/rollback is performed without any cleanup. Is that
> > right? I’m not sure it’s safe though.
>
>
> Yes.  It's legitimate to require the FDW commit routine to return control, because the prepare of 2PC is a promise to
commit successfully.  The second-phase commit should avoid doing anything that could fail.  For example, if some memory is
needed for commit, it should be allocated during prepare or before. 
>

I don't think it's always possible to avoid raising errors in advance.
Considering how postgres_fdw could implement your idea, I think
postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
management. It has a connection cache in local memory using HTAB.
It needs to create an entry the first time it connects (e.g., when
prepare and commit prepared of a transaction are performed by different
processes), and it needs to re-connect to the foreign server when the
entry is invalidated. In both cases, ERROR could happen. I guess the
same is true for other FDW implementations. Possibly other FDWs might
need more work, for example cleanup or releasing resources. I think
the pros of your idea are making the transaction manager simple,
since we don't need resolvers and a launcher, but the cons are bringing
the complexity to the FDW implementation code instead. Also, IMHO it's
not safe for an FDW to neither re-throw an error nor abort the
transaction when an error occurs.

Regarding the performance you're concerned about, I wonder if we can
somewhat eliminate the bottleneck if multiple resolvers are able to run
on one database in the future. For example, if we could launch as many
resolver processes as connections on the database, each backend
process could have its own resolver process. Since there would be
contention and inter-process communication it would still bring some
overhead, but it might be negligible compared to a network round trip.

Perhaps we can hear more opinions on that from other hackers to decide
the FDW transaction API design.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I don't think it's always possible to avoid raising errors in advance.
> Considering how postgres_fdw can implement your idea, I think
> postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
> management. It has a connection cache in the local memory using HTAB.
> It needs to create an entry for the first time to connect (e.g., when
> prepare and commit prepared a transaction are performed by different
> processes) and it needs to re-connect the foreign server when the
> entry is invalidated. In both cases, ERROR could happen. I guess the
> same is true for other FDW implementations. Possibly other FDWs might
> need more work for example cleanup or releasing resources. I think

Why does the client backend have to create a new connection cache entry during PREPARE or COMMIT PREPARED?  Doesn't the
client backend naturally continue to use the connections that it has used in its current transaction?
 


> that the pros of your idea are to make the transaction manager simple
> since we don't need resolvers and launcher but the cons are to bring
> the complexity to FDW implementation codes instead. Also, IMHO I don't
> think it's safe way that FDW does neither re-throwing an error nor
> abort transaction when an error occurs.

No, I didn't say the resolver is unnecessary.  The resolver takes care of terminating remote transactions when the
client backend encountered an error during COMMIT/ROLLBACK PREPARED.


> In terms of performance you're concerned, I wonder if we can somewhat
> eliminate the bottleneck if multiple resolvers are able to run on one
> database in the future. For example, if we could launch resolver
> processes as many as connections on the database, individual backend
> processes could have one resolver process. Since there would be
> contention and inter-process communication it still brings some
> overhead but it might be negligible comparing to network round trip.

Do you mean that if concurrent 200 clients each update data on two foreign servers, there are 400 resolvers?  ...That's
overuse of resources.


Regards
Takayuki Tsunakawa

    

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > I don't think it's always possible to avoid raising errors in advance.
> > Considering how postgres_fdw can implement your idea, I think
> > postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
> > management. It has a connection cache in the local memory using HTAB.
> > It needs to create an entry for the first time to connect (e.g., when
> > prepare and commit prepared a transaction are performed by different
> > processes) and it needs to re-connect the foreign server when the
> > entry is invalidated. In both cases, ERROR could happen. I guess the
> > same is true for other FDW implementations. Possibly other FDWs might
> > need more work for example cleanup or releasing resources. I think
>
> Why does the client backend have to create a new connection cache entry during PREPARE or COMMIT PREPARE?  Doesn't
> the client backend naturally continue to use connections that it has used in its current transaction?

I think there are two cases: one where a process executes PREPARE
TRANSACTION and another process executes COMMIT PREPARED later, and one
where the coordinator has cascaded foreign servers (i.e., a foreign
server that has its own foreign server) and a temporary connection
problem happens on the intermediate node after PREPARE; then another
process on the intermediate node will execute COMMIT PREPARED on its
foreign server.
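For illustration, here is a minimal sketch of the kind of per-backend connection cache at issue (postgres_fdw keeps a similar cache in an HTAB; the names and the fixed-size array here are hypothetical, not the actual postgres_fdw code). The point is that the process that runs COMMIT PREPARED may be a different backend with an empty cache, so entry creation, which can fail, happens on the commit path:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical miniature of a per-backend FDW connection cache.
 * postgres_fdw keeps a similar cache in an HTAB; this sketch uses a
 * fixed array purely for illustration. */
#define MAX_CONNS 8

typedef struct ConnCacheEntry {
    char server[32];   /* foreign server name (cache key) */
    bool connected;    /* do we hold a live connection? */
} ConnCacheEntry;

static ConnCacheEntry cache[MAX_CONNS];
static int ncached = 0;

/* Look up a cache entry, creating one on first use.  In the real FDW,
 * both entry creation and (re)connection can raise ERROR, which is the
 * problem case when it happens during COMMIT PREPARED. */
static ConnCacheEntry *get_connection(const char *server, bool *created)
{
    *created = false;
    for (int i = 0; i < ncached; i++)
        if (strcmp(cache[i].server, server) == 0)
            return &cache[i];

    if (ncached == MAX_CONNS)
        return NULL;               /* would be ereport(ERROR) in reality */

    ConnCacheEntry *e = &cache[ncached++];
    strncpy(e->server, server, sizeof(e->server) - 1);
    e->server[sizeof(e->server) - 1] = '\0';
    e->connected = true;
    *created = true;
    return e;
}
```

A backend that already prepared the transaction reuses its entry, while a fresh backend issuing COMMIT PREPARED hits the creation path.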

>
>
> > that the pros of your idea are to make the transaction manager simple
> > since we don't need resolvers and launcher but the cons are to bring
> > the complexity to FDW implementation codes instead. Also, IMHO I don't
> > think it's safe way that FDW does neither re-throwing an error nor
> > abort transaction when an error occurs.
>
> No, I didn't say the resolver is unnecessary.  The resolver takes care of terminating remote transactions when the
> client backend encountered an error during COMMIT/ROLLBACK PREPARED.

Understood. With your idea, we can at least remove the code that makes
backends wait and the inter-process communication between backends and
resolvers.

I think we need to consider whether it's really safe and what is needed
to achieve your idea safely.

>
>
> > In terms of performance you're concerned, I wonder if we can somewhat
> > eliminate the bottleneck if multiple resolvers are able to run on one
> > database in the future. For example, if we could launch resolver
> > processes as many as connections on the database, individual backend
> > processes could have one resolver process. Since there would be
> > contention and inter-process communication it still brings some
> > overhead but it might be negligible comparing to network round trip.
>
> Do you mean that if concurrent 200 clients each update data on two foreign servers, there are 400 resolvers?
> ...That's overuse of resources.

I think we would have 200 resolvers in this case, since there is one
resolver process per backend process. Another idea is that all
processes queue foreign transactions to resolve into a shared memory
queue, and resolver processes fetch and resolve them, instead of
assigning one distributed transaction to one resolver process. Using
asynchronous execution, a resolver process can process a bunch of
foreign transactions across distributed transactions, grouped by
foreign server, at once. It might be more complex than the current
approach, but having multiple resolver processes on one database would
increase throughput well, especially combined with asynchronous
execution.
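As a rough illustration of the shared-queue idea (all names are hypothetical; a real implementation would live in shared memory with locking), backends enqueue prepared foreign transactions and a resolver batches them by foreign server:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the queue idea: backends enqueue prepared
 * foreign transactions, and a resolver drains the queue grouped by
 * foreign server so that one round trip can resolve a whole batch. */
#define QLEN 16

typedef struct FdwXactItem {
    char server[16];  /* foreign server the prepared xact lives on */
    int  xid;         /* originating distributed transaction */
} FdwXactItem;

static FdwXactItem queue[QLEN];
static int qcount = 0;

static void enqueue(const char *server, int xid)
{
    if (qcount >= QLEN)
        return;                       /* queue full; real code would wait */
    snprintf(queue[qcount].server, sizeof(queue[qcount].server), "%s", server);
    queue[qcount].xid = xid;
    qcount++;
}

/* Collect every queued transaction for one server into out[];
 * returns how many were batched together. */
static int drain_for_server(const char *server, int *out, int outlen)
{
    int n = 0;
    for (int i = 0; i < qcount && n < outlen; i++)
        if (strcmp(queue[i].server, server) == 0)
            out[n++] = queue[i].xid;
    return n;
}
```

Note how transactions from different distributed transactions end up in the same batch for one server, which is what makes the asynchronous, grouped resolution attractive.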

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > Why does the client backend have to create a new connection cache entry
> during PREPARE or COMMIT PREPARE?  Doesn't the client backend naturally
> continue to use connections that it has used in its current transaction?
> 
> I think there are two cases: a process executes PREPARE TRANSACTION
> and another process executes COMMIT PREPARED later, and if the
> coordinator has cascaded foreign servers (i.g., a foreign server has
> its foreign server) and temporary connection problem happens in the
> intermediate node after PREPARE then another process on the
> intermediate node will execute COMMIT PREPARED on its foreign server.

Aren't both cases failure cases, and thus handled by the resolver?


> > > In terms of performance you're concerned, I wonder if we can somewhat
> > > eliminate the bottleneck if multiple resolvers are able to run on one
> > > database in the future. For example, if we could launch resolver
> > > processes as many as connections on the database, individual backend
> > > processes could have one resolver process. Since there would be
> > > contention and inter-process communication it still brings some
> > > overhead but it might be negligible comparing to network round trip.
> >
> > Do you mean that if concurrent 200 clients each update data on two foreign
> servers, there are 400 resolvers?  ...That's overuse of resources.
> 
> I think we have 200 resolvers in this case since one resolver process
> per backend process.

That does not parallelize prepare or commit for a single client, as each resolver can process only one prepare or
commit synchronously at a time.  Not to mention the resource usage is high.


> Or another idea is that all processes queue
> foreign transactions to resolve into the shared memory queue and
> resolver processes fetch and resolve them instead of assigning one
> distributed transaction to one resolver process. Using asynchronous
> execution, the resolver process can process a bunch of foreign
> transactions across distributed transactions and grouped by the
> foreign server at once. It might be more complex than the current
> approach but having multiple resolver processes on one database would
> increase through-put well especially by combining with asynchronous
> execution.

Yeah, that sounds complex.  It's simpler and more natural for each client backend to use the connections it has used in its
current transaction and issue prepare and commit to the foreign servers, with the resolver just taking care of failed
commits and aborts behind the scenes.  That's like how the walwriter takes care of writing WAL on behalf of a client
backend that commits asynchronously.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 28 Sep 2020 at 13:58, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > > Why does the client backend have to create a new connection cache entry
> > during PREPARE or COMMIT PREPARE?  Doesn't the client backend naturally
> > continue to use connections that it has used in its current transaction?
> >
> > I think there are two cases: a process executes PREPARE TRANSACTION
> > and another process executes COMMIT PREPARED later, and if the
> > coordinator has cascaded foreign servers (i.g., a foreign server has
> > its foreign server) and temporary connection problem happens in the
> > intermediate node after PREPARE then another process on the
> > intermediate node will execute COMMIT PREPARED on its foreign server.
>
> Aren't both the cases failure cases, and thus handled by the resolver?

No. Please imagine a case where a user executes PREPARE TRANSACTION on
a transaction that modified data on foreign servers. The backend
process prepares both the local transaction and the foreign transactions.
But another client can execute COMMIT PREPARED on the prepared
transaction. In this case, another backend newly connects to the foreign
servers and commits the prepared foreign transactions. Therefore, a new
connection cache entry can be created during COMMIT PREPARED, which
could lead to an error, but since the local prepared transaction is
already committed, the backend must not fail with an error.

In the latter case, I assume that the backend continues to retry
foreign transaction resolution until the user requests cancellation.
Please imagine the case where server-A connects to a foreign server
(say, server-B) and server-B connects to another foreign server (say,
server-C). The transaction initiated on server-A modified the data on
both the local server and server-B, which further modified the data on
server-C, and executed COMMIT.  The backend process on server-A (say,
backend-A) sends PREPARE TRANSACTION to server-B, then the backend
process on server-B (say, backend-B) connected to by backend-A prepares
the local transaction and further sends PREPARE TRANSACTION to
server-C. Let's suppose a temporary connection failure happens between
server-A and server-B before backend-A sends COMMIT PREPARED (i.e., the
2nd phase of 2PC). When backend-A attempts to send COMMIT PREPARED to
server-B, it realizes that the connection to server-B was lost, but
since the user hasn't requested cancellation yet, backend-A retries
connecting to server-B and succeeds. Now that backend-A has
established a new connection to server-B, there is another backend
process on server-B (say, backend-B'). Since backend-B' doesn't
have a connection to server-C yet, it creates a new connection cache
entry, which could lead to an error.  IOW, on server-B different
processes performed PREPARE TRANSACTION and COMMIT PREPARED, and the
latter process created a connection cache entry.
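The retry behavior described above might be sketched like this (purely a simulation: the fake connection fails once, standing in for the temporary connection loss; real code would use libpq and check for user cancellation requests):

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated remote that fails once, mimicking a temporary connection
 * loss between PREPARE and COMMIT PREPARED.  Everything here is a
 * stand-in: real code would use libpq and CHECK_FOR_INTERRUPTS(). */
static int failures_left = 1;

static bool send_commit_prepared(void)
{
    if (failures_left > 0) {
        failures_left--;
        return false;              /* connection to server-B was lost */
    }
    return true;                   /* COMMIT PREPARED succeeded */
}

/* Retry until success or cancellation; returns the number of attempts
 * taken, or -1 if the user cancelled first. */
static int resolve_with_retry(int max_attempts_before_cancel)
{
    for (int attempt = 1; attempt <= max_attempts_before_cancel; attempt++) {
        if (send_commit_prepared())
            return attempt;
        /* reconnect here: on server-B this reconnection spawns a fresh
         * backend, which must rebuild its own connection cache entry */
    }
    return -1;
}
```

The reconnection step in the loop is exactly where a fresh backend on the intermediate node ends up creating a new connection cache entry.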

>
>
> > > > In terms of performance you're concerned, I wonder if we can somewhat
> > > > eliminate the bottleneck if multiple resolvers are able to run on one
> > > > database in the future. For example, if we could launch resolver
> > > > processes as many as connections on the database, individual backend
> > > > processes could have one resolver process. Since there would be
> > > > contention and inter-process communication it still brings some
> > > > overhead but it might be negligible comparing to network round trip.
> > >
> > > Do you mean that if concurrent 200 clients each update data on two foreign
> > servers, there are 400 resolvers?  ...That's overuse of resources.
> >
> > I think we have 200 resolvers in this case since one resolver process
> > per backend process.
>
> That does not parallelize prepare or commit for a single client, as each resolver can process only one prepare or
commitsynchronously at a time.  Not to mention the resource usage is high. 

Well, I think we should discuss parallel (and/or asynchronous)
execution of prepare and commit separately from the discussion on
whether the resolver process is responsible for the 2nd phase of 2PC.
I've been suggesting that the first phase and the second phase of 2PC
should be performed by different processes in terms of safety. And
having multiple resolvers on one database is my suggestion in response
to the concern you raised that one resolver process on one database
can be a bottleneck. Both parallel execution and asynchronous execution
are slightly related to this topic, but I think they should be
discussed separately.

Regarding parallel and asynchronous execution, I basically agree on
supporting asynchronous execution, as the XA specification also has it,
although I think it's better not to include it in the first version
for simplicity.

Overall, my suggestion for the first version is to support synchronous
execution of prepare, commit, and rollback, have one resolver process
per database, and have the resolver take care of the 2nd phase of 2PC.
As the next step we can add APIs for asynchronous execution, have
multiple resolvers on one database, and so on.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> No. Please imagine a case where a user executes PREPARE TRANSACTION on
> the transaction that modified data on foreign servers. The backend
> process prepares both the local transaction and foreign transactions.
> But another client can execute COMMIT PREPARED on the prepared
> transaction. In this case, another backend newly connects foreign
> servers and commits prepared foreign transactions. Therefore, the new
> connection cache entry can be created during COMMIT PREPARED which
> could lead to an error but since the local prepared transaction is
> already committed the backend must not fail with an error.
> 
> In the latter case, I’m assumed that the backend continues to retry
> foreign transaction resolution until the user requests cancellation.
> Please imagine the case where the server-A connects a foreign server
> (say, server-B) and server-B connects another foreign server (say,
> server-C). The transaction initiated on server-A modified the data on
> both local and server-B which further modified the data on server-C
> and executed COMMIT.  The backend process on server-A (say, backend-A)
> sends PREPARE TRANSACTION to server-B then the backend process  on
> server-B (say, backend-B) connected by backend-A prepares the local
> transaction and further sends PREPARE TRANSACTION to server-C. Let’s
> suppose a temporary connection failure happens between server-A and
> server-B before the backend-A sending COMMIT PREPARED (i.g, 2nd phase
> of 2PC). When the backend-A attempts to sends COMMIT PREPARED to
> server-B it realizes that the connection to server-B was lost but
> since the user doesn’t request cancellatino yet the backend-A retries
> to connect server-B and suceeds. Since now that the backend-A
> established a new connection to server-B, there is another backend
> process on server-B (say, backend-B’). Since the backend-B’ doen’t
> have a connection to server-C yet, it creates new connection cache
> entry, which could lead to an error.  IOW, on server-B different
> processes performed PREPARE TRANSACTION and COMMIT PREPARED and
> the
> later process created a connection cache entry.

Thank you, I understood the situation.  I don't think it's a good design to not address practical performance during
normal operation by fearing the rare error case.

The transaction manager (TM) or the FDW implementor can naturally do things like the following:

* Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller.

* Use PG_TRY(), as its overhead is negligible relative to connection establishment.

* If the commit fails, the TM asks the resolver to take care of committing the remote transaction, and returns success
to the user.
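A minimal sketch of the no-throw style the first bullet suggests, with plain malloc standing in for palloc_extended(MCXT_ALLOC_NO_OOM) and with an invented function name and GID; failure is reported through a return code so the caller can hand the transaction to the resolver instead of raising ERROR:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the no-throw allocation style: allocation
 * failure surfaces as a return code, not as longjmp'ing out of the
 * commit path.  malloc stands in for
 * palloc_extended(MCXT_ALLOC_NO_OOM); the function name and GID
 * format are made up for illustration. */
typedef enum { FDW_OK, FDW_NOMEM } FdwRc;

static FdwRc make_commit_command(const char *gid, char **cmd_out)
{
    size_t len = strlen("COMMIT PREPARED ''") + strlen(gid) + 1;
    char *cmd = malloc(len);       /* may return NULL instead of ERROR */

    if (cmd == NULL)
        return FDW_NOMEM;          /* caller hands the xact to the resolver */

    snprintf(cmd, len, "COMMIT PREPARED '%s'", gid);
    *cmd_out = cmd;
    return FDW_OK;
}
```

On FDW_NOMEM the TM would register the foreign transaction with the resolver and still report commit success to the user, as the third bullet describes.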
 


> Regarding parallel and asynchronous execution, I basically agree on
> supporting asynchronous execution as the XA specification also has,
> although I think it's better not to include it in the first version
> for simplisity.
> 
> Overall, my suggestion for the first version is to support synchronous
> execution of prepare, commit, and rollback, have one resolver process
> per database, and have resolver take 2nd phase of 2PC. As the next
> step we can add APIs for asynchronous execution, have multiple
> resolvers on one database and so on.

We don't have to rush to commit a patch that is likely to exhibit non-practical performance, as we still have much time
left for PG 14.  The design needs more thought toward the ideal goal, and refinement.  By making efforts to sort through
the ideal design, we may be able to avoid rework and API inconsistency.  As for the API, we haven't validated yet that
the FDW implementor can use XA, have we?
 



Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 29 Sep 2020 at 11:37, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > No. Please imagine a case where a user executes PREPARE TRANSACTION on
> > the transaction that modified data on foreign servers. The backend
> > process prepares both the local transaction and foreign transactions.
> > But another client can execute COMMIT PREPARED on the prepared
> > transaction. In this case, another backend newly connects foreign
> > servers and commits prepared foreign transactions. Therefore, the new
> > connection cache entry can be created during COMMIT PREPARED which
> > could lead to an error but since the local prepared transaction is
> > already committed the backend must not fail with an error.
> >
> > In the latter case, I’m assumed that the backend continues to retry
> > foreign transaction resolution until the user requests cancellation.
> > Please imagine the case where the server-A connects a foreign server
> > (say, server-B) and server-B connects another foreign server (say,
> > server-C). The transaction initiated on server-A modified the data on
> > both local and server-B which further modified the data on server-C
> > and executed COMMIT.  The backend process on server-A (say, backend-A)
> > sends PREPARE TRANSACTION to server-B then the backend process  on
> > server-B (say, backend-B) connected by backend-A prepares the local
> > transaction and further sends PREPARE TRANSACTION to server-C. Let’s
> > suppose a temporary connection failure happens between server-A and
> > server-B before the backend-A sending COMMIT PREPARED (i.g, 2nd phase
> > of 2PC). When the backend-A attempts to sends COMMIT PREPARED to
> > server-B it realizes that the connection to server-B was lost but
> > since the user doesn’t request cancellatino yet the backend-A retries
> > to connect server-B and suceeds. Since now that the backend-A
> > established a new connection to server-B, there is another backend
> > process on server-B (say, backend-B’). Since the backend-B’ doen’t
> > have a connection to server-C yet, it creates new connection cache
> > entry, which could lead to an error.  IOW, on server-B different
> > processes performed PREPARE TRANSACTION and COMMIT PREPARED and
> > the
> > later process created a connection cache entry.
>
> Thank you, I understood the situation.  I don't think it's a good design to not address practical performance during
> normal operation by fearing the rare error case.
>
> The transaction manager (TM) or the FDW implementor can naturally do things like the following:
>
> * Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller.
>
> * Use PG_TRY(), as its overhead is relatively negligible to connection establishment.

I suppose you mean that the FDW implementor uses PG_TRY() to catch an
error but does not do PG_RE_THROW(). I'm concerned about whether it's
safe to return control to the caller and continue trying to resolve
foreign transactions without either rethrowing the error or aborting
the transaction.

IMHO, something like "high performance but doesn't work correctly in a
rare failure case" is rather a bad design, especially for a
transaction management feature.

>
> * If the commit fails, the TM asks the resolver to take care of committing the remote transaction, and returns
> success to the user.
>
>
> > Regarding parallel and asynchronous execution, I basically agree on
> > supporting asynchronous execution as the XA specification also has,
> > although I think it's better not to include it in the first version
> > for simplisity.
> >
> > Overall, my suggestion for the first version is to support synchronous
> > execution of prepare, commit, and rollback, have one resolver process
> > per database, and have resolver take 2nd phase of 2PC. As the next
> > step we can add APIs for asynchronous execution, have multiple
> > resolvers on one database and so on.
>
> We don't have to rush to commit a patch that is likely to exhibit non-practical performance, as we still have much
> time left for PG 14.  The design needs more thought toward the ideal goal, and refinement.  By making efforts to sort
> through the ideal design, we may be able to avoid rework and API inconsistency.  As for the API, we haven't validated
> yet that the FDW implementor can use XA, have we?

Yes, we still need to check whether FDW implementations other than
postgres_fdw are able to support these APIs. I agree that we need more
discussion on the design. My suggestion is to start with a small,
simple feature as the first step and not try to include everything in
the first version.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 29 Sep 2020 at 15:03, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 29 Sep 2020 at 11:37, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > No. Please imagine a case where a user executes PREPARE TRANSACTION on
> > > the transaction that modified data on foreign servers. The backend
> > > process prepares both the local transaction and foreign transactions.
> > > But another client can execute COMMIT PREPARED on the prepared
> > > transaction. In this case, another backend newly connects foreign
> > > servers and commits prepared foreign transactions. Therefore, the new
> > > connection cache entry can be created during COMMIT PREPARED which
> > > could lead to an error but since the local prepared transaction is
> > > already committed the backend must not fail with an error.
> > >
> > > In the latter case, I’m assumed that the backend continues to retry
> > > foreign transaction resolution until the user requests cancellation.
> > > Please imagine the case where the server-A connects a foreign server
> > > (say, server-B) and server-B connects another foreign server (say,
> > > server-C). The transaction initiated on server-A modified the data on
> > > both local and server-B which further modified the data on server-C
> > > and executed COMMIT.  The backend process on server-A (say, backend-A)
> > > sends PREPARE TRANSACTION to server-B then the backend process  on
> > > server-B (say, backend-B) connected by backend-A prepares the local
> > > transaction and further sends PREPARE TRANSACTION to server-C. Let’s
> > > suppose a temporary connection failure happens between server-A and
> > > server-B before the backend-A sending COMMIT PREPARED (i.g, 2nd phase
> > > of 2PC). When the backend-A attempts to sends COMMIT PREPARED to
> > > server-B it realizes that the connection to server-B was lost but
> > > since the user doesn’t request cancellatino yet the backend-A retries
> > > to connect server-B and suceeds. Since now that the backend-A
> > > established a new connection to server-B, there is another backend
> > > process on server-B (say, backend-B’). Since the backend-B’ doen’t
> > > have a connection to server-C yet, it creates new connection cache
> > > entry, which could lead to an error.  IOW, on server-B different
> > > processes performed PREPARE TRANSACTION and COMMIT PREPARED and
> > > the
> > > later process created a connection cache entry.
> >
> > Thank you, I understood the situation.  I don't think it's a good design to not address practical performance
> > during normal operation by fearing the rare error case.
> >
> > The transaction manager (TM) or the FDW implementor can naturally do things like the following:
> >
> > * Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller.
> >
> > * Use PG_TRY(), as its overhead is relatively negligible to connection establishment.
>
> I suppose you mean that the FDW implementor uses PG_TRY() to catch an
> error but not do PG_RE_THROW(). I'm concerned that it's safe to return
> the control to the caller and continue trying to resolve foreign
> transactions without neither rethrowing an error nor transaction
> abort.
>
> IMHO, it's rather a bad design something like "high performance but
> doesn't work fine in a rare failure case", especially for the
> transaction management feature.

To avoid misunderstanding, I didn't mean to disregard the performance.
I mean that especially for the transaction management feature it's
essential to work correctly even in failure cases. So I hope we can
have a safe, robust, and probably simple design for the first version
that might have low performance but has the potential for performance
improvement, so that we will be able to try to improve performance
later.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> To avoid misunderstanding, I didn't mean to disregard the performance.
> I mean especially for the transaction management feature it's
> essential to work fine even in failure cases. So I hope we have a
> safe, robust, and probably simple design for the first version that
> might be low performance yet though but have a potential for
> performance improvement and we will be able to try to improve
> performance later.

Yes, correctness (safety?) is a basic premise.  I understand that given the time left for PG 14, we haven't yet given
up on a sound design that offers practical or normally expected performance.  I don't think the design has been thought
through well enough yet to say whether it's simple or complex.  At least, I don't believe doing the "send commit request,
perform commit on a remote server, and wait for reply" sequence one transaction at a time in turn is what this community
(and other DBMSs) would tolerate.  A kid's tricycle is safe, but it's not safe to ride a tricycle on the road.  Let's not
rush to commit and do our best!
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 30 Sep 2020 at 16:02, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > To avoid misunderstanding, I didn't mean to disregard the performance.
> > I mean especially for the transaction management feature it's
> > essential to work fine even in failure cases. So I hope we have a
> > safe, robust, and probably simple design for the first version that
> > might be low performance yet though but have a potential for
> > performance improvement and we will be able to try to improve
> > performance later.
>
> Yes, correctness (safety?) is a basic premise.  I understand that given the time left for PG 14, we haven't yet given
> up on a sound design that offers practical or normally expected performance.  I don't think the design has been thought
> through well enough yet to say whether it's simple or complex.  At least, I don't believe doing the "send commit
> request, perform commit on a remote server, and wait for reply" sequence one transaction at a time in turn is what this
> community (and other DBMSs) would tolerate.  A kid's tricycle is safe, but it's not safe to ride a tricycle on the
> road.  Let's not rush to commit and do our best!

Okay. I'd like to resolve my concern that I have repeatedly mentioned
and for which we haven't found a good solution yet. That is, how we
handle errors raised by FDW transaction callbacks while
committing/rolling back prepared foreign transactions. Actually, this
has already been discussed before[1], and we concluded at that time
that using a background worker to commit/roll back foreign prepared
transactions is the best way.

Anyway, let me summarize the discussion on this issue so far. With
your idea, after the local commit, the backend process directly calls
the transaction FDW API to commit the foreign prepared transactions.
However, an error (i.e., ereport(ERROR)) is likely to happen during
that for various reasons. It could be an OOM during memory allocation,
a connection error, whatever. In case an error happens while committing
prepared foreign transactions, the user will get the error, but it's
too late. The local transaction and possibly other foreign prepared
transactions have already been committed. You proposed the first idea
to avoid such a situation: the FDW implementor can write the code
while trying to reduce the possibility of errors happening as much as
possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and
hash_search(HASH_ENTER_NULL), but I think it's not a comprehensive
solution. They might miss something, not know about it, or use other
functions provided by the core that could lead to an error. Another
idea is to use PG_TRY() and PG_CATCH(). IIUC, with this idea the FDW
implementor catches an error but ignores it rather than rethrowing it
with PG_RE_THROW(), in order to return control to the core after an
error. I'm really not sure that's a correct usage of those macros. In
addition, after returning to the core, it will retry resolving the
same or other foreign transactions. That is, after ignoring an error,
the core needs to continue working and possibly call transaction
callbacks of other FDW implementations.
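Since PG_TRY()/PG_CATCH() are built on sigsetjmp(), the disputed pattern can be sketched with plain setjmp()/longjmp() (all names here are hypothetical): the "error" is caught and turned into a return code instead of being rethrown. Whether continuing after such a swallowed error is safe is exactly the open question:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* PG_TRY()/PG_CATCH() are built on sigsetjmp/siglongjmp; this plain
 * setjmp sketch shows the disputed pattern: catch the error and turn
 * it into a return code instead of re-throwing (PG_RE_THROW()). */
static jmp_buf catch_point;

static void commit_foreign_xact(bool fail)
{
    if (fail)
        longjmp(catch_point, 1);   /* stands in for ereport(ERROR) */
    /* ... send COMMIT PREPARED to the foreign server ... */
}

/* Returns true on success, false if an "error" was caught.  After a
 * caught error, the caller would go on resolving the remaining foreign
 * transactions rather than aborting. */
static bool try_commit(bool fail)
{
    if (setjmp(catch_point) == 0) {
        commit_foreign_xact(fail);
        return true;
    }
    return false;                  /* caught; control returns to caller */
}
```

The real concern is what state (memory contexts, resource owners, held locks) the swallowed error leaves behind, which this sketch deliberately does not model.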

Regards,

[1]
https://www.postgresql.org/message-id/CA%2BTgmoY%3DVkHrzXD%3Djw5DA%2BPp-ePW_6_v5n%2BTJk40s5Q9VXY-Pw%40mail.gmail.com

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> You proposed the first idea
> to avoid such a situation that FDW implementor can write the code
> while trying to reduce the possibility of errors happening as much as
> possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and
> hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive
> solution. They might miss, not know it, or use other functions
> provided by the core that could lead an error.

We can give the guideline in the manual, can't we?  It should not be especially difficult for the FDW implementor
compared to other PostgreSQL extensibility features that have their own rules -- table/index AM, user-defined C
functions, trigger functions in C, user-defined data types, hooks, etc.  And the PostgreSQL functions that the FDW
implementor would use to implement their commit will be very limited, won't they?  Because most of the commit processing
is performed in the resource manager's library (e.g., the Oracle and MySQL client libraries.)

(Before that, the developer of server-side modules is not given any information on what functions (like palloc) are
available in the manual, is he?)


> Another idea is to use
> PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches
> an error but ignores it rather than rethrowing by PG_RE_THROW() in
> order to return the control to the core after an error. I’m really not
> sure it’s a correct usage of those macros. In addition, after
> returning to the core, it will retry to resolve the same or other
> foreign transactions. That is, after ignoring an error, the core needs
> to continue working and possibly call transaction callbacks of other
> FDW implementations.

No, not ignore the error.  The FDW can emit a WARNING, LOG, or NOTICE message, and return an error code to the TM.  The TM can also emit a message like:
 

WARNING:  failed to commit part of a transaction on the foreign server 'XXX'
HINT:  The server continues to try committing the remote transaction.

Then the TM asks the resolver to take care of committing the remote transaction, and acknowledges the commit success to the client.  The relevant return codes of xa_commit() are:
 

--------------------------------------------------
[XAER_RMERR] 
An error occurred in committing the work performed on behalf of the transaction 
branch and the branch’s work has been rolled back. Note that returning this error 
signals a catastrophic event to a transaction manager since other resource 
managers may successfully commit their work on behalf of this branch. This error 
should be returned only when a resource manager concludes that it can never 
commit the branch and that it cannot hold the branch’s resources in a prepared 
state. Otherwise, [XA_RETRY] should be returned. 

[XAER_RMFAIL] 
An error occurred that makes the resource manager unavailable. 
--------------------------------------------------


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > You proposed the first idea
> > to avoid such a situation that FDW implementor can write the code
> > while trying to reduce the possibility of errors happening as much as
> > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and
> > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive
> > solution. They might miss, not know it, or use other functions
> > provided by the core that could lead an error.
>
> We can give the guideline in the manual, can't we?  It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc.  And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they?  Because most of the commit processing is performed in the resource manager's library (e.g. Oracle and MySQL client library.)

Yeah, if we can trust FDW implementors to implement these APIs
properly while following the guideline, then providing such a
guideline is a good idea. But I'm not sure all FDW implementors are
able to do that, and even if a user uses an FDW whose transaction APIs
don't follow the guideline, the user won't realize it. IMO it's better
to design the feature so that its reliability (correctness?) doesn't
depend on external programs, although I might be too worried.

>
>
> > Another idea is to use
> > PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches
> > an error but ignores it rather than rethrowing by PG_RE_THROW() in
> > order to return the control to the core after an error. I’m really not
> > sure it’s a correct usage of those macros. In addition, after
> > returning to the core, it will retry to resolve the same or other
> > foreign transactions. That is, after ignoring an error, the core needs
> > to continue working and possibly call transaction callbacks of other
> > FDW implementations.
>
> No, not ignore the error.  The FDW can emit a WARNING, LOG, or NOTICE message, and return an error code to TM.  TM can also emit a message like:
>
> WARNING:  failed to commit part of a transaction on the foreign server 'XXX'
> HINT:  The server continues to try committing the remote transaction.
>
> Then TM asks the resolver to take care of committing the remote transaction, and acknowledge the commit success to the client.

It seems that if resolution fails, the backend would return an
acknowledgment of COMMIT to the client and the resolver process would
resolve the foreign prepared transactions in the background. So we can
ensure that the distributed transaction is completed at the time the
client gets the acknowledgment of COMMIT only if the second phase of
2PC completes successfully on the first attempt. OTOH, if it fails for
whatever reason, there is no such guarantee. From an optimistic
perspective, i.e., assuming failures are unlikely to happen, it will
work well, but IMO it's not uncommon to fail to resolve foreign
transactions due to network issues, especially in an unreliable
network environment, for example a geo-distributed database. So I
think it will end up requiring the client to check whether preceding
distributed transactions have completed in order to see the results of
those transactions.

We could retry the foreign transaction resolution before leaving it to
the resolver process, but the problem still remains that, even after
an error, the core has to continue trying to resolve foreign
transactions without either aborting the transaction or rethrowing.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Tue, Oct 6, 2020 at 7:22 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > You proposed the first idea
> > > to avoid such a situation that FDW implementor can write the code
> > > while trying to reduce the possibility of errors happening as much as
> > > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and
> > > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive
> > > solution. They might miss, not know it, or use other functions
> > > provided by the core that could lead an error.
> >
> > We can give the guideline in the manual, can't we?  It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc.  And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they?  Because most of the commit processing is performed in the resource manager's library (e.g. Oracle and MySQL client library.)
>
> Yeah, if we think FDW implementors properly implement these APIs while
> following the guideline, giving the guideline is a good idea. But I’m
> not sure all FDW implementors are able to do that and even if the user
> uses an FDW whose transaction APIs don’t follow the guideline, the
> user won’t realize it. IMO it’s better to design the feature while not
> depending on external programs for reliability (correctness?) of this
> feature, although I might be too worried.
>

+1 for that. I don't think it's even in the hands of implementers to
avoid throwing an error under all conditions.

--
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Oct 6, 2020 at 10:52 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > You proposed the first idea
> > > to avoid such a situation that FDW implementor can write the code
> > > while trying to reduce the possibility of errors happening as much as
> > > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and
> > > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive
> > > solution. They might miss, not know it, or use other functions
> > > provided by the core that could lead an error.
> >
> > We can give the guideline in the manual, can't we?  It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc.  And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they?  Because most of the commit processing is performed in the resource manager's library (e.g. Oracle and MySQL client library.)
>
> Yeah, if we think FDW implementors properly implement these APIs while
> following the guideline, giving the guideline is a good idea. But I’m
> not sure all FDW implementors are able to do that and even if the user
> uses an FDW whose transaction APIs don’t follow the guideline, the
> user won’t realize it. IMO it’s better to design the feature while not
> depending on external programs for reliability (correctness?) of this
> feature, although I might be too worried.
>

After more thought on Tsunakawa-san's idea, it seems to require the
following conditions:

* At least postgres_fdw is able to implement these APIs while
guaranteeing that no error happens.
* A certain number of FDWs (or the majority of FDWs) can do so in a
similar way to postgres_fdw, using the guideline and probably
postgres_fdw as a reference.

These are necessary for FDW implementors to implement the APIs
following the guideline, and for the core to trust them.

As far as postgres_fdw goes, what we need to do when resolving
(committing) a foreign transaction is to get a connection from the
connection cache (or create one and connect if not found), construct
an SQL query (COMMIT/ROLLBACK PREPARED with the identifier) using a
fixed-size buffer, send the query, and get the result. The places
where an error could be raised are limited. In case of a failure such
as a connection error, the FDW can return false to the core along with
a flag asking the core to retry. The core will then retry resolving
the foreign transaction after some sleep. OTOH, if the FDW judges that
there is no hope of resolving the foreign transaction, it can return
false along with another flag asking the core to remove the entry and
not to retry. Also, the transaction resolution by the FDW needs to be
cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS().

Probably, as Tsunakawa-san also suggested, it's not impossible to
implement these APIs in postgres_fdw while guaranteeing that no error
happens, although I'm not sure about the code complexity. So I think
the first condition may hold, but I'm not sure about the second one,
particularly the interruptible part.

I thought we could support both ideas to get the benefits of each:
supporting Tsunakawa-san's idea, and then mine where necessary, with
the FDW choosing whether to ask the resolver process to perform the
second phase of 2PC. But that's not a good idea in terms of
complexity.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
Sorry for the late response.  (My PC has been behaving strangely after upgrading to Win10 2004.)

From: Masahiko Sawada <sawada.mshk@gmail.com>
> After more thoughts on Tsunakawa-san’s idea it seems to need the
> following conditions:
> 
> * At least postgres_fdw is viable to implement these APIs while
> guaranteeing not to happen any error.
> * A certain number of FDWs (or majority of FDWs) can do that in a
> similar way to postgres_fdw by using the guideline and probably
> postgres_fdw as a reference.
> 
> These are necessary for FDW implementors to implement APIs while
> following the guideline and for the core to trust them.
> 
> As far as postgres_fdw goes, what we need to do when committing a
> foreign transaction resolution is to get a connection from the
> connection cache or create and connect if not found, construct a SQL
> query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size
> buffer, send the query, and get the result. The possible place to
> raise an error is limited. In case of failures such as connection
> error FDW can return false to the core along with a flag indicating to
> ask the core retry. Then the core will retry to resolve foreign
> transactions after some sleep. OTOH if FDW sized up that there is no
> hope of resolving the foreign transaction, it also could return false
> to the core along with another flag indicating to remove the entry and
> not to retry. Also, the transaction resolution by FDW needs to be
> cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS().
> 
> Probably, as Tsunakawa-san also suggested, it’s not impossible to
> implement these APIs in postgres_fdw while guaranteeing not to happen
> any error, although not sure the code complexity. So I think the first
> condition may be true but not sure about the second assumption,
> particularly about the interruptible part.

Yeah, I expect the second-phase commit should not be difficult for the FDW developer.

As for the cancellation during commit retry, I don't think we necessarily have to make the TM responsible for retrying the commits.  Many DBMSs have their own timeout functionality such as connection timeout, socket timeout, and statement timeout.  Users can set those parameters in the foreign server options based on how long the end user can wait.  That is, the TM calls the FDW's commit routine just once.
 

If the TM makes efforts to retry commits, the duration would be from a few seconds to 30 seconds.  Then, we can hold back the cancellation during that period.
 


> I thought we could support both ideas to get their pros; supporting
> Tsunakawa-san's idea and then my idea if necessary, and FDW can choose
> whether to ask the resolver process to perform 2nd phase of 2PC or
> not. But it's not a good idea in terms of complexity.

I don't feel the need for leaving the commit to the resolver during normal operation.


> It seems like if failed to resolve, the backend would return an
> acknowledgment of COMMIT to the client and the resolver process
> resolves foreign prepared transactions in the background. So we can
> ensure that the distributed transaction is completed at the time when
> the client got an acknowledgment of COMMIT if 2nd phase of 2PC is
> successfully completed in the first attempts. OTOH, if it failed for
> whatever reason, there is no such guarantee. From an optimistic
> perspective, i.g., the failures are unlikely to happen, it will work
> well but IMO it’s not uncommon to fail to resolve foreign transactions
> due to network issue, especially in an unreliable network environment
> for example geo-distributed database. So I think it will end up
> requiring the client to check if preceding distributed transactions
> are completed or not in order to see the results of these
> transactions.

That issue exists with any method, doesn't it?


 Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, 8 Oct 2020 at 18:05, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> Sorry to be late to respond.  (My PC is behaving strangely after upgrading Win10 2004)
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > After more thoughts on Tsunakawa-san’s idea it seems to need the
> > following conditions:
> >
> > * At least postgres_fdw is viable to implement these APIs while
> > guaranteeing not to happen any error.
> > * A certain number of FDWs (or majority of FDWs) can do that in a
> > similar way to postgres_fdw by using the guideline and probably
> > postgres_fdw as a reference.
> >
> > These are necessary for FDW implementors to implement APIs while
> > following the guideline and for the core to trust them.
> >
> > As far as postgres_fdw goes, what we need to do when committing a
> > foreign transaction resolution is to get a connection from the
> > connection cache or create and connect if not found, construct a SQL
> > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size
> > buffer, send the query, and get the result. The possible place to
> > raise an error is limited. In case of failures such as connection
> > error FDW can return false to the core along with a flag indicating to
> > ask the core retry. Then the core will retry to resolve foreign
> > transactions after some sleep. OTOH if FDW sized up that there is no
> > hope of resolving the foreign transaction, it also could return false
> > to the core along with another flag indicating to remove the entry and
> > not to retry. Also, the transaction resolution by FDW needs to be
> > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS().
> >
> > Probably, as Tsunakawa-san also suggested, it’s not impossible to
> > implement these APIs in postgres_fdw while guaranteeing not to happen
> > any error, although not sure the code complexity. So I think the first
> > condition may be true but not sure about the second assumption,
> > particularly about the interruptible part.
>
> Yeah, I expect the commit of the second phase should not be difficult for the FDW developer.
>
> As for the cancellation during commit retry, I don't think we necessarily have to make the TM responsible for retrying the commits.  Many DBMSs have their own timeout functionality such as connection timeout, socket timeout, and statement timeout.
> Users can set those parameters in the foreign server options based on how long the end user can wait.  That is, TM calls FDW's commit routine just once.

What about temporary network failures? I think there are users who
don't want to give up resolving foreign transactions that failed due
to a temporary network failure. Or they might even want to wait for
transaction completion until they send a cancel request. If we want to
call the commit routine only once and therefore want the FDW to retry
connecting to the foreign server within the call, it means we require
all FDW implementors to write retry-loop code that is interruptible
and guaranteed not to raise an error, which increases the difficulty.

Also, what if the user sets the statement timeout to 60 sec and wants
to cancel the wait after 5 sec by pressing Ctrl-C? You mentioned that
client libraries of other DBMSs don't have asynchronous execution
functionality. If the SQL execution function is not interruptible, the
user will end up waiting for 60 sec, which seems not good.

> If the TM makes efforts to retry commits, the duration would be from a few seconds to 30 seconds.  Then, we can hold back the cancellation during that period.
>
>
> > I thought we could support both ideas to get their pros; supporting
> > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose
> > whether to ask the resolver process to perform 2nd phase of 2PC or
> > not. But it's not a good idea in terms of complexity.
>
> I don't feel the need for leaving the commit to the resolver during normal operation.

I meant it's for FDWs that cannot guarantee that no error happens
during resolution.

> > It seems like if failed to resolve, the backend would return an
> > acknowledgment of COMMIT to the client and the resolver process
> > resolves foreign prepared transactions in the background. So we can
> > ensure that the distributed transaction is completed at the time when
> > the client got an acknowledgment of COMMIT if 2nd phase of 2PC is
> > successfully completed in the first attempts. OTOH, if it failed for
> > whatever reason, there is no such guarantee. From an optimistic
> > perspective, i.g., the failures are unlikely to happen, it will work
> > well but IMO it’s not uncommon to fail to resolve foreign transactions
> > due to network issue, especially in an unreliable network environment
> > for example geo-distributed database. So I think it will end up
> > requiring the client to check if preceding distributed transactions
> > are completed or not in order to see the results of these
> > transactions.
>
> That issue exists with any method, doesn't it?

Yes, but if we don't retry resolving foreign transactions at all in an
unreliable network environment, the user might end up requiring every
transaction to check the status of the foreign transactions of the
previous distributed transaction before it starts. If we allow
retrying, I guess we ease that somewhat.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> What about temporary network failures? I think there are users who
> don't want to give up resolving foreign transactions failed due to a
> temporary network failure. Or even they might want to wait for
> transaction completion until they send a cancel request. If we want to
> call the commit routine only once and therefore want FDW to retry
> connecting the foreign server within the call, it means we require all
> FDW implementors to write a retry loop code that is interruptible and
> ensures not to raise an error, which increases difficulty.
>
> Yes, but if we don’t retry to resolve foreign transactions at all on
> an unreliable network environment, the user might end up requiring
> every transaction to check the status of foreign transactions of the
> previous distributed transaction before starts. If we allow to do
> retry, I guess we ease that somewhat.

OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
 

Then, we can have a commit retry timeout or retry count as the following WebLogic manual describes.  (I couldn't quickly find the English manual, so below is from the Japanese one.  I quoted some text that went through machine translation, which appears a bit strange.)
 

https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
--------------------------------------------------
Abandon timeout
Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction.

In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abort transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction.
 
--------------------------------------------------



> Also, what if the user sets the statement timeout to 60 sec and they
> want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> that client libraries of other DBMSs don't have asynchronous execution
> functionality. If the SQL execution function is not interruptible, the
> user will end up waiting for 60 sec, which seems not good.

FDW functions can be uninterruptible in general, can't they?  We experienced that odbc_fdw didn't allow cancellation of SQL execution.
 


 Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in 
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > What about temporary network failures? I think there are users who
> > don't want to give up resolving foreign transactions failed due to a
> > temporary network failure. Or even they might want to wait for
> > transaction completion until they send a cancel request. If we want to
> > call the commit routine only once and therefore want FDW to retry
> > connecting the foreign server within the call, it means we require all
> > FDW implementors to write a retry loop code that is interruptible and
> > ensures not to raise an error, which increases difficulty.
> >
> > Yes, but if we don’t retry to resolve foreign transactions at all on
> > an unreliable network environment, the user might end up requiring
> > every transaction to check the status of foreign transactions of the
> > previous distributed transaction before starts. If we allow to do
> > retry, I guess we ease that somewhat.
> 
> OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
 

I may be missing something, though...

I don't understand why we hate ERRORs from fdw-2pc-commit routine so
much. I think remote-commits should be performed before local commit
passes the point-of-no-return and the v26-0002 actually places
AtEOXact_FdwXact() before the critical section.

(FWIW, I think remote commits should be performed by backends, not by
another process, because backends should wait for all remote commits
to end anyway and it is simpler. If we want to run multiple remote
commits in parallel, we could do that by adding some async-waiting
interface.)

> Then, we can have a commit retry timeout or retry count like the following WebLogic manual says.  (I couldn't quickly find the English manual, so below is in Japanese.  I quoted some text that got through machine translation, which appears a bit strange.)
 
> 
> https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> --------------------------------------------------
> Abandon timeout
> Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction.
 
> 
> In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abort transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction.
 
> --------------------------------------------------

That's not a retry timeout but a timeout for the total time of all
second-phase commits.  But I think it would be sufficient.  Even if an
FDW could retry the 2pc-commit, that is a matter for that FDW, and the
core has nothing to do with it.

> > Also, what if the user sets the statement timeout to 60 sec and they
> > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> > that client libraries of other DBMSs don't have asynchronous execution
> > functionality. If the SQL execution function is not interruptible, the
> > user will end up waiting for 60 sec, which seems not good.

I think fdw-2pc-commit can be interruptible safely as far as we run
the remote commits before entering the critical section of the local
commit.

> FDW functions can be uninterruptible in general, aren't they?  We experienced that odbc_fdw didn't allow cancellation of SQL execution.
 

At least postgres_fdw is interruptible while waiting for the remote:

create view lt as select 1 as slp from (select pg_sleep(10)) t;
create foreign table ft(slp int) server sv1 options (table_name 'lt');
select * from ft;
^CCancel request sent
ERROR:  canceling statement due to user request

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> much. I think remote-commits should be performed before local commit
> passes the point-of-no-return and the v26-0002 actually places
> AtEOXact_FdwXact() before the critical section.

I don't hate ERROR, but it would be simpler and more understandable for the FDW commit routine to just return control to the caller (TM) and let the TM do whatever is appropriate (ask the resolver to handle the failed commit, and continue to request the next FDW to commit).
 


> > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> > --------------------------------------------------
> > Abandon timeout
> > Specifies the maximum time (in seconds) that the transaction manager
> attempts to complete the second phase of a two-phase commit transaction.
> >
> > In the second phase of a two-phase commit transaction, the transaction
> manager attempts to complete the transaction until all resource managers
> indicate that the transaction is complete. After the abort transaction timer
> expires, no attempt is made to resolve the transaction. If the transaction enters
> a ready state before it is destroyed, the transaction manager rolls back the
> transaction and releases the held lock on behalf of the destroyed transaction.
> > --------------------------------------------------
> 
> That's not a retry timeout but a timeout for total time of all
> 2nd-phase-commits.  But I think it would be sufficient.  Even if an
> fdw could retry 2pc-commit, it's a matter of that fdw and the core has
> nothing to do with.

Yeah, the WebLogic documentation doesn't say whether it performs retries during the timeout period.  I just cited it as an example that has a timeout parameter for the second phase of 2PC.
 


> At least postgres_fdw is interruptible while waiting the remote.
> 
> create view lt as select 1 as slp from (select pg_sleep(10)) t;
> create foreign table ft(slp int) server sv1 options (table_name 'lt');
> select * from ft;
> ^CCancel request sent
> ERROR:  canceling statement due to user request

I'm afraid the cancellation doesn't work while postgres_fdw is trying to connect to a down server.  Also, the Postgres manual doesn't say anything about cancellation, so we cannot expect FDWs to respond to a user's cancel request.
 


 Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 9 Oct 2020 at 11:33, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > What about temporary network failures? I think there are users who
> > don't want to give up resolving foreign transactions failed due to a
> > temporary network failure. Or even they might want to wait for
> > transaction completion until they send a cancel request. If we want to
> > call the commit routine only once and therefore want FDW to retry
> > connecting the foreign server within the call, it means we require all
> > FDW implementors to write a retry loop code that is interruptible and
> > ensures not to raise an error, which increases difficulty.
> >
> > Yes, but if we don’t retry to resolve foreign transactions at all on
> > an unreliable network environment, the user might end up requiring
> > every transaction to check the status of foreign transactions of the
> > previous distributed transaction before starts. If we allow to do
> > retry, I guess we ease that somewhat.
>
> OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.

Well, I agree that it's not mandatory. I think it's better if the user
can choose.

I also doubt how useful the per-foreign-server timeout setting you
mentioned before would be. For example, suppose a transaction involves
three foreign servers that have different timeout settings, and the
backend fails to commit on the first of the servers due to a timeout.
Does it attempt to commit on the other two servers, or does it give up
and return control to the client? In the former case, what if the
backend fails again on one of the other two servers due to a timeout?
The backend might end up waiting through all the timeouts, and in
practice the user is not aware of how many servers are involved in the
transaction (for example, in a sharding setup), so it seems hard to
predict the total timeout. In the latter case, the backend might have
succeeded in committing on the other two nodes. Also, the timeout
setting of the first foreign server is then effectively used as the
timeout for the whole foreign transaction resolution, yet the user
cannot control the order of resolution, so again the timeout seems
hard to predict. So if we have a timeout mechanism, I think it's
better if the user can control the timeout per transaction.
Probably the same is true for the retry.

>
> Then, we can have a commit retry timeout or retry count like the following WebLogic manual says.  (I couldn't quickly find the English manual, so below is in Japanese.  I quoted some text that went through machine translation, which appears a bit strange.)
>
> https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> --------------------------------------------------
> Abandon timeout
> Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction.
>
> In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abort transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction.
> --------------------------------------------------

Yeah, a per-transaction timeout for the second phase of 2PC seems a good idea.
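To make the idea concrete, here is a minimal sketch (in Python, with all names hypothetical, not any proposed Postgres API) of a resolver loop with such a per-transaction "abandon timeout": keep retrying the commit of each prepared foreign transaction on transient failures, and give up once a single per-transaction deadline expires, leaving the leftovers for manual or later automatic resolution.

```python
import time

def resolve_foreign_transactions(participants, commit_fn,
                                 abandon_timeout=60.0, retry_interval=1.0):
    """Try to commit each prepared foreign transaction, retrying on
    transient failures until a per-transaction deadline expires.

    Returns the participants still unresolved when the abandon timeout
    is hit; these would be left for manual (or later automatic)
    resolution, as in the WebLogic "abandon timeout" quoted above.
    """
    deadline = time.monotonic() + abandon_timeout
    pending = list(participants)
    while pending and time.monotonic() < deadline:
        still_pending = []
        for server in pending:
            try:
                commit_fn(server)          # e.g. issue COMMIT PREPARED 'gid'
            except ConnectionError:        # transient failure: retry later
                still_pending.append(server)
        pending = still_pending
        if pending:
            time.sleep(retry_interval)
    return pending  # unresolved after the deadline
```

Note that the deadline covers the whole transaction's resolution, however many servers are involved, which is exactly the predictability argument made above against per-server timeouts.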

>
>
>
> > Also, what if the user sets the statement timeout to 60 sec and they
> > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> > that client libraries of other DBMSs don't have asynchronous execution
> > functionality. If the SQL execution function is not interruptible, the
> > user will end up waiting for 60 sec, which seems not good.
>
> FDW functions can be uninterruptible in general, can't they?  We experienced that odbc_fdw didn't allow cancellation of SQL execution.

For example, postgres_fdw executes SQL in an asynchronous manner
using PQsendQuery(), PQconsumeInput(), PQgetResult() and so on
(see do_sql_command() and pgfdw_get_result()). Therefore, if the user
presses Ctrl-C, the remote query is canceled and an ERROR is raised.
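As a language-agnostic illustration of that pattern, here is a minimal Python sketch (the names are illustrative, not postgres_fdw's actual API): instead of blocking until the result arrives, wait on the connection's socket with a short poll interval and check for interrupts on each wakeup, roughly the way pgfdw_get_result() interleaves waiting on the socket with CHECK_FOR_INTERRUPTS().

```python
import select

def wait_for_result(sock, is_interrupted, poll_interval=0.1):
    """Wait for data on the connection socket, waking up periodically to
    check for interrupts -- the pattern postgres_fdw follows by using
    PQsendQuery()/PQconsumeInput()/PQgetResult() instead of a blocking
    PQexec().  `is_interrupted` stands in for CHECK_FOR_INTERRUPTS().
    """
    while True:
        if is_interrupted():
            # A real FDW would send a cancel request to the remote server
            # here and drain the connection before raising an ERROR.
            raise KeyboardInterrupt("canceling statement due to user request")
        readable, _, _ = select.select([sock], [], [], poll_interval)
        if readable:
            return sock.recv(4096)  # result bytes are ready
```

An FDW whose client library only offers a blocking "execute" call has no place to hang this interrupt check, which is the crux of the uninterruptibility problem discussed above.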


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > What about temporary network failures? I think there are users who
> > > don't want to give up resolving foreign transactions failed due to a
> > > temporary network failure. Or even they might want to wait for
> > > transaction completion until they send a cancel request. If we want to
> > > call the commit routine only once and therefore want FDW to retry
> > > connecting the foreign server within the call, it means we require all
> > > FDW implementors to write a retry loop code that is interruptible and
> > > ensures not to raise an error, which increases difficulty.
> > >
> > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > an unreliable network environment, the user might end up requiring
> > > every transaction to check the status of foreign transactions of the
> > > previous distributed transaction before starts. If we allow to do
> > > retry, I guess we ease that somewhat.
> >
> > OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
>
> I should be missing something, though...
>
> I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> much. I think remote-commits should be performed before local commit
> passes the point-of-no-return and the v26-0002 actually places
> AtEOXact_FdwXact() before the critical section.
>

So you're thinking the following sequence?

1. Prepare all foreign transactions.
2. Commit the all prepared foreign transactions.
3. Commit the local transaction.

Suppose we have the backend process call the commit routine: what if
one of the FDWs raises an ERROR while committing its foreign
transaction after the other foreign transactions have been committed?
The transaction will end up aborted, but some foreign transactions are
already committed. Also, what if the backend process fails to commit
the local transaction? Since it has already committed all the foreign
transactions, it cannot ensure global atomicity in this case either.
Therefore, I think we should commit distributed transactions in the
following sequence:

1. Prepare all foreign transactions.
2. Commit the local transaction.
3. Commit the all prepared foreign transactions.

But this is still not a perfect solution. If we have the backend
process call the commit routine and an error happens while executing
the commit routine of an FDW (i.e., at step 3), it's too late to
report an error to the client because we have already committed the
local transaction. So the current solution is to have a background
process commit the foreign transactions, so that the backend can just
wait without the possibility of errors.
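The sequence and the division of labor described here can be sketched as follows (Python, purely illustrative; all names are hypothetical stand-ins for the proposed FdwXact machinery, not actual Postgres code):

```python
def commit_distributed(local, foreign_servers, resolver_queue):
    """Sketch of the commit sequence described above: prepare all foreign
    transactions, commit locally (the decision point), then hand the
    prepared foreign transactions to a background resolver instead of
    committing them in the backend, so a late FDW error can never be
    raised after the local commit."""
    prepared = []
    for server in foreign_servers:
        server.prepare()                   # may raise: safe, nothing committed yet
        prepared.append(server)
    try:
        local.commit()                     # the global outcome is decided here
    except Exception:
        for server in prepared:            # local commit failed: global abort
            server.rollback_prepared()
        raise
    for server in prepared:                # past the point of no return: only
        resolver_queue.append(server)      # commit is allowed; resolver retries
```

The backend then merely waits (interruptibly) for the resolver to drain the queue, which is why FDW errors during step 3 never reach the client as an ERROR for an already-committed transaction.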

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I also doubt how useful the per-foreign-server timeout setting you
> mentioned before. For example, suppose the transaction involves with
> three foreign servers that have different timeout setting, what if the
> backend failed to commit on the first one of the server due to
> timeout? Does it attempt to commit on the other two servers? Or does
> it give up and return the control to the client? In the former case,
> what if the backend failed again on one of the other two servers due
> to timeout? The backend might end up waiting for all timeouts and in
> practice the user is not aware of how many servers are involved with
> the transaction, for example in a sharding. So It seems to be hard to
> predict the total timeout. In the latter case, the backend might
> succeed to commit on the other two nodes. Also, the timeout setting of
> the first foreign server virtually is used as the whole foreign
> transaction resolution timeout. However, the user cannot control the
> order of resolution. So again it seems to be hard for the user to
> predict the timeout. So If we have a timeout mechanism, I think it's
> better if the user can control the timeout for each transaction.
> Probably the same is true for the retry.

I agree that the user should control the timeout per transaction, not per FDW.  I was just not sure whether the Postgres core can define the timeout parameter and the FDWs can follow its setting.  However, JTA defines a transaction timeout API (not a commit timeout, though), and each RM can choose to implement it.  So I think we can likewise define the parameter and/or routines for the timeout in core.
 


--------------------------------------------------
public interface javax.transaction.xa.XAResource 

int getTransactionTimeout() throws XAException 
This method returns the transaction timeout value set for this XAResource instance. If XAResource.
setTransactionTimeout was not used prior to invoking this method, the return value is the 
default timeout set for the resource manager; otherwise, the value used in the previous setTransactionTimeout call 
is returned. 

Throws: XAException 
An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL. 

Returns: 
The transaction timeout values in seconds. 

boolean setTransactionTimeout(int seconds) throws XAException 
This method sets the transaction timeout value for this XAResource instance. Once set, this timeout value 
is effective until setTransactionTimeout is invoked again with a different value. To reset the timeout 
value to the default value used by the resource manager, set the value to zero. 

If the timeout operation is performed successfully, the method returns true; otherwise false. If a resource 
manager does not support transaction timeout value to be set explicitly, this method returns false. 

Parameters:

 seconds 
A positive integer specifying the timeout value in seconds. Zero resets the transaction timeout 
value to the default one used by the resource manager. A negative value results in XAException 
being thrown with the XAER_INVAL error code. 

Returns: 
true if transaction timeout value is set successfully; otherwise false. 

Throws: XAException 
An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL, or 
XAER_INVAL. 
--------------------------------------------------



> For example, postgres_fdw executes SQL in an asynchronous manner
> using PQsendQuery(), PQconsumeInput(), PQgetResult() and so on
> (see do_sql_command() and pgfdw_get_result()). Therefore, if the user
> pressed Ctrl-C, the remote query would be canceled and raise an ERROR.

Yeah, as I replied to Horiguchi-san, postgres_fdw can cancel queries.  But postgres_fdw is not ready to cancel connection establishment, is it?  At present, the user needs to set the connect_timeout parameter on the foreign server to a reasonably short time so that it can respond quickly to cancellation requests.  Alternatively, we can modify postgres_fdw to use libpq's asynchronous connect functions.
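For illustration, here is a rough Python analogue of what libpq's asynchronous connect functions (PQconnectStart()/PQconnectPoll()) make possible: a connection attempt that polls in a loop, so it can honor both a deadline and a user cancel request instead of blocking inside connect() the way a plain PQconnectdb() does. The function and parameter names are illustrative only.

```python
import errno
import select
import socket
import time

def cancellable_connect(host, port, deadline, is_interrupted):
    """Start a non-blocking connect and poll it, so either a cancel
    request (is_interrupted) or a deadline can abort the attempt,
    in the style of PQconnectStart()/PQconnectPoll()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, "connect failed immediately")
    while time.monotonic() < deadline:
        if is_interrupted():               # user pressed Ctrl-C
            sock.close()
            raise KeyboardInterrupt("canceling connection attempt")
        _, writable, _ = select.select([], [sock], [], 0.1)
        if writable:
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                sock.close()
                raise OSError(err, "connect failed")
            return sock                    # connected
    sock.close()
    raise TimeoutError("connection attempt timed out")
```

With a blocking connect, by contrast, the only knob is connect_timeout on the remote side of the wait, which is exactly the workaround described above.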
 

Another issue is that the Postgres manual does not stipulate anything about cancellation of FDW processing.  That's why I said that the current FDW interface does not support cancellation in general.  Of course, I think we can stipulate the ability to cancel processing in the FDW interface.
 


 Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 12 Oct 2020 at 11:08, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > I also doubt how useful the per-foreign-server timeout setting you
> > mentioned before. For example, suppose the transaction involves with
> > three foreign servers that have different timeout setting, what if the
> > backend failed to commit on the first one of the server due to
> > timeout? Does it attempt to commit on the other two servers? Or does
> > it give up and return the control to the client? In the former case,
> > what if the backend failed again on one of the other two servers due
> > to timeout? The backend might end up waiting for all timeouts and in
> > practice the user is not aware of how many servers are involved with
> > the transaction, for example in a sharding. So It seems to be hard to
> > predict the total timeout. In the latter case, the backend might
> > succeed to commit on the other two nodes. Also, the timeout setting of
> > the first foreign server virtually is used as the whole foreign
> > transaction resolution timeout. However, the user cannot control the
> > order of resolution. So again it seems to be hard for the user to
> > predict the timeout. So If we have a timeout mechanism, I think it's
> > better if the user can control the timeout for each transaction.
> > Probably the same is true for the retry.
>
> I agree that the user should control the timeout per transaction, not per FDW.  I was just not sure whether the Postgres core can define the timeout parameter and the FDWs can follow its setting.  However, JTA defines a transaction timeout API (not a commit timeout, though), and each RM can choose to implement it.  So I think we can likewise define the parameter and/or routines for the timeout in core.

I was thinking of having a GUC timeout parameter like statement_timeout:
the backend waits up to that value when resolving foreign transactions.
But this idea seems different: the FDW would set its timeout via a
transaction timeout API, is that right? But even if the FDW can set
the timeout using a transaction timeout API, the problem remains that
client libraries for some DBMSs don't support interruptible functions.
The user can set a short timeout, but that also leads to unnecessary
timeouts. Thoughts?

>
>
> --------------------------------------------------
> public interface javax.transaction.xa.XAResource
>
> int getTransactionTimeout() throws XAException
> This method returns the transaction timeout value set for this XAResource instance. If XAResource.
> setTransactionTimeout was not used prior to invoking this method, the return value is the
> default timeout set for the resource manager; otherwise, the value used in the previous setTransactionTimeout call
> is returned.
>
> Throws: XAException
> An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL.
>
> Returns:
> The transaction timeout values in seconds.
>
> boolean setTransactionTimeout(int seconds) throws XAException
> This method sets the transaction timeout value for this XAResource instance. Once set, this timeout value
> is effective until setTransactionTimeout is invoked again with a different value. To reset the timeout
> value to the default value used by the resource manager, set the value to zero.
>
> If the timeout operation is performed successfully, the method returns true; otherwise false. If a resource
> manager does not support transaction timeout value to be set explicitly, this method returns false.
>
> Parameters:
>
>  seconds
> A positive integer specifying the timeout value in seconds. Zero resets the transaction timeout
> value to the default one used by the resource manager. A negative value results in XAException
> being thrown with the XAER_INVAL error code.
>
> Returns:
> true if transaction timeout value is set successfully; otherwise false.
>
> Throws: XAException
> An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL, or
> XAER_INVAL.
> --------------------------------------------------
>
>
>
> > For example, postgres_fdw executes SQL in an asynchronous manner
> > using PQsendQuery(), PQconsumeInput(), PQgetResult() and so on
> > (see do_sql_command() and pgfdw_get_result()). Therefore, if the user
> > pressed Ctrl-C, the remote query would be canceled and raise an ERROR.
>
> Yeah, as I replied to Horiguchi-san, postgres_fdw can cancel queries.  But postgres_fdw is not ready to cancel connection establishment, is it?  At present, the user needs to set the connect_timeout parameter on the foreign server to a reasonably short time so that it can respond quickly to cancellation requests.  Alternatively, we can modify postgres_fdw to use libpq's asynchronous connect functions.

Yes, I think using asynchronous connect functions seems a good idea.

> Another issue is that the Postgres manual does not stipulate anything about cancellation of FDW processing.  That's why I said that the current FDW interface does not support cancellation in general.  Of course, I think we can stipulate the ability to cancel processing in the FDW interface.

Yeah, it's the FDW developer's responsibility to write code that
executes the remote SQL in an interruptible way. +1 for adding that to
the docs.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I was thinking to have a GUC timeout parameter like statement_timeout.
> The backend waits for the setting value when resolving foreign
> transactions.

Me too.


> But this idea seems different. FDW can set its timeout
> via a transaction timeout API, is that right?

I'm not perfectly sure about how the TM (application server) works, but probably not.  The TM has a configuration parameter for the transaction timeout, and the TM calls XAResource.setTransactionTimeout() with that or a smaller value as the argument.
 


> But even if FDW can set
> the timeout using a transaction timeout API, the problem that client
> libraries for some DBMS don't support interruptible functions still
> remains. The user can set a short time to the timeout but it also
> leads to unnecessary timeouts. Thoughts?

Unfortunately, I'm afraid we can do nothing about it.  If the DBMS's client library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a function to cancel processing in progress), then the Postgres user just finds that he can't cancel queries (just like we experienced with odbc_fdw).
 


 Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > What about temporary network failures? I think there are users who
> > > > don't want to give up resolving foreign transactions failed due to a
> > > > temporary network failure. Or even they might want to wait for
> > > > transaction completion until they send a cancel request. If we want to
> > > > call the commit routine only once and therefore want FDW to retry
> > > > connecting the foreign server within the call, it means we require all
> > > > FDW implementors to write a retry loop code that is interruptible and
> > > > ensures not to raise an error, which increases difficulty.
> > > >
> > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > an unreliable network environment, the user might end up requiring
> > > > every transaction to check the status of foreign transactions of the
> > > > previous distributed transaction before starts. If we allow to do
> > > > retry, I guess we ease that somewhat.
> > >
> > > OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
 
> >
> > I should be missing something, though...
> >
> > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > much. I think remote-commits should be performed before local commit
> > passes the point-of-no-return and the v26-0002 actually places
> > AtEOXact_FdwXact() before the critical section.
> >
> 
> So you're thinking the following sequence?
> 
> 1. Prepare all foreign transactions.
> 2. Commit the all prepared foreign transactions.
> 3. Commit the local transaction.
> 
> Suppose we have the backend process call the commit routine, what if
> one of FDW raises an ERROR during committing the foreign transaction
> after committing other foreign transactions? The transaction will end
> up with an abort but some foreign transactions are already committed.

Ok, I understand what you are aiming at.

It is apparently outside the focus of the two-phase commit
protocol. Each FDW server can try to keep the contract as far as its
ability reaches, but in the end that kind of failure is
inevitable. Even if we require FDW developers not to return until a
2pc-commit succeeds, that just leads the whole FDW cluster to freeze,
even in a case that is not extremely bad.

We have no other choice than shutting the server down (then the
succeeding server start removes the garbage commits) or continuing
to work, leaving some information in system storage (or reverting the
garbage commits). What we can do in that case is to provide an
automated way to resolve the inconsistency.

> Also, what if the backend process failed to commit the local
> transaction? Since it already committed all foreign transactions it
> cannot ensure the global atomicity in this case too. Therefore, I
> think we should commit the distributed transactions in the following
> sequence:

Ditto. It's out of the range of 2pc. Using 2pc for the local transaction
could reduce that kind of failure, but I'm not sure. 3pc, 4pc, ... npc
could reduce the probability but can't eliminate failure cases.

> 1. Prepare all foreign transactions.
> 2. Commit the local transaction.
> 3. Commit the all prepared foreign transactions.
> 
> But this is still not a perfect solution. If we have the backend

2pc is not a perfect solution in the first place. Attaching a similar
phase to it cannot make it "perfect".

> process call the commit routine and an error happens during executing
> the commit routine of an FDW (i.e., at step 3) it's too late to report
> an error to the client because we already committed the local
> transaction. So the current solution is to have a background process
> commit the foreign transactions so that the backend can just wait
> without the possibility of errors.

Whatever process tries to complete a transaction, the client must wait
for the transaction to end, and anyway that's just a freeze in the
client's view, unless you intend to respond to the local commit before
all participants complete.

I don't think most client applications would wait for a frozen
server forever.  We have the same issue at the time the client decides
to give up the transaction, or the leader session is killed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > > What about temporary network failures? I think there are users who
> > > > > don't want to give up resolving foreign transactions failed due to a
> > > > > temporary network failure. Or even they might want to wait for
> > > > > transaction completion until they send a cancel request. If we want to
> > > > > call the commit routine only once and therefore want FDW to retry
> > > > > connecting the foreign server within the call, it means we require all
> > > > > FDW implementors to write a retry loop code that is interruptible and
> > > > > ensures not to raise an error, which increases difficulty.
> > > > >
> > > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > > an unreliable network environment, the user might end up requiring
> > > > > every transaction to check the status of foreign transactions of the
> > > > > previous distributed transaction before starts. If we allow to do
> > > > > retry, I guess we ease that somewhat.
> > > >
> > > > OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
> > >
> > > I should be missing something, though...
> > >
> > > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > > much. I think remote-commits should be performed before local commit
> > > passes the point-of-no-return and the v26-0002 actually places
> > > AtEOXact_FdwXact() before the critical section.
> > >
> >
> > So you're thinking the following sequence?
> >
> > 1. Prepare all foreign transactions.
> > 2. Commit the all prepared foreign transactions.
> > 3. Commit the local transaction.
> >
> > Suppose we have the backend process call the commit routine, what if
> > one of FDW raises an ERROR during committing the foreign transaction
> > after committing other foreign transactions? The transaction will end
> > up with an abort but some foreign transactions are already committed.
>
> Ok, I understand what you are aiming.
>
> It is apparently out of the focus of the two-phase commit
> protocol. Each FDW server can try to keep the contract as far as its
> ability reaches, but in the end such kind of failure is
> inevitable. Even if we require FDW developers not to respond until a
> 2pc-commit succeeds, that just leads the whole FDW-cluster to freeze
> even not in an extremely bad case.
>
> We have no other choice than shutting the server down (then the
> succeeding server start removes the garbage commits) or continuing
> to work, leaving some information in system storage (or reverting the
> garbage commits). What we can do in that case is to provide an
> automated way to resolve the inconsistency.
>
> > Also, what if the backend process failed to commit the local
> > transaction? Since it already committed all foreign transactions it
> > cannot ensure the global atomicity in this case too. Therefore, I
> > think we should commit the distributed transactions in the following
> > sequence:
>
> Ditto. It's out of the range of 2pc. Using 2pc for the local transaction
> could reduce that kind of failure, but I'm not sure. 3pc, 4pc, ... npc
> could reduce the probability but can't eliminate failure cases.

IMO the problems I mentioned arise from the fact that the above
sequence doesn't really follow the 2pc protocol in the first place.

We can think of committing the local transaction without preparation,
while preparing the foreign transactions, as using 2pc with the
last-resource-transaction optimization (or last-agent
optimization)[1]. That is, we prepare all foreign transactions first,
and the local node is always the last resource to process. At this
point, the outcome of the distributed transaction depends entirely on
the fate of the last resource (i.e., the local transaction). If it
fails, the distributed transaction must be aborted by rolling back the
prepared foreign transactions. OTOH, if it succeeds, all prepared
foreign transactions must be committed. Therefore, we don't need to
prepare the last resource and can simply commit it. So if we want
to commit the local transaction without preparation, the local
transaction must be committed last. Since the above sequence
doesn't follow this protocol, we get such problems. I think if
we follow 2pc properly, these basic failures don't happen.
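A toy model of the two orderings (Python, purely illustrative; every name is hypothetical) shows why the last-agent ordering preserves atomicity while committing the foreign transactions before the local one does not:

```python
def outcome_commit_foreign_first(foreign, local_ok):
    """The problematic sequence: commit the prepared foreign transactions
    BEFORE the local commit.  If the local commit then fails, the foreign
    commits cannot be undone, so atomicity is lost."""
    states = {s: "committed" for s in foreign}            # step 2
    states["local"] = "committed" if local_ok else "aborted"  # step 3
    return states

def outcome_last_agent(foreign, local_ok):
    """The last-agent ordering: the unprepared local transaction commits
    last, so its fate can still drive the prepared participants."""
    if local_ok:
        states = {s: "committed" for s in foreign}        # commit prepared
        states["local"] = "committed"
    else:
        states = {s: "aborted" for s in foreign}          # roll back prepared
        states["local"] = "aborted"
    return states
```

In the last-agent ordering every participant always ends in the same state; in the foreign-first ordering a failed local commit leaves a mixed outcome, which is exactly the hazard described above.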

>
> > 1. Prepare all foreign transactions.
> > 2. Commit the local transaction.
> > 3. Commit the all prepared foreign transactions.
> >
> > But this is still not a perfect solution. If we have the backend
>
> 2pc is not a perfect solution in the first place. Attaching a similar
> phase to it cannot make it "perfect".
>
> > process call the commit routine and an error happens during executing
> > the commit routine of an FDW (i.e., at step 3) it's too late to report
> > an error to the client because we already committed the local
> > transaction. So the current solution is to have a background process
> > commit the foreign transactions so that the backend can just wait
> > without the possibility of errors.
>
> Whatever process tries to complete a transaction, the client must wait
> for the transaction to end and anyway that's just a freeze in the
> client's view, unless you intended to respond to local commit before
> all participant complete.

Yes, but the point of using a separate process is that even if the FDW
code raises an error, the client waiting for transaction resolution
doesn't get it, and the wait is interruptible.

[1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 13 Oct 2020 11:56:51 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > >
> > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > > > What about temporary network failures? I think there are users who
> > > > > > don't want to give up resolving foreign transactions failed due to a
> > > > > > temporary network failure. Or even they might want to wait for
> > > > > > transaction completion until they send a cancel request. If we want to
> > > > > > call the commit routine only once and therefore want FDW to retry
> > > > > > connecting the foreign server within the call, it means we require all
> > > > > > FDW implementors to write a retry loop code that is interruptible and
> > > > > > ensures not to raise an error, which increases difficulty.
> > > > > >
> > > > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > > > an unreliable network environment, the user might end up requiring
> > > > > > every transaction to check the status of foreign transactions of the
> > > > > > previous distributed transaction before starts. If we allow to do
> > > > > > retry, I guess we ease that somewhat.
> > > > >
> > > > > OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory.  If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
 
> > > >
> > > > I should be missing something, though...
> > > >
> > > > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > > > much. I think remote-commits should be performed before local commit
> > > > passes the point-of-no-return and the v26-0002 actually places
> > > > AtEOXact_FdwXact() before the critical section.
> > > >
> > >
> > > So you're thinking the following sequence?
> > >
> > > 1. Prepare all foreign transactions.
> > > 2. Commit the all prepared foreign transactions.
> > > 3. Commit the local transaction.
> > >
> > > Suppose we have the backend process call the commit routine, what if
> > > one of FDW raises an ERROR during committing the foreign transaction
> > > after committing other foreign transactions? The transaction will end
> > > up with an abort but some foreign transactions are already committed.
> >
> > Ok, I understand what you are aiming at.
> >
> > It is apparently out of the focus of the two-phase commit
> > protocol. Each FDW server can try to keep the contract as far as its
> > ability reaches, but in the end such kind of failure is
> > inevitable. Even if we require FDW developers not to respond until a
> > 2pc-commit succeeds, that just leads the whole FDW-cluster to freeze
> > even not in an extremely bad case.
> >
> > We have no other choices than shutting the server down (then the
> > succeeding server start removes the garbage commits) or continueing
> > working leaving some information in a system storage (or reverting the
> > garbage commits). What we can do in that case is to provide a
> > automated way to resolve the inconsistency.
> >
> > > Also, what if the backend process failed to commit the local
> > > transaction? Since it already committed all foreign transactions it
> > > cannot ensure the global atomicity in this case too. Therefore, I
> > > think we should commit the distributed transactions in the following
> > > sequence:
> >
> > Ditto. It's out of the range of 2pc. Using p2c for local transaction
> > could reduce that kind of failure but I'm not sure. 3pc, 4pc ...npc
> > could reduce the probability but can't elimite failure cases.
> 
> IMO the problems I mentioned arise from the fact that the above
> sequence doesn't really follow the 2pc protocol in the first place.
> 
> We can think of the fact that we commit the local transaction without
> preparation while preparing foreign transactions as that we’re using
> the 2pc with last resource transaction optimization (or last agent
> optimization)[1]. That is, we prepare all foreign transactions first
> and the local node is always the last resource to process. At this
> time, the outcome of the distributed transaction completely depends on
> the fate of the last resource (i.g., the local transaction). If it
> fails, the distributed transaction must be abort by rolling back
> prepared foreign transactions. OTOH, if it succeeds, all prepared
> foreign transaction must be committed. Therefore, we don’t need to
> prepare the last resource and can commit it. In this way, if we want

There are cases of commit-failure of a local transaction caused by
too-many notifications or by serialization failure.

> to commit the local transaction without preparation, the local
> transaction must be committed at last. But since the above sequence
> doesn’t follow this protocol, we will have such problems. I think if
> we follow the 2pc properly, such basic failures don't happen.

True. But I haven't suggested that sequence.

> > > 1. Prepare all foreign transactions.
> > > 2. Commit the local transaction.
> > > 3. Commit the all prepared foreign transactions.
> > >
> > > But this is still not a perfect solution. If we have the backend
> >
> > 2pc is not a perfect solution in the first place. Attaching a similar
> > phase to it cannot make it "perfect".
> >
> > > process call the commit routine and an error happens during executing
> > > the commit routine of an FDW (i.g., at step 3) it's too late to report
> > > an error to the client because we already committed the local
> > > transaction. So the current solution is to have a background process
> > > commit the foreign transactions so that the backend can just wait
> > > without the possibility of errors.
> >
> > Whatever process tries to complete a transaction, the client must wait
> > for the transaction to end and anyway that's just a freeze in the
> > client's view, unless you intended to respond to local commit before
> > all participant complete.
> 
> Yes, but the point of using a separate process is that even if FDW
> code raises an error, the client wanting for transaction resolution
> doesn't get it and it's interruptible.
> 
> [1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html

I don't get the point. If FDW-commit is called in the same process, an
error from FDW-commit outright leads to the failure of the current
commit.  Isn't "the client wanting for transaction resolution" the
client of the leader process of the 2pc-commit in the same-process
model?

I may be missing something, but postgres_fdw allows query cancellation
at commit time. (But I think it depends on timing whether the
remote commit is completed or aborted.)  Perhaps the feature was
introduced after the project started?

> commit ae9bfc5d65123aaa0d1cca9988037489760bdeae
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   Wed Jun 7 15:14:55 2017 -0400
> 
>     postgres_fdw: Allow cancellation of transaction control commands.

I thought that we were discussing FDW errors during the 2pc-commit
phase.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Tue, 13 Oct 2020 11:56:51 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > > > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > > >
> > > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > > > > What about temporary network failures? I think there are users who
> > > > > > > don't want to give up resolving foreign transactions failed due to a
> > > > > > > temporary network failure. Or even they might want to wait for
> > > > > > > transaction completion until they send a cancel request. If we want to
> > > > > > > call the commit routine only once and therefore want FDW to retry
> > > > > > > connecting the foreign server within the call, it means we require all
> > > > > > > FDW implementors to write a retry loop code that is interruptible and
> > > > > > > ensures not to raise an error, which increases difficulty.
> > > > > > >
> > > > > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > > > > an unreliable network environment, the user might end up requiring
> > > > > > > every transaction to check the status of foreign transactions of the
> > > > > > > previous distributed transaction before starts. If we allow to do
> > > > > > > retry, I guess we ease that somewhat.
> > > > > >
> > > > > > OK.  As I said, I'm not against trying to cope with temporary network failure.  I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
> > > > >
> > > > > I should missing something, though...
> > > > >
> > > > > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > > > > much. I think remote-commits should be performed before local commit
> > > > > passes the point-of-no-return and the v26-0002 actually places
> > > > > AtEOXact_FdwXact() before the critical section.
> > > > >
> > > >
> > > > So you're thinking the following sequence?
> > > >
> > > > 1. Prepare all foreign transactions.
> > > > 2. Commit the all prepared foreign transactions.
> > > > 3. Commit the local transaction.
> > > >
> > > > Suppose we have the backend process call the commit routine, what if
> > > > one of FDW raises an ERROR during committing the foreign transaction
> > > > after committing other foreign transactions? The transaction will end
> > > > up with an abort but some foreign transactions are already committed.
> > >
> > > Ok, I understand what you are aiming.
> > >
> > > It is apparently out of the focus of the two-phase commit
> > > protocol. Each FDW server can try to keep the contract as far as its
> > > ability reaches, but in the end such kind of failure is
> > > inevitable. Even if we require FDW developers not to respond until a
> > > 2pc-commit succeeds, that just leads the whole FDW-cluster to freeze
> > > even not in an extremely bad case.
> > >
> > > We have no other choices than shutting the server down (then the
> > > succeeding server start removes the garbage commits) or continueing
> > > working leaving some information in a system storage (or reverting the
> > > garbage commits). What we can do in that case is to provide a
> > > automated way to resolve the inconsistency.
> > >
> > > > Also, what if the backend process failed to commit the local
> > > > transaction? Since it already committed all foreign transactions it
> > > > cannot ensure the global atomicity in this case too. Therefore, I
> > > > think we should commit the distributed transactions in the following
> > > > sequence:
> > >
> > > Ditto. It's out of the range of 2pc. Using p2c for local transaction
> > > could reduce that kind of failure but I'm not sure. 3pc, 4pc ...npc
> > > could reduce the probability but can't elimite failure cases.
> >
> > IMO the problems I mentioned arise from the fact that the above
> > sequence doesn't really follow the 2pc protocol in the first place.
> >
> > We can think of the fact that we commit the local transaction without
> > preparation while preparing foreign transactions as that we’re using
> > the 2pc with last resource transaction optimization (or last agent
> > optimization)[1]. That is, we prepare all foreign transactions first
> > and the local node is always the last resource to process. At this
> > time, the outcome of the distributed transaction completely depends on
> > the fate of the last resource (i.g., the local transaction). If it
> > fails, the distributed transaction must be abort by rolling back
> > prepared foreign transactions. OTOH, if it succeeds, all prepared
> > foreign transaction must be committed. Therefore, we don’t need to
> > prepare the last resource and can commit it. In this way, if we want
>
> There are cases of commit-failure of a local transaction caused by
> too-many notifications or by serialization failure.

Yes, even if that happens we are still able to roll back all foreign
transactions.

>
> > to commit the local transaction without preparation, the local
> > transaction must be committed at last. But since the above sequence
> > doesn’t follow this protocol, we will have such problems. I think if
> > we follow the 2pc properly, such basic failures don't happen.
>
> True. But I haven't suggested that sequence.

Okay, I might have missed your point. Could you elaborate on the idea
you mentioned before, "I think remote-commits should be performed
before local commit passes the point-of-no-return"?

>
> > > > 1. Prepare all foreign transactions.
> > > > 2. Commit the local transaction.
> > > > 3. Commit the all prepared foreign transactions.
> > > >
> > > > But this is still not a perfect solution. If we have the backend
> > >
> > > 2pc is not a perfect solution in the first place. Attaching a similar
> > > phase to it cannot make it "perfect".
> > >
> > > > process call the commit routine and an error happens during executing
> > > > the commit routine of an FDW (i.g., at step 3) it's too late to report
> > > > an error to the client because we already committed the local
> > > > transaction. So the current solution is to have a background process
> > > > commit the foreign transactions so that the backend can just wait
> > > > without the possibility of errors.
> > >
> > > Whatever process tries to complete a transaction, the client must wait
> > > for the transaction to end and anyway that's just a freeze in the
> > > client's view, unless you intended to respond to local commit before
> > > all participant complete.
> >
> > Yes, but the point of using a separate process is that even if FDW
> > code raises an error, the client wanting for transaction resolution
> > doesn't get it and it's interruptible.
> >
> > [1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html
>
> I don't get the point. If FDW-commit is called on the same process, an
> error from FDW-commit outright leads to the failure of the current
> commit.  Isn't "the client wanting for transaction resolution" the
> client of the leader process of the 2pc-commit in the same-process
> model?
>
> I should missing something, but postgres_fdw allows query cancelation
> at commit time. (But I think it is depends on timing whether the
> remote commit is completed or aborted.).  Perhaps the feature was
> introduced after the project started?
>
> > commit ae9bfc5d65123aaa0d1cca9988037489760bdeae
> > Author: Robert Haas <rhaas@postgresql.org>
> > Date:   Wed Jun 7 15:14:55 2017 -0400
> >
> >     postgres_fdw: Allow cancellation of transaction control commands.
>
> I thought that we are discussing on fdw-errors during the 2pc-commit
> phase.
>

Yes, I'm also discussing FDW errors during the 2pc-commit phase,
which happens after committing the local transaction.

Even if FDW-commit raises an error due to the user's cancel request or
for whatever reason while committing the prepared foreign transactions,
it's too late. The client will get an error like "ERROR:  canceling
statement due to user request" and would think the transaction was
aborted, but that's not true: the local transaction is already committed.
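The failure mode described here can be sketched as a small simulation (illustrative Python, not PostgreSQL code; all names are invented for illustration): the local commit is a point of no return, so a later error while committing the prepared foreign transactions reaches the client even though the local transaction is already durable.

```python
# Illustrative sketch of the problem above: an ERROR raised while committing
# prepared foreign transactions, after the local commit, misleads the client.
# All names here are invented for illustration.

log = []

def commit_local():
    # Point of no return: the local transaction is durably committed.
    log.append("local committed")

def commit_prepared_foreign(cancel_requested):
    if cancel_requested:
        # Mimics "ERROR:  canceling statement due to user request".
        raise RuntimeError("canceling statement due to user request")
    log.append("foreign committed")

client_saw_error = False
commit_local()
try:
    # A cancel request arrives while resolving the foreign transactions.
    commit_prepared_foreign(cancel_requested=True)
except RuntimeError:
    client_saw_error = True  # the client now believes the transaction aborted

# Yet the local transaction is already committed, contradicting that belief.
```

The client's view ("aborted") and the actual state ("local committed") disagree, which is exactly the misleading outcome being discussed.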

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > There are cases of commit-failure of a local transaction caused by
> > too-many notifications or by serialization failure.
> 
> Yes, even if that happens we are still able to rollback all foreign
> transactions.

Mmm. I'm confused. If this is about the 2pc commit-request (or prepare)
phase, we can roll back the remote transactions. But I think we're
focusing on the 2pc-commit phase: remote transactions that have already
been 2pc-committed can no longer be rolled back.

> > > to commit the local transaction without preparation, the local
> > > transaction must be committed at last. But since the above sequence
> > > doesn’t follow this protocol, we will have such problems. I think if
> > > we follow the 2pc properly, such basic failures don't happen.
> >
> > True. But I haven't suggested that sequence.
> 
> Okay, I might have missed your point. Could you elaborate on the idea
> you mentioned before, "I think remote-commits should be performed
> before local commit passes the point-of-no-return"?

It is simply the condition under which we can ERROR-out from
CommitTransaction. I thought that when you said "we cannot
ERROR-out" you meant "since that is raised to FATAL", but it seems to
me that both of you are looking at another aspect.

If the aspect is "how to complete the all-prepared 2pc transaction
at all costs", I'd say "there's a fundamental limitation".  Although
I'm not sure what exactly you mean by prohibiting errors from FDW
routines, if that means "the API can fail, but must not raise an
exception", that policy is enforced by entering a critical
section. However, if it means "the API mustn't fail", that cannot be
realized, I believe.

> > I thought that we are discussing on fdw-errors during the 2pc-commit
> > phase.
> >
> 
> Yes, I'm also discussing on fdw-errors during the 2pc-commit phase
> that happens after committing the local transaction.
> 
> Even if FDW-commit raises an error due to the user's cancel request or
> whatever reason during committing the prepared foreign transactions,
> it's too late. The client will get an error like "ERROR:  canceling
> statement due to user request" and would think the transaction is
> aborted but it's not true, the local transaction is already committed.

By the way, I found that I misread the patch. In v26-0002,
AtEOXact_FdwXact() is actually called after the
point-of-no-return. What is the reason for that placement?  We can
error-out before changing the state to TRANS_COMMIT.

And if any of the remotes ends with a 2pc-commit (not prepare-phase)
failure, the consistency of the commit is no longer guaranteed, so we
have no choice other than shutting down the server or continuing to run
while allowing the inconsistency.  What do we want in that case?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > There are cases of commit-failure of a local transaction caused by
> > > too-many notifications or by serialization failure.
> >
> > Yes, even if that happens we are still able to rollback all foreign
> > transactions.
>
> Mmm. I'm confused. If this is about 2pc-commit-request(or prepare)
> phase, we can rollback the remote transactions. But I think we're
> focusing 2pc-commit phase. remote transaction that has already
> 2pc-committed, they can be no longer rollback'ed.

You meant a failure of the local commit, right? With the current
approach, we prepare all foreign transactions first and then commit
the local transaction. After committing the local transaction, we
commit the prepared foreign transactions. So if a serialization
failure happens while committing the local transaction, we are still
able to roll back the foreign transactions. The check for serialization
failure of the foreign transactions has already been done at the
prepare phase.
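As a rough model of this "last resource" ordering (a sketch in Python, not the patch's code; all class and function names are assumptions for illustration), preparing the foreign transactions first means a local-commit failure still leaves every participant abortable:

```python
# Sketch of the sequence above: prepare all foreign transactions, commit the
# local transaction last, and roll back all prepared foreign transactions if
# the local commit fails.  Names are illustrative only.

class Foreign:
    def __init__(self, name):
        self.name = name
        self.state = "active"

    def prepare(self):
        self.state = "prepared"

    def commit_prepared(self):
        self.state = "committed"

    def rollback_prepared(self):
        self.state = "aborted"

def commit_distributed(foreigns, local_commit):
    for f in foreigns:
        f.prepare()                 # phase 1 on the foreign servers
    try:
        local_commit()              # the local transaction is the last resource
    except Exception:
        for f in foreigns:
            f.rollback_prepared()   # nothing committed yet, so this is safe
        raise
    for f in foreigns:
        f.commit_prepared()         # phase 2

def failing_local_commit():
    # e.g. a serialization failure detected at local commit time
    raise RuntimeError("serialization failure")

servers = [Foreign("s1"), Foreign("s2")]
aborted = False
try:
    commit_distributed(servers, failing_local_commit)
except RuntimeError:
    aborted = True
```

Because the outcome of the whole distributed transaction hinges on the local (last) commit, no foreign server has committed before that point, so atomicity survives a local failure.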

>
> > > > to commit the local transaction without preparation, the local
> > > > transaction must be committed at last. But since the above sequence
> > > > doesn’t follow this protocol, we will have such problems. I think if
> > > > we follow the 2pc properly, such basic failures don't happen.
> > >
> > > True. But I haven't suggested that sequence.
> >
> > Okay, I might have missed your point. Could you elaborate on the idea
> > you mentioned before, "I think remote-commits should be performed
> > before local commit passes the point-of-no-return"?
>
> It is simply the condition that we can ERROR-out from
> CommitTransaction. I thought that when you say like "we cannot
> ERROR-out" you meant "since that is raised to FATAL", but it seems to
> me that both of you are looking another aspect.
>
> If the aspect is "what to do complete the all-prepared p2c transaction
> at all costs", I'd say "there's a fundamental limitaion".  Although
> I'm not sure what you mean exactly by prohibiting errors from fdw
> routines , if that meant "the API can fail, but must not raise an
> exception", that policy is enforced by setting a critical
> section. However, if it were "the API mustn't fail", that cannot be
> realized, I believe.

When I say "we cannot error-out" I mean it's too late. What I'd like
to prevent is the backend process returning an error to the client
after committing the local transaction, because that will mislead the
user.

>
> > > I thought that we are discussing on fdw-errors during the 2pc-commit
> > > phase.
> > >
> >
> > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase
> > that happens after committing the local transaction.
> >
> > Even if FDW-commit raises an error due to the user's cancel request or
> > whatever reason during committing the prepared foreign transactions,
> > it's too late. The client will get an error like "ERROR:  canceling
> > statement due to user request" and would think the transaction is
> > aborted but it's not true, the local transaction is already committed.
>
> By the way I found that I misread the patch. in v26-0002,
> AtEOXact_FdwXact() is actually called after the
> point-of-no-return. What is the reason for the place?  We can
> error-out before changing the state to TRANS_COMMIT.
>

Are you referring to
v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If
so, the patch doesn't implement 2pc. I think we can commit the foreign
transaction before changing the state to TRANS_COMMIT, but in any case
it cannot ensure atomic commit. It just adds both commit and rollback
transaction APIs so that FDW can control transactions by using these
APIs, not via XactCallback.

> And if any of the remotes ended with 2pc-commit (not prepare phase)
> failure, consistency of the commit is no longer guaranteed so we have
> no choice other than shutting down the server, or continuing running
> allowing the incosistency.  What do we want in that case?

I think it depends on the failure. If the 2pc-commit failed due to a
network connection failure or a server crash, we would need to try
again later. We normally expect the prepared transaction can be
committed with no issue, but in case it cannot, I think we can leave
the choice to the user: resolve it manually after recovery, give up,
etc.
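That retry-then-leave-it-to-the-user policy could look like the following sketch (illustrative Python; the function names, retry count, and failure model are assumptions, not the patch's API):

```python
# Sketch of a resolver retrying a prepared-transaction commit on transient
# failures, then handing the unresolved transaction back to the user.
# All names and parameters here are invented for illustration.

import time

def resolve_with_retry(commit_fn, max_retries=3, delay=0.0):
    last_error = None
    for attempt in range(max_retries):
        try:
            commit_fn()
            return "committed"
        except ConnectionError as e:   # transient: worth retrying later
            last_error = e
            time.sleep(delay)
    # Out of retries: leave the choice to the user (resolve manually
    # after recovery, give up, etc.).
    return f"unresolved: {last_error}"

attempts = {"n": 0}

def flaky_commit():
    # Fails twice with a temporary network error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary network failure")

result = resolve_with_retry(flaky_commit)
```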

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
(v26 fails on the current master)

At Wed, 14 Oct 2020 13:52:49 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > > There are cases of commit-failure of a local transaction caused by
> > > > too-many notifications or by serialization failure.
> > >
> > > Yes, even if that happens we are still able to rollback all foreign
> > > transactions.
> >
> > Mmm. I'm confused. If this is about 2pc-commit-request(or prepare)
> > phase, we can rollback the remote transactions. But I think we're
> > focusing 2pc-commit phase. remote transaction that has already
> > 2pc-committed, they can be no longer rollback'ed.
> 
> Did you mention a failure of local commit, right? With the current
> approach, we prepare all foreign transactions first and then commit
> the local transaction. After committing the local transaction we
> commit the prepared foreign transactions. So suppose a serialization
> failure happens during committing the local transaction, we still are
> able to roll back foreign transactions. The check of serialization
> failure of the foreign transactions has already been done at the
> prepare phase.

Understood.

> > > > > to commit the local transaction without preparation, the local
> > > > > transaction must be committed at last. But since the above sequence
> > > > > doesn’t follow this protocol, we will have such problems. I think if
> > > > > we follow the 2pc properly, such basic failures don't happen.
> > > >
> > > > True. But I haven't suggested that sequence.
> > >
> > > Okay, I might have missed your point. Could you elaborate on the idea
> > > you mentioned before, "I think remote-commits should be performed
> > > before local commit passes the point-of-no-return"?
> >
> > It is simply the condition that we can ERROR-out from
> > CommitTransaction. I thought that when you say like "we cannot
> > ERROR-out" you meant "since that is raised to FATAL", but it seems to
> > me that both of you are looking another aspect.
> >
> > If the aspect is "what to do complete the all-prepared p2c transaction
> > at all costs", I'd say "there's a fundamental limitaion".  Although
> > I'm not sure what you mean exactly by prohibiting errors from fdw
> > routines , if that meant "the API can fail, but must not raise an
> > exception", that policy is enforced by setting a critical
> > section. However, if it were "the API mustn't fail", that cannot be
> > realized, I believe.
> 
> When I say "we cannot error-out" it means it's too late. What I'd like
> to prevent is that the backend process returns an error to the client
> after committing the local transaction. Because it will mislead the
> user.

Anyway, we don't do anything that can fail after changing the state to
TRANS_COMMIT. So we cannot run the fdw-2pc-commit after that, since it
cannot be failure-proof. If we do it before that point, we cannot
ERROR-out after the local commit completes.
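The ordering constraint can be sketched as a toy model of the state machine (illustrative Python, not xact.c; the state names mirror the discussion but the code is an assumption):

```python
# Sketch of the ordering above: all fail-able work (including any in-process
# FDW commit work) runs before the state is switched to TRANS_COMMIT; once
# switched, nothing may fail.  This is a toy model, not PostgreSQL code.

state = {"trans": "TRANS_INPROGRESS", "client_result": None}

def commit_transaction(st, fdw_commit_hook):
    try:
        fdw_commit_hook()               # fail-able work runs first
    except Exception:
        st["client_result"] = "abort"   # still safe to report an ERROR here
        raise
    st["trans"] = "TRANS_COMMIT"        # point of no return
    st["client_result"] = "commit"      # after this, nothing may fail

def failing_hook():
    raise RuntimeError("fdw commit failed")

# Failure path: the error surfaces before the point of no return.
failed = dict(state)
try:
    commit_transaction(failed, failing_hook)
except RuntimeError:
    pass

# Success path: the state only becomes TRANS_COMMIT once nothing can fail.
succeeded = dict(state)
commit_transaction(succeeded, lambda: None)
```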

> > > > I thought that we are discussing on fdw-errors during the 2pc-commit
> > > > phase.
> > > >
> > >
> > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase
> > > that happens after committing the local transaction.
> > >
> > > Even if FDW-commit raises an error due to the user's cancel request or
> > > whatever reason during committing the prepared foreign transactions,
> > > it's too late. The client will get an error like "ERROR:  canceling
> > > statement due to user request" and would think the transaction is
> > > aborted but it's not true, the local transaction is already committed.
> >
> > By the way I found that I misread the patch. in v26-0002,
> > AtEOXact_FdwXact() is actually called after the
> > point-of-no-return. What is the reason for the place?  We can
> > error-out before changing the state to TRANS_COMMIT.
> >
> 
> Are you referring to
> v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If
> so, the patch doesn't implement 2pc. I think we can commit the foreign

Ah, I guessed that the trigger points of PREPARE and COMMIT that are
inserted by 0002 won't be moved by the following patches. So the
direction of my discussion doesn't change because of that.

> transaction before changing the state to TRANS_COMMIT but in any case
> it cannot ensure atomic commit. It just adds both commit and rollback

I guess that you have the local-commit-failure case in mind? Couldn't
we internally prepare the local transaction and then follow the correct
2pc protocol involving the local transaction? (I'm looking at v26-0008.)

> transaction APIs so that FDW can control transactions by using these
> API, not by XactCallback.

> > And if any of the remotes ended with 2pc-commit (not prepare phase)
> > failure, consistency of the commit is no longer guaranteed so we have
> > no choice other than shutting down the server, or continuing running
> > allowing the incosistency.  What do we want in that case?
> 
> I think it depends on the failure. If 2pc-commit failed due to network
> connection failure or the server crash, we would need to try again
> later. We normally expect the prepared transaction is able to be
> committed with no issue but in case it could not, I think we can leave
> the choice for the user: resolve it manually after recovered, give up
> etc.

Understood.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 14 Oct 2020 at 17:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> (v26 fails on the current master)

Thanks, I'll update the patch.

>
> At Wed, 14 Oct 2020 13:52:49 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in
> > > > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > > > There are cases of commit-failure of a local transaction caused by
> > > > > too-many notifications or by serialization failure.
> > > >
> > > > Yes, even if that happens we are still able to rollback all foreign
> > > > transactions.
> > >
> > > Mmm. I'm confused. If this is about 2pc-commit-request(or prepare)
> > > phase, we can rollback the remote transactions. But I think we're
> > > focusing 2pc-commit phase. remote transaction that has already
> > > 2pc-committed, they can be no longer rollback'ed.
> >
> > Did you mention a failure of local commit, right? With the current
> > approach, we prepare all foreign transactions first and then commit
> > the local transaction. After committing the local transaction we
> > commit the prepared foreign transactions. So suppose a serialization
> > failure happens during committing the local transaction, we still are
> > able to roll back foreign transactions. The check of serialization
> > failure of the foreign transactions has already been done at the
> > prepare phase.
>
> Understood.
>
> > > > > > to commit the local transaction without preparation, the local
> > > > > > transaction must be committed at last. But since the above sequence
> > > > > > doesn’t follow this protocol, we will have such problems. I think if
> > > > > > we follow the 2pc properly, such basic failures don't happen.
> > > > >
> > > > > True. But I haven't suggested that sequence.
> > > >
> > > > Okay, I might have missed your point. Could you elaborate on the idea
> > > > you mentioned before, "I think remote-commits should be performed
> > > > before local commit passes the point-of-no-return"?
> > >
> > > It is simply the condition that we can ERROR-out from
> > > CommitTransaction. I thought that when you say like "we cannot
> > > ERROR-out" you meant "since that is raised to FATAL", but it seems to
> > > me that both of you are looking another aspect.
> > >
> > > If the aspect is "what to do complete the all-prepared p2c transaction
> > > at all costs", I'd say "there's a fundamental limitaion".  Although
> > > I'm not sure what you mean exactly by prohibiting errors from fdw
> > > routines , if that meant "the API can fail, but must not raise an
> > > exception", that policy is enforced by setting a critical
> > > section. However, if it were "the API mustn't fail", that cannot be
> > > realized, I believe.
> >
> > When I say "we cannot error-out" it means it's too late. What I'd like
> > to prevent is that the backend process returns an error to the client
> > after committing the local transaction. Because it will mislead the
> > user.
>
> Anyway we don't do anything that can fail after changing state to
> TRANS_COMMIT. So we cannot run fdw-2pc-commit after that since it
> cannot be failure-proof. if we do them before the point we cannot
> ERROR-out after local commit completes.
>
> > > > > I thought that we are discussing on fdw-errors during the 2pc-commit
> > > > > phase.
> > > > >
> > > >
> > > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase
> > > > that happens after committing the local transaction.
> > > >
> > > > Even if FDW-commit raises an error due to the user's cancel request or
> > > > whatever reason during committing the prepared foreign transactions,
> > > > it's too late. The client will get an error like "ERROR:  canceling
> > > > statement due to user request" and would think the transaction is
> > > > aborted but it's not true, the local transaction is already committed.
> > >
> > > By the way I found that I misread the patch. in v26-0002,
> > > AtEOXact_FdwXact() is actually called after the
> > > point-of-no-return. What is the reason for the place?  We can
> > > error-out before changing the state to TRANS_COMMIT.
> > >
> >
> > Are you referring to
> > v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If
> > so, the patch doesn't implement 2pc. I think we can commit the foreign
>
> Ah, I guessed that the trigger points of PREPARE and COMMIT that are
> inserted by 0002 won't be moved by the following patches. So the
> direction of my discussion doesn't change by the fact.
>
> > transaction before changing the state to TRANS_COMMIT but in any case
> > it cannot ensure atomic commit. It just adds both commit and rollback
>
> I guess that you have the local-commit-failure case in mind? Couldn't
> we internally prepare the local transaction then following the correct
> p2c protocol involving the local transaction? (I'm looking v26-0008)

Yes, we could. But as I mentioned before, if we always commit the local
transaction last, we don't necessarily need to prepare the local
transaction. If we prepared the local transaction, I think we would be
able to allow the FDW's commit routine to raise an error even during
2pc-commit, but only until the first commit. Once we have committed any
one of the involved transactions, including the local transaction and
foreign transactions, the commit routine must not raise an error during
2pc-commit for the same reason; it's too late.
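
The ordering rule above can be sketched as follows. This is a hypothetical illustration (the names and structure are invented, not code from the patch): an FDW's commit routine may raise an error only until the first participant has committed; after that, a failure must not abort the transaction and can only leave that participant in doubt, to be resolved later.

```python
# Sketch of the "errors only before the first commit" rule discussed above.
# Not the patch's actual code; names are invented for illustration.

def commit_all(participants):
    """participants: list of (name, commit_fn) pairs, with the local
    transaction placed last.  Returns (committed, in_doubt) lists of names."""
    committed, in_doubt = [], []
    for name, commit_fn in participants:
        try:
            commit_fn()
            committed.append(name)
        except Exception:
            if not committed:
                raise               # nothing committed yet: safe to ERROR out
            in_doubt.append(name)   # too late: schedule for later resolution
    return committed, in_doubt
```

In this sketch a failure of the very first COMMIT PREPARED still propagates as an error, while any later failure is absorbed and the server is merely marked for later resolution, matching the "point of no return" argument above.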

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 12 Oct 2020 at 17:19, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > I was thinking to have a GUC timeout parameter like statement_timeout.
> > The backend waits for the setting value when resolving foreign
> > transactions.
>
> Me too.
>
>
> > But this idea seems different. FDW can set its timeout
> > via a transaction timeout API, is that right?
>
> I'm not perfectly sure how the TM (application server) works, but
> probably no.  The TM has a configuration parameter for transaction
> timeout, and the TM calls XAResource.setTransactionTimeout() with that
> or a smaller value as the argument.
>
>
> > But even if FDW can set
> > the timeout using a transaction timeout API, the problem that client
> > libraries for some DBMS don't support interruptible functions still
> > remains. The user can set a short time to the timeout but it also
> > leads to unnecessary timeouts. Thoughts?
>
> Unfortunately, I'm afraid we can do nothing about it.  If the DBMS's client library doesn't support cancellation
> (e.g. doesn't respond to Ctrl+C or provide a function that cancels processing in progress), then the Postgres user just
> finds that he can't cancel queries (just like we experienced with odbc_fdw.)

So the idea of using another process to commit prepared foreign
transactions seems better also in terms of this point. Even if a DBMS
client library doesn't support query cancellation, the transaction
commit can return control to the client when the user presses Ctrl-C,
as the backend process is just sleeping using WaitLatch() (it's
similar to synchronous replication).
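
A rough pure-Python analogue of the behavior described above (not PostgreSQL code; the names are invented): the backend hands commit of the prepared foreign transactions to a separate resolver process and sleeps in an interruptible wait, so a cancel request returns control to the client even when the resolver is stuck on an unresponsive server, much like a backend waiting in WaitLatch() for synchronous replication.

```python
# Simulated backend-side wait, analogous to a WaitLatch() loop with
# CHECK_FOR_INTERRUPTS(); not actual PostgreSQL code.
import threading

def wait_for_resolver(resolved: threading.Event,
                      canceled: threading.Event,
                      tick: float = 0.01) -> str:
    """Wait in short slices, rechecking for cancellation between them."""
    while True:
        if resolved.wait(timeout=tick):
            return "committed"      # the resolver finished the 2pc-commit
        if canceled.is_set():       # e.g. the user pressed Ctrl-C
            return "canceled"       # backend regains control; resolver carries on
```

The key property is that the backend itself never calls the (possibly uninterruptible) client library; it only sleeps and wakes, so cancellation works regardless of what the FDW's driver supports.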

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > Unfortunately, I'm afraid we can do nothing about it.  If the DBMS's client
> library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a
> function that cancels processing in progress), then the Postgres user just finds
> that he can't cancel queries (just like we experienced with odbc_fdw.)
> 
> So the idea of using another process to commit prepared foreign
> transactions seems better also in terms of this point. Even if a DBMS
> client library doesn’t support query cancellation, the transaction
> commit can return control to the client when the user presses Ctrl-C
> as the backend process is just sleeping using WaitLatch() (it’s
> similar to synchronous replication)

I have to say that's nitpicking.  I believe almost nobody does, or cares about, canceling commits, at the expense of
impractical performance due to non-parallelism, serial execution in each resolver, and context switches.
 

Also, FDW is not cancellable in general.  It makes no sense to care only about commit.

(Fortunately, postgres_fdw is cancellable anyway.)


Regards
Takayuki Tsunakawa




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 19 Oct 2020 at 14:39, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > Unfortunately, I'm afraid we can do nothing about it.  If the DBMS's client
> > library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a
> > function that cancels processing in progress), then the Postgres user just finds
> > that he can't cancel queries (just like we experienced with odbc_fdw.)
> >
> > So the idea of using another process to commit prepared foreign
> > transactions seems better also in terms of this point. Even if a DBMS
> > client library doesn’t support query cancellation, the transaction
> > commit can return control to the client when the user presses Ctrl-C
> > as the backend process is just sleeping using WaitLatch() (it’s
> > similar to synchronous replication)
>
> I have to say that's nitpicking.  I believe almost nobody does, or cares about, canceling commits,

Really? I don't think so. I think it's terrible that the query gets
stuck for a long time and we cannot do anything other than wait until a
crashed foreign server is restored. We can have a timeout, but I don't
think every user wants to use the timeout, or the user might want to
set the timeout to a relatively large value out of concern about
misdetection. I guess synchronous replication had similar concerns, so
it has a similar mechanism.

> at the expense of impractical performance due to non-parallelism, serial execution in each resolver, and context
> switches.

I have never said that we're going to live with serial execution in
each resolver and non-parallelism. I've been repeatedly saying that it
would be possible to improve this feature over the releases to
get good performance even if we use a separate background process.
Using a background process to commit is the only option to support
interruptible foreign transaction resolution for now, whereas there are
some ideas for performance improvements. I think we don't have enough
discussion on how we can improve the idea of using a separate process,
how much performance will improve, and how feasible it is. It's not
too late to reject that idea after the discussion.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> On Mon, 19 Oct 2020 at 14:39, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > I have to say that's nitpicking.  I believe almost nobody does, or cares about,
> canceling commits,
> 
> Really? I don’t think so. I think It’s terrible that the query gets
> stuck for a long time and we cannot do anything than waiting until a
> crashed foreign server is restored. We can have a timeout but I don’t
> think every user wants to use the timeout or the user might want to
> set a timeout to a relatively large value by the concern of
> misdetection. I guess synchronous replication had similar concerns so
> it has a similar mechanism.

Really.  I thought we were talking about canceling commits with Ctrl + C as you referred to, right?  I couldn't imagine,
in production environments where many sessions are running transactions concurrently, how the user (DBA) wants to and can
cancel each stuck session during commit one by one with Ctrl + C by hand.  I haven't seen such a feature exist or be
considered crucial that enables the user (administrator) to cancel running processing with Ctrl + C from the side.
 

Rather, setting an appropriate timeout is the current sound system design, isn't it?  It spans many areas - TCP/IP,
heartbeats of load balancers and clustering software, requests and responses to application servers and database servers,
etc.  I sympathize with your concern that users may not be confident about their settings.  But that's the current
practice, unfortunately.
 


> > at the expense of impractical performance due to non-parallelism, serial
> execution in each resolver,  and context switches.
> 
> I have never said that we’re going to live with serial execution in
> each resolver and non-parallelism. I've been repeatedly saying that it
> would be possible that we improve this feature over the releases to
> get a good performance even if we use a separate background process.

IIRC, I haven't seen a reasonable design based on a separate process that handles commits during normal operation.
What I heard is to launch as many resolvers as the client sessions, but that consumes too many resources, as I said.
 


> Using a background process to commit is the only option to support
> interruptible foreign transaction resolution for now whereas there are
> some ideas for performance improvements.

A practical solution is the timeout for the FDW in general, as in application servers.  postgres_fdw can benefit from
Ctrl + C as well.
 


> I think we don't have enough
> discussion on how we can improve the idea of using a separate process
> and how much performance will improve and how possible it is. It's not
> late to reject that idea after the discussion.

Yeah, I agree that discussion is not enough yet.  In other words, the design has not reached the quality for the first
release yet.  We should try to avoid using "Hopefully, we should be able to improve it in the next release (I haven't
seen the design in light, though)" as an excuse for getting a half-baked patch committed that does not offer practical
quality.  I saw many developers' patches rejected because of insufficient performance, e.g. even a 0.8% performance
impact.  (I'm one of those developers, actually...)  I have been feeling this community is rigorous about performance.
We have to be sincere.
 

Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Ashutosh Bapat
Date:
On Mon, Oct 19, 2020 at 2:37 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> Really.  I thought we were talking about canceling commits with Ctrl + C as you referred to, right?  I couldn't imagine,
> in production environments where many sessions are running transactions concurrently, how the user (DBA) wants to and can
> cancel each stuck session during commit one by one with Ctrl + C by hand.  I haven't seen such a feature exist or be
> considered crucial that enables the user (administrator) to cancel running processing with Ctrl + C from the side.

Using pg_cancel_backend() and pg_terminate_backend() a DBA can cancel
running query from any backend or terminate a backend. For either to
work the backend needs to be interruptible. IIRC, Robert had made an
effort to make postgres_fdw interruptible few years back.

--
Best Wishes,
Ashutosh Bapat



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, 19 Oct 2020 at 20:37, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Mon, Oct 19, 2020 at 2:37 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > Really.  I thought we were talking about canceling commits with Ctrl + C as you referred to, right?  I couldn't
> > imagine, in production environments where many sessions are running transactions concurrently, how the user (DBA)
> > wants to and can cancel each stuck session during commit one by one with Ctrl + C by hand.  I haven't seen such a
> > feature exist or be considered crucial that enables the user (administrator) to cancel running processing with
> > Ctrl + C from the side.
>
> Using pg_cancel_backend() and pg_terminate_backend() a DBA can cancel
> running query from any backend or terminate a backend. For either to
> work the backend needs to be interruptible. IIRC, Robert had made an
> effort to make postgres_fdw interruptible few years back.

Right. Also, we discussed having a timeout on the core side, but I'm
concerned that the timeout also might not work if it's not
interruptible.

While using the timeout is a good idea, I have to think there is also
a certain number of users who don't use this timeout, just as there is
a certain number of users who don't use timeouts such as
statement_timeout. We must not ignore such users, and it might not be
advisable to design a feature that ignores them.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> Using pg_cancel_backend() and pg_terminate_backend() a DBA can cancel
> running query from any backend or terminate a backend. For either to
> work the backend needs to be interruptible. IIRC, Robert had made an
> effort to make postgres_fdw interruptible few years back.

Yeah, I know those functions.  Sawada-san was talking about Ctrl + C, so I responded accordingly.

Also, how can the DBA find sessions to run those functions against?  Can he tell if a session is connected to or
running SQL against a given foreign server?  Can he terminate or cancel, with one SQL command, all sessions that are
stuck accessing a particular foreign server?
 

Furthermore, FDW is not cancellable in general.  So, I don't see a point in trying hard to make only commit be
cancelable.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 20 Oct 2020 02:44:09 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in 
> From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > Using pg_cancel_backend() and pg_terminate_backend() a DBA can cancel
> > running query from any backend or terminate a backend. For either to
> > work the backend needs to be interruptible. IIRC, Robert had made an
> > effort to make postgres_fdw interruptible few years back.
> 
> Yeah, I know those functions.  Sawada-san was talking about Ctrl + C, so I responded accordingly.
> 
> Also, how can the DBA find sessions to run those functions against?  Can he tell if a session is connected to or
> running SQL against a given foreign server?  Can he terminate or cancel, with one SQL command, all sessions that are
> stuck accessing a particular foreign server?
 

I don't think the inability to cancel all sessions at once can be a
reason not to allow operators to cancel a stuck session.

> Furthermore, FDW is not cancellable in general.  So, I don't see a point in trying hard to make only commit be
> cancelable.

I think that it is quite important that operators can cancel any
process that has been stuck for a long time. Furthermore, postgres_fdw
is more likely to get stuck since the network is involved, so the
usefulness of that feature would be higher.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> I don't think the inability to cancel all sessions at once can be a
> reason not to allow operators to cancel a stuck session.

Yeah, I didn't mean to discount the ability to cancel queries.  I just want to confirm how the user can use the
cancellation in practice.  I didn't see how the user can use the cancellation in the FDW framework, so I asked about it.
We have to think about the user's context if we regard canceling commits as important.


> > Furthermore, FDW is not cancellable in general.  So, I don't see a point in
> trying hard to make only commit be cancelable.
>
> I think that it is quite important that operators can cancel any
> process that has been stuck for a long time. Furthermore, postgres_fdw
> is more likely to be stuck since network is involved so the usefulness
> of that feature would be higher.

But that usefulness is lower than that of practical performance during normal operation.

BTW, speaking of the network, how can postgres_fdw respond quickly to a cancel request when libpq is waiting for a reply
from a down foreign server?  Can the user continue to use that session after cancellation?


Regards
Takayuki Tsunakawa




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 20 Oct 2020 at 13:23, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > I don't think the inability to cancel all sessions at once can be a
> > reason not to allow operators to cancel a stuck session.
>
> Yeah, I didn't mean to discount the ability to cancel queries.  I just want to confirm how the user can use the
> cancellation in practice.  I didn't see how the user can use the cancellation in the FDW framework, so I asked about
> it.  We have to think about the user's context if we regard canceling commits as important.
>

I think it doesn't matter whether it is in the FDW framework or not. The
user normally doesn't care which backend processes are connecting to
foreign servers. They will attempt to cancel the query as always if they
realize that a backend is stuck. There are surely plenty of users
who use query cancellation.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Tue, 20 Oct 2020 at 13:23, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > > I don't think the inability to cancel all sessions at once can be a
> > > reason not to allow operators to cancel a stuck session.
> >
> > Yeah, I didn't mean to discount the ability to cancel queries.  I just want to confirm how the user can use the
> > cancellation in practice.  I didn't see how the user can use the cancellation in the FDW framework, so I asked about
> > it.  We have to think about the user's context if we regard canceling commits as important.
 
> >
> 
> I think it doesn't matter whether it is in the FDW framework or not. The
> user normally doesn't care which backend processes are connecting to
> foreign servers. They will attempt to cancel the query as always if they
> realize that a backend is stuck. There are surely plenty of users
> who use query cancellation.

The most serious impact of the inability to cancel a query on a
certain session is that a server restart is required to end such a
session.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 20 Oct 2020 04:23:12 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in 
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > > Furthermore, FDW is not cancellable in general.  So, I don't see a point in
> > trying hard to make only commit be cancelable.
> > 
> > I think that it is quite important that operators can cancel any
> > process that has been stuck for a long time. Furthermore, postgres_fdw
> > is more likely to be stuck since network is involved so the usefulness
> > of that feature would be higher.
> 
> But lower than practical performance during normal operation.
> 
> BTW, speaking of the network, how can postgres_fdw respond quickly to a cancel request when libpq is waiting for a
> reply from a down foreign server?  Can the user continue to use that session after cancellation?
 

It seems to respond to a statement-cancel signal immediately while
waiting for an incoming byte.  However, it seems to wait forever while
waiting for space in the send buffer. (Does that mean the session will
be stuck if it sends a large chunk of bytes while the network is down?)

After receiving a signal, it closes the problematic connection. So the
local session is usable after that, but the failed remote sessions are
closed and new ones are created at the next use.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote in
> > I think it doesn't matter whether it is in the FDW framework or not. The
> > user normally doesn't care which backend processes are connecting to
> > foreign servers. They will attempt to cancel the query as always if they
> > realize that a backend is stuck. There are surely plenty of users
> > who use query cancellation.
>
> The most serious impact of the inability to cancel a query on a
> certain session is that a server restart is required to end such a
> session.

OK, as I may be repeating myself, I didn't deny the need for cancellation.  Let's organize the argument.

* FDW in general
My understanding is that the FDW feature does not stipulate anything about cancellation.  In fact, odbc_fdw was
uncancelable. What do we do about this? 

* postgres_fdw
Fortunately, it is (should be?) cancelable whatever method we choose for 2PC.  So no problem.
But is it really cancellable now?  What if the libpq call is waiting for a response when the foreign server or network is
down?

"Inability to cancel requires a database server restart" feels a bit exaggerated, as libpq has tcp_keepalive* and
tcp_user_timeout connection parameters, and even without setting them, the TCP timeout works.


Regards
Takayuki Tsunakawa




RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> It seems to respond to a statement-cancel signal immediately while
> waiting for an incoming byte.  However, it seems to wait forever while
> waiting for space in the send buffer. (Does that mean the session will be
> stuck if it sends a large chunk of bytes while the network is down?)

What part makes you worried about that?  libpq's send processing?

I've just examined pgfdw_cancel_query(), too.  As below, it uses a hidden 30-second timeout.  After all, postgres_fdw
also relies on a timeout already.

    /*
     * If it takes too long to cancel the query and discard the result, assume
     * the connection is dead.
     */
    endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 30000);


> After receiving a signal, it closes the problematic connection. So the
> local session is usable after that, but the failed remote sessions are
> closed and new ones are created at the next use.

I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong place?

                    /*
                     * If connection is already unsalvageable, don't touch it
                     * further.
                     */
                    if (entry->changing_xact_state)
                        break;


Regards
Takayuki Tsunakawa




Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 20 Oct 2020 at 16:54, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote in
> > > I think it doesn't matter whether in FDW framework or not. The user
> > > normally doesn't care which backend processes connecting to foreign
> > > servers. They will attempt to cancel the query like always if they
> > > realized that a backend gets stuck. There are surely plenty of users
> > > who use query cancellation.
> >
> > The most serious impact from inability of canceling a query on a
> > certain session is that server-restart is required to end such a
> > session.
>
> OK, as I may be repeating, I didn't deny the need for cancellation.

So what's your opinion?

> Let''s organize the argument.
>
> * FDW in general
> My understanding is that the FDW feature does not stipulate anything about cancellation.  In fact, odbc_fdw was
> uncancelable.  What do we do about this?
>
> * postgres_fdw
> Fortunately, it is (should be?) cancelable whatever method we choose for 2PC.  So no problem.
> But is it really cancellable now?  What if the libpq call is waiting for a response when the foreign server or network
> is down?

I don't think we need to stipulate the query cancellation. Anyway, I
guess neither the fact that we don't stipulate anything about query
cancellation now nor the fact that postgres_fdw might not be cancellable
in some situations now is a reason for not supporting query
cancellation. If it's a desirable behavior and users want it, we need
to put in the effort to support it as much as possible, as we've done in
postgres_fdw.  Some FDWs unfortunately might not be able to support it
by their own functionality alone, but it would be good if we can achieve
that by a combination of PostgreSQL and FDW plugins.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, 20 Oct 2020 at 17:56, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > It seems to respond to a statement-cancel signal immediately while
> > waiting for a coming byte.  However, seems to wait forever while
> > waiting a space in send-buffer. (Is that mean the session will be
> > stuck if it sends a large chunk of bytes while the network is down?)
>
> What part makes you worried about that?  libpq's send processing?
>
> I've just examined pgfdw_cancel_query(), too.  As below, it uses a hidden 30-second timeout.  After all, postgres_fdw
> also relies on a timeout already.
 

It uses the timeout, but it's also cancellable before the timeout. See
that we call CHECK_FOR_INTERRUPTS() in pgfdw_get_cleanup_result().
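
A simplified model (with invented names; not the actual postgres_fdw C code) of the pattern pgfdw_get_cleanup_result() follows: there is a hard deadline (30 seconds in postgres_fdw), but the wait polls in small slices and checks for interrupts between slices, so it is cancellable well before the timeout fires.

```python
# Sketch of a cancellable wait with a hard deadline, modeled on the
# pgfdw_get_cleanup_result() pattern discussed above.
import time

def get_cleanup_result(result_ready, interrupted, deadline,
                       now=time.monotonic, tick: float = 0.001) -> str:
    while True:
        if interrupted():          # the CHECK_FOR_INTERRUPTS() analogue
            return "canceled"
        if result_ready():
            return "done"
        if now() >= deadline:      # give up: assume the connection is dead
            return "timed_out"
        time.sleep(tick)           # short slice, then recheck everything
```

The timeout is thus the last resort, not the only escape: a cancel request is noticed on the next slice, typically within milliseconds.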

>
>
> > After receiving a signal, it closes the problematic connection. So the
> > local session is usable after that, but the failed remote sessions are
> > closed and new ones are created at the next use.
>
> I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong
> place?
>
>                     /*
>                      * If connection is already unsalvageable, don't touch it
>                      * further.
>                      */
>                     if (entry->changing_xact_state)
>                         break;
>

I guess Horiguchi-san referred to the following code in pgfdw_xact_callback():

        /*
         * If the connection isn't in a good idle state, discard it to
         * recover. Next GetConnection will open a new connection.
         */
        if (PQstatus(entry->conn) != CONNECTION_OK ||
            PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
            entry->changing_xact_state)
        {
            elog(DEBUG3, "discarding connection %p", entry->conn);
            disconnect_pg_server(entry);
        }

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 20 Oct 2020 21:22:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in 
> On Tue, 20 Oct 2020 at 17:56, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > > It seems to respond to a statement-cancel signal immediately while
> > > waiting for an incoming byte.  However, it seems to wait forever while
> > > waiting for space in the send buffer. (Does that mean the session will be
> > > stuck if it sends a large chunk of bytes while the network is down?)
> >
> > What part makes you worried about that?  libpq's send processing?
> >
> > I've just examined pgfdw_cancel_query(), too.  As below, it uses a hidden 30-second timeout.  After all,
> > postgres_fdw also relies on a timeout already.
 
> 
> It uses the timeout but it's also cancellable before the timeout. See
> we call CHECK_FOR_INTERRUPTS() in pgfdw_get_cleanup_result().

Yes. And as Sawada-san mentioned, it doesn't matter whether a specific
FDW module accepts cancellation or not. It's sufficient that we have one
example. Other FDWs will follow postgres_fdw if needed.

> > > After receiving a signal, it closes the problematic connection. So the
> > > local session is usable after that, but the failed remote sessions are
> > > closed and new ones are created at the next use.
> >
> > I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong
> > place?
...
> 
> I guess Horiguchi-san refereed the following code in pgfdw_xact_callback():
> 
>         /*
>          * If the connection isn't in a good idle state, discard it to
>          * recover. Next GetConnection will open a new connection.
>          */
>         if (PQstatus(entry->conn) != CONNECTION_OK ||
>             PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
>             entry->changing_xact_state)
>         {
>             elog(DEBUG3, "discarding connection %p", entry->conn);
>             disconnect_pg_server(entry);
>         }

Right.  Although it's not directly relevant to this discussion, to be
precise, that part is not reached immediately after the remote "COMMIT
TRANSACTION" fails. If that commit fails or is canceled, an exception
is raised while entry->changing_xact_state = true. Then the function
is called again within AbortCurrentTransaction() and reaches the above
code.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> >         if (PQstatus(entry->conn) != CONNECTION_OK ||
> >             PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> >             entry->changing_xact_state)
> >         {
> >             elog(DEBUG3, "discarding connection %p", entry->conn);
> >             disconnect_pg_server(entry);
> >         }
>
> Right.  Although it's not directly relevant to this discussion,
> precisely, that part is not visited just after the remote "COMMIT
> TRANSACTION" failed. If that commit fails or is canceled, an exception
> is raised while entry->changing_xact_state = true. Then the function
> is called again within AbortCurrentTransaction() and reaches the above
> code.

Ah, then the connection to the foreign server is closed after failing to cancel the query.  Thanks.


Regards
Takayuki Tsunakawa




RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> So what's your opinion?

My opinion is simple and has not changed.  Let's clarify and refine the design first in the following areas (others may
have pointed out something else too, but I don't remember), before going deeper into the code review.
 

* FDW interface
New functions that other FDWs can really implement.  Currently, XA seems to be the only model we can rely on to
validate the FDW interface.

What FDW function would call what XA function(s)?  What should be the arguments for the FDW functions?

* Performance
Parallel prepares and commits on the client backend.  The current implementation is intolerable and does not reach
first-release quality.  I proposed the idea.

(If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit.  I
want to keep Postgres's reputation.)
 
As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.)  That
helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set.  I'm simply
imagining what is typically written in database textbooks and research papers.  I'm asking this because I saw some
discussion in this thread that some new WAL records are added.  I was worried that transactions have to write WAL
records other than prepare and commit, unlike textbook implementations.
 

Atomic Commit of Distributed Transactions
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
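The textbook message flow asked for above can be sketched roughly as follows. This is a minimal illustrative model only (the `Participant` class and all names are invented for the sketch, not part of the patch set): phase 1 collects prepare votes, the coordinator logs its decision, and phase 2 propagates commit or abort.

```python
# Hypothetical sketch of the textbook 2PC message flow, not the patch's code.

class Participant:
    """A foreign server taking part in the distributed transaction."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1: the participant force-writes a PREPARE record, then votes.
        if self.fail_on_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"
        return True

    def finish(self, commit):
        # Phase 2: the participant writes COMMIT/ABORT and releases locks.
        self.state = "committed" if commit else "aborted"


def two_phase_commit(coordinator_log, participants):
    # Phase 1: collect votes from all participants.
    if all(p.prepare() for p in participants):
        coordinator_log.append("COMMIT")  # the decision itself is logged
        decision = True
    else:
        coordinator_log.append("ABORT")
        decision = False
    # Phase 2: propagate the decision to every prepared participant.
    for p in participants:
        if p.state == "prepared":
            p.finish(decision)
    return decision
```

Per transaction this gives one forced log write per participant (the prepare record) plus one for the coordinator's decision, which is the baseline the WAL-record question above is comparing against.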

* Query cancellation
As you showed, there's no problem with postgres_fdw?
The cancelability of FDW in general remains a problem, but that can be a separate undertaking.

* Global visibility
This is what Amit-san suggested some times -- "design it before reviewing the current patch."  I'm a bit optimistic
about this and think this FDW 2PC can be implemented separately as a pure enhancement of FDW.  But I also understand his
concern.  If your (our?) aim is to use this FDW 2PC for sharding, we may have to design the combination of 2PC and
visibility first.
 



> I don’t think we need to stipulate the query cancellation. Anyway I
> guess the facts neither that we don’t stipulate anything about query
> cancellation now nor that postgres_fdw might not be cancellable in
> some situations now are not a reason for not supporting query
> cancellation. If it's a desirable behavior and users want it, we need
> to put an effort to support it as much as possible like we’ve done in
> postgres_fdw.  Some FDWs unfortunately might not be able to support it
> only by their functionality but it would be good if we can achieve
> that by combination of PostgreSQL and FDW plugins.

Let me comment on this a bit; this is a bit dangerous idea, I'm afraid.  We need to pay attention to the FDW interface
and its documentation so that FDW developers can implement what we consider important -- query cancellation in your
discussion.  "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't use.
That's what Tomas Vondra pointed out several years ago.
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Amit Kapila
Date:
On Wed, Oct 21, 2020 at 3:03 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > So what's your opinion?
>
> * Global visibility
> This is what Amit-san suggested some times -- "design it before reviewing the current patch."  I'm a bit optimistic
about this and think this FDW 2PC can be implemented separately as a pure enhancement of FDW.  But I also understand his
concern.  If your (our?) aim is to use this FDW 2PC for sharding,
>

As far as I understand, that is the goal, and this work is a step
toward it. For example, see the wiki [1]. I understand that wiki is
not the final thing, but I have seen other places as well where
FDW-based sharding is mentioned, and I feel this is the reason why
many people are trying to improve this area. That is why I suggested
an upfront design of global visibility and a deadlock detector along
with this work.


[1] - https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

--
With Regards,
Amit Kapila.



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > So what's your opinion?
>
> My opinion is simple and has not changed.  Let's clarify and refine the design first in the following areas (others
may have pointed out something else too, but I don't remember), before going deeper into the code review.
>
> * FDW interface
> New functions so that other FDWs can really implement.  Currently, XA seems to be the only model we can rely on to
validate the FDW interface.
> What FDW function would call what XA function(s)?  What should be the arguments for the FDW functions?

I guess that since the FDW interfaces may be affected by the feature's
architecture, we can discuss them later.

> * Performance
> Parallel prepare and commits on the client backend.  The current implementation is intolerable and should not be the
first release quality.  I proposed the idea.
> (If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit.  I
want to keep Postgres's reputation.)

What do you have in mind regarding the implementation of parallel
prepare and commit? Given that some FDW plugins don't support
asynchronous execution, I guess we need to use parallel workers or
something. That is, the backend process launches parallel workers to
prepare/commit/rollback foreign transactions in parallel. I don't deny
this approach, but it'll definitely make the feature complex and need
more code.

My point is to start small and keep the first version simple. Even if
we need one or more years for this feature, I think that introducing
the simple and minimum functionality to the core as the first version
still has benefits. We will have the opportunity to get real feedback
from users and to fix bugs in the main infrastructure before making it
complex. In this sense, the patch having the backend return without
waiting for resolution after the local commit would be a good start as
the first version (i.e., up to applying the v26-0006 patch). Anyway,
the architecture should be extensible enough for future improvements.

For the performance improvements, we will be able to support
asynchronous and/or parallel prepare/commit/rollback. Moreover, having
multiple resolver processes on one database would also help get better
throughput. A user who needs much better throughput can also choose
not to wait for resolution after the local commit, like
synchronous_commit = 'local' in replication.

> As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.)
That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set.  I'm
simply imagining what is typically written in database textbooks and research papers.  I'm asking this because I saw
some discussion in this thread that some new WAL records are added.  I was worried that transactions have to write WAL
records other than prepare and commit unlike textbook implementations.
>
> Atomic Commit of Distributed Transactions
> https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

Understood. I'll add an explanation about the message flow and disk
writes to the wiki page.

We also need to consider error handling during the resolution of
foreign transactions.

>
> > I don’t think we need to stipulate the query cancellation. Anyway I
> > guess the facts neither that we don’t stipulate anything about query
> > cancellation now nor that postgres_fdw might not be cancellable in
> > some situations now are not a reason for not supporting query
> > cancellation. If it's a desirable behavior and users want it, we need
> > to put an effort to support it as much as possible like we’ve done in
> > postgres_fdw.  Some FDWs unfortunately might not be able to support it
> > only by their functionality but it would be good if we can achieve
> > that by combination of PostgreSQL and FDW plugins.
>
> Let me comment on this a bit; this is a bit dangerous idea, I'm afraid.  We need to pay attention to the FDW
interface and its documentation so that FDW developers can implement what we consider important -- query cancellation in
your discussion.  "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't
use.  That's what Tomas Vondra pointed out several years ago.

I suspect the story is somewhat different. libpq fortunately supports
asynchronous execution, but when it comes to canceling foreign
transaction resolution I think basically all FDW plugins are in the
same situation at this time. We can choose whether to make it
cancellable or not. According to the discussion so far, it completely
depends on the architecture of this feature. So my point is whether
this functionality is worth having for users and whether users want
it, not whether postgres_fdw is OK.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Oct 22, 2020 at 10:39 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > So what's your opinion?
> >
> > My opinion is simple and has not changed.  Let's clarify and refine the design first in the following areas (others
may have pointed out something else too, but I don't remember), before going deeper into the code review.
> >
> > * FDW interface
> > New functions so that other FDWs can really implement.  Currently, XA seems to be the only model we can rely on to
validate the FDW interface.
> > What FDW function would call what XA function(s)?  What should be the arguments for the FDW functions?
>
> I guess since FDW interfaces may be affected by the feature
> architecture we can discuss later.
>
> > * Performance
> > Parallel prepare and commits on the client backend.  The current implementation is intolerable and should not be
the first release quality.  I proposed the idea.
> > (If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit.
I want to keep Postgres's reputation.)
>
> What is in your mind regarding the implementation of parallel prepare
> and commit? Given that some FDW plugins don't support asynchronous
> execution I guess we need to use parallel workers or something. That
> is, the backend process launches parallel workers to
> prepare/commit/rollback foreign transactions in parallel. I don't deny
> this approach but it'll definitely make the feature complex and needs
> more codes.
>
> My point is a small start and keeping simple the first version. Even
> if we need one or more years for this feature, I think that
> introducing the simple and minimum functionality as the first version
> to the core still has benefits. We will be able to have the
> opportunity to get real feedback from users and to fix bugs in the
> main infrastructure before making it complex. In this sense, the patch
> having the backend return without waits for resolution after the local
> commit would be a good start as the first version (i.g., up to
> applying v26-0006 patch). Anyway, the architecture should be
> extensible enough for future improvements.
>
> For the performance improvements, we will be able to support
> asynchronous and/or prepare/commit/rollback. Moreover, having multiple
> resolver processes on one database would also help get better
> through-put. For the user who needs much better through-put, the user
> also can select not to wait for resolution after the local commit,
> like synchronous_commit = ‘local’ in replication.
>
> > As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.)
That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set.  I'm
simply imagining what is typically written in database textbooks and research papers.  I'm asking this because I saw
some discussion in this thread that some new WAL records are added.  I was worried that transactions have to write WAL
records other than prepare and commit unlike textbook implementations.
> >
> > Atomic Commit of Distributed Transactions
> > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
>
> Understood. I'll add an explanation about the message flow and disk
> writes to the wiki page.

Done.

>
> We need to consider the point of error handling during resolving
> foreign transactions too.
>
> >
> > > I don’t think we need to stipulate the query cancellation. Anyway I
> > > guess the facts neither that we don’t stipulate anything about query
> > > cancellation now nor that postgres_fdw might not be cancellable in
> > > some situations now are not a reason for not supporting query
> > > cancellation. If it's a desirable behavior and users want it, we need
> > > to put an effort to support it as much as possible like we’ve done in
> > > postgres_fdw.  Some FDWs unfortunately might not be able to support it
> > > only by their functionality but it would be good if we can achieve
> > > that by combination of PostgreSQL and FDW plugins.
> >
> > Let me comment on this a bit; this is a bit dangerous idea, I'm afraid.  We need to pay attention to the FDW
interface and its documentation so that FDW developers can implement what we consider important -- query cancellation in
your discussion.  "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't
use.  That's what Tomas Vondra pointed out several years ago.
>
> I suspect the story is somewhat different. libpq fortunately supports
> asynchronous execution, but when it comes to canceling the foreign
> transaction resolution I think basically all FDW plugins are in the
> same situation at this time. We can choose whether to make it
> cancellable or not. According to the discussion so far, it completely
> depends on the architecture of this feature. So my point is whether
> it's worth to have this functionality for users and whether users want
> it, not whether postgres_fdw is ok.
>

I've thought again about the idea that once the backend fails to
resolve a foreign transaction, it hands the work off to a resolver
process. With this idea, the backend process performs the 2nd phase of
2PC only once. If an error happens during resolution, it hands the
transaction off to a resolver process and returns an error to the
client. We used this idea in previous patches and it has been
discussed several times.

First of all, this idea doesn't solve the error-handling problem that
the transaction could return an error to the client even though the
local transaction has been committed. There is an argument that this
behavior could also happen in a single-server environment, but I guess
the situation is slightly different. Basically, what the transaction
does after the commit is cleanup. An error could happen during
cleanup, but if it happens it's likely due to a bug or something wrong
inside PostgreSQL or the OS. On the other hand, during and after
resolution the transaction does major work such as connecting to a
foreign server, sending SQL, getting the result, and writing WAL to
remove the entry. An error is much more likely to happen in these
steps.

Also, with this idea, the client needs to check whether the error it
got from the server is genuine, because the local transaction might
have been committed. Although this could happen even in a
single-server environment, how many users check that in practice? If a
server crashes, subsequent transactions end up failing due to a
network connection error, and it seems hard to distinguish between
such a real error and the fake error.

Moreover, it's questionable in terms of extensibility. We would not be
able to support keeping waiting for distributed transactions to
complete even if an error happens, like synchronous replication. The
user might want to wait in cases where the failure is temporary, such
as a temporary network disconnection. Trying resolution only once
seems to have the cons of both asynchronous and synchronous
resolution.

So I'm thinking that with this idea users will need to change their
applications so that they check whether the error they got is genuine,
which is cumbersome. Also, it seems to me we need to carefully discuss
whether this idea could weaken extensibility.
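The "wait out a temporary failure" behavior argued for above can be sketched as a simple retry loop. This is an illustrative sketch only; the function names, the choice of `ConnectionError` as the transient-failure signal, and the backoff policy are all assumptions, not anything in the patch set.

```python
# Hedged sketch: retry resolution on transient failures instead of giving
# up after a single attempt. All names here are invented for illustration.
import time


def resolve_with_retry(resolve_fn, max_attempts=3, base_delay=0.0):
    """Call resolve_fn, retrying on transient failures.

    Transient failures (modeled as ConnectionError, e.g. a temporary
    network disconnection) are retried with a simple linear backoff;
    the last failure is re-raised so the caller still sees a real error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return resolve_fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: the failure does not look temporary
            time.sleep(base_delay * attempt)  # back off before retrying
```

Trying resolution only once corresponds to `max_attempts=1`; the point above is that a larger budget (or an unbounded one, as in synchronous replication) should remain possible.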


Anyway, according to the discussion, it seems to me that we have a
consensus so far that the backend process prepares all foreign
transactions and a resolver process is necessary to resolve in-doubt
transactions in the background. So I've changed the patch set as
follows. Applying all these patches, we can support asynchronous
foreign transaction resolution. That is, at transaction commit the
backend process prepares all foreign transactions and then commits the
local transaction. After that, it reports commit success to the client
while leaving the prepared foreign transactions to a resolver process.
A resolver process fetches the foreign transactions to resolve and
resolves them in the background. Since the 2nd phase of 2PC is
performed asynchronously, a transaction that wants to see the previous
transaction's result needs to check its status.
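The commit sequence described above can be sketched as follows. This is a minimal illustrative model under loud assumptions: `backend_commit`, `resolver_step`, and the in-memory `resolver_queue` are all invented names standing in for the patch's shared fdwxact state and resolver process, not the actual implementation.

```python
# Illustrative sketch of the consensus flow: prepare all foreign
# transactions, commit locally, return OK, resolve asynchronously.
from collections import deque

resolver_queue = deque()  # stands in for the shared in-doubt transaction list


def backend_commit(local_log, foreign_servers):
    """What the backend does at COMMIT, per the flow described above."""
    # Phase 1: prepare every involved foreign transaction.
    for server in foreign_servers:
        server["state"] = "prepared"
    # Commit the local transaction; this is the atomic commit point.
    local_log.append("LOCAL COMMIT")
    # Hand the prepared foreign transactions to the resolver and
    # report success to the client immediately, without waiting.
    for server in foreign_servers:
        resolver_queue.append(server)
    return "OK"


def resolver_step():
    """One unit of background work done by a resolver process."""
    if not resolver_queue:
        return None
    server = resolver_queue.popleft()
    server["state"] = "committed"  # 2nd phase of 2PC, done asynchronously
    return server["name"]
```

Because `resolver_step` runs after the client has already received OK, a later transaction that depends on the foreign side being resolved has to check the in-doubt state, which is exactly the caveat stated above.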

Here is a brief explanation of each patch:

v27-0001-Introduce-transaction-manager-for-foreign-transa.patch

This commit adds the basic foreign transaction manager and the
CommitForeignTransaction and RollbackForeignTransaction APIs. These
APIs support only one-phase commit. With this change, an FDW is able
to control its transactions using the foreign transaction manager
rather than an XactCallback.

v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch

This commit implements both the CommitForeignTransaction and
RollbackForeignTransaction APIs in postgres_fdw. Note that since
PREPARE TRANSACTION is still not supported, there is nothing new the
user is able to do.

v27-0003-Recreate-RemoveForeignServerById.patch

This commit recreates RemoveForeignServerById that was removed by
b1d32d3e3. This is necessary because we need to check if there is a
foreign transaction involved with the foreign server that is about to
be removed.

v27-0004-Add-PrepareForeignTransaction-API.patch

This commit adds prepared foreign transaction support, including WAL
logging and recovery, and the PrepareForeignTransaction API. With this
change, the user is able to run the 'PREPARE TRANSACTION' and
'COMMIT/ROLLBACK PREPARED' commands on a transaction that involves
foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the
local transaction; it doesn't do anything for foreign transactions.
Therefore, the user needs to resolve foreign transactions manually by
executing the pg_resolve_foreign_xacts() SQL function, which is also
introduced by this commit.

v27-0005-postgres_fdw-supports-prepare-API.patch

This commit implements the PrepareForeignTransaction API and makes
CommitForeignTransaction and RollbackForeignTransaction support
two-phase commit.

v27-0006-Add-GetPrepareId-API.patch

This commit adds the GetPrepareId API.

v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch

This commit introduces the foreign transaction resolver and launcher
processes. With this change, the user doesn't need to manually execute
the pg_resolve_foreign_xacts() function to resolve foreign
transactions prepared by PREPARE TRANSACTION and left by
COMMIT/ROLLBACK PREPARED. Instead, a resolver process automatically
resolves them in the background.

v27-0008-Prepare-foreign-transactions-at-commit-time.patch

With this commit, the transaction prepares foreign transactions marked
as modified at transaction commit if foreign_twophase_commit is
'required'. Previously the user needed to do PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED to use 2PC, but this enables us to use 2PC
transparently to the user. However, the transaction reports commit
success to the client after committing the local transaction and
notifying the resolver process, without waiting. Foreign transactions
are asynchronously resolved by the resolver process.
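The commit-time decision the 0008 patch makes might be sketched like this. The GUC name `foreign_twophase_commit` and the "marked as modified" notion come from the patch descriptions; the function and the server representation are invented purely for illustration.

```python
# Minimal sketch of the commit-time decision described for patch 0008:
# with foreign_twophase_commit = 'required', foreign transactions that
# modified data get the prepare step; otherwise one-phase commit is used.


def servers_to_prepare(foreign_twophase_commit, servers):
    """Return the foreign servers whose transactions must be prepared."""
    if foreign_twophase_commit != "required":
        return []  # one-phase commit for everything
    # Only transactions marked as modified need two-phase commit.
    return [s for s in servers if s["modified"]]
```

Read-only foreign transactions fall through the filter, which matches why patch 0009 has postgres_fdw mark its transactions as modified: without that mark, 2PC would never be triggered for them.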

v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch

With this commit, the transactions started via postgres_fdw are marked
as modified, which is necessary to use 2PC.

v27-0010-Documentation-update.patch
v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch

Documentation update and regression tests.

The missing piece from the previous version of the patch is
synchronous transaction resolution. In the previous patch, foreign
transactions are synchronously resolved by a resolver process. But
since it's under discussion whether this is a good approach, and I'm
considering optimizing the logic, it's not included in the current
patch set.


Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Nov 5, 2020 at 12:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 22, 2020 at 10:39 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > >
> > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > So what's your opinion?
> > >
> > > My opinion is simple and has not changed.  Let's clarify and refine the design first in the following areas
(othersmay have pointed out something else too, but I don't remember), before going deeper into the code review. 
> > >
> > > * FDW interface
> > > New functions so that other FDWs can really implement.  Currently, XA seems to be the only model we can rely on
tovalidate the FDW interface. 
> > > What FDW function would call what XA function(s)?  What should be the arguments for the FEW functions?
> >
> > I guess since FDW interfaces may be affected by the feature
> > architecture we can discuss later.
> >
> > > * Performance
> > > Parallel prepare and commits on the client backend.  The current implementation is untolerable and should not be
thefirst release quality.  I proposed the idea. 
> > > (If you insist you don't want to anything about this, I have to think you're just rushing for the patch commit.
Iwant to keep Postgres's reputation.) 
> >
> > What is in your mind regarding the implementation of parallel prepare
> > and commit? Given that some FDW plugins don't support asynchronous
> > execution I guess we need to use parallel workers or something. That
> > is, the backend process launches parallel workers to
> > prepare/commit/rollback foreign transactions in parallel. I don't deny
> > this approach but it'll definitely make the feature complex and needs
> > more codes.
> >
> > My point is a small start and keeping simple the first version. Even
> > if we need one or more years for this feature, I think that
> > introducing the simple and minimum functionality as the first version
> > to the core still has benefits. We will be able to have the
> > opportunity to get real feedback from users and to fix bugs in the
> > main infrastructure before making it complex. In this sense, the patch
> > having the backend return without waits for resolution after the local
> > commit would be a good start as the first version (i.g., up to
> > applying v26-0006 patch). Anyway, the architecture should be
> > extensible enough for future improvements.
> >
> > For the performance improvements, we will be able to support
> > asynchronous and/or prepare/commit/rollback. Moreover, having multiple
> > resolver processes on one database would also help get better
> > through-put. For the user who needs much better through-put, the user
> > also can select not to wait for resolution after the local commit,
> > like synchronous_commit = ‘local’ in replication.
> >
> > > As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.)
That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set.  I'm
simplyimagining what is typically written in database textbooks and research papers.  I'm asking this because I saw
somediscussion in this thread that some new WAL records are added.  I was worried that transactions have to write WAL
recordsother than prepare and commit unlike textbook implementations. 
> > >
> > > Atomic Commit of Distributed Transactions
> > > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
> >
> > Understood. I'll add an explanation about the message flow and disk
> > writes to the wiki page.
>
> Done.
>
> >
> > We need to consider the point of error handling during resolving
> > foreign transactions too.
> >
> > >
> > > > I don’t think we need to stipulate the query cancellation. Anyway I
> > > > guess the facts neither that we don’t stipulate anything about query
> > > > cancellation now nor that postgres_fdw might not be cancellable in
> > > > some situations now are not a reason for not supporting query
> > > > cancellation. If it's a desirable behavior and users want it, we need
> > > > to put an effort to support it as much as possible like we’ve done in
> > > > postgres_fdw.  Some FDWs unfortunately might not be able to support it
> > > > only by their functionality but it would be good if we can achieve
> > > > that by combination of PostgreSQL and FDW plugins.
> > >
> > > Let me comment on this a bit; this is a bit dangerous idea, I'm afraid.  We need to pay attention to the FDW
interfaceand its documentation so that FDW developers can implement what we consider important -- query cancellation in
yourdiscussion.  "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't
use. That's what Tomas Vondra pointed out several years ago. 
> >
> > I suspect the story is somewhat different. libpq fortunately supports
> > asynchronous execution, but when it comes to canceling the foreign
> > transaction resolution I think basically all FDW plugins are in the
> > same situation at this time. We can choose whether to make it
> > cancellable or not. According to the discussion so far, it completely
> > depends on the architecture of this feature. So my point is whether
> > it's worth to have this functionality for users and whether users want
> > it, not whether postgres_fdw is ok.
> >
>
> I've thought again about the idea that once the backend failed to
> resolve a foreign transaction it leaves to a resolver process. With
> this idea, the backend process perform the 2nd phase of 2PC only once.
> If an error happens during resolution it leaves to a resolver process
> and returns an error to the client. We used to use this idea in the
> previous patches and it’s discussed sometimes.
>
> First of all, this idea doesn’t resolve the problem of error handling
> that the transaction could return an error to the client in spite of
> having been committed the local transaction. There is an argument that
> this behavior could also happen even in a single server environment
> but I guess the situation is slightly different. Basically what the
> transaction does after the commit is cleanup. An error could happen
> during cleanup but if it happens it’s likely due to a  bug of
> something wrong inside PostgreSQL or OS. On the other hand, during and
> after resolution the transaction does major works such as connecting a
> foreign server, sending an SQL, getting the result, and writing a WAL
> to remove the entry. These are more likely to happen an error.
>
> Also, with this idea, the client needs to check if the error got from
> the server is really true because the local transaction might have
> been committed. Although this could happen even in a single server
> environment how many users check that in practice? If a server
> crashes, subsequent transactions end up failing due to a network
> connection error but it seems hard to distinguish between such a real
> error and the fake error.
>
> Moreover, it’s questionable in terms of extensibility. We would not
> able to support keeping waiting for distributed transactions to
> complete even if an error happens, like synchronous replication. The
> user might want to wait in case where the failure is temporary such as
> temporary network disconnection. Trying resolution only once seems to
> have cons of both asynchronous and synchronous resolutions.
>
> So I’m thinking that with this idea the user will need to change their
> application so that it checks if the error they got is really true,
> which is cumbersome for users. Also, it seems to me we need to
> circumspectly discuss whether this idea could weaken extensibility.
>
>
> Anyway, according to the discussion, it seems to me that we got a
> consensus so far that the backend process prepares all foreign
> transactions and a resolver process is necessary to resolve in-doubt
> transaction in background. So I’ve changed the patch set as follows.
> Applying these all patches, we can support asynchronous foreign
> transaction resolution. That is, at transaction commit the backend
> process prepares all foreign transactions, and then commit the local
> transaction. After that, it returns OK of commit to the client while
> leaving the prepared foreign transaction to a resolver process. A
> resolver process fetches the foreign transactions to resolve and
> resolves them in background. Since the 2nd phase of 2PC is performed
> asynchronously a transaction that wants to see the previous
> transaction result needs to check its status.
>
> Here is brief explaination for each patches:
>
> v27-0001-Introduce-transaction-manager-for-foreign-transa.patch
>
> This commit adds the basic foreign transaction manager,
> CommitForeignTransaction, and RollbackForeignTransaction API. These
> APIs support only one-phase. With this change, FDW is able to control
> its transaction using the foreign transaction manager, not using
> XactCallback.
>
> v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
>
> This commit implements both CommitForeignTransaction and
> RollbackForeignTransaction APIs in postgres_fdw. Note that since
> PREPARE TRANSACTION is still not supported there is nothing the user
> newly is able to do.
>
> v27-0003-Recreate-RemoveForeignServerById.patch
>
> This commit recreates RemoveForeignServerById that was removed by
> b1d32d3e3. This is necessary because we need to check if there is a
> foreign transaction involved with the foreign server that is about to
> be removed.
>
> v27-0004-Add-PrepareForeignTransaction-API.patch
>
> This commit adds prepared foreign transaction support including WAL
> logging and recovery, and PrepareForeignTransaction API. With this
> change, the user is able to do 'PREPARE TRANSACTION’ and
> 'COMMIT/ROLLBACK PREPARED' commands on the transaction that involves
> foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the
> local transaction. It doesn't do anything for foreign transactions.
> Therefore, the user needs to resolve foreign transactions manually by
> executing the pg_resolve_foreign_xacts() SQL function which is also
> introduced by this commit.
>
> v27-0005-postgres_fdw-supports-prepare-API.patch
>
> This commit implements PrepareForeignTransaction API and makes
> CommitForeignTransaction and RollbackForeignTransaction supports
> two-phase commit.
>
> v27-0006-Add-GetPrepareId-API.patch
>
> This commit adds GetPrepareID API.
>
> v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch
>
> This commit introduces foreign transaction resolver and launcher
> processes. With this change, the user doesn’t need to manually execute
> pg_resolve_foreign_xacts() function to resolve foreign transactions
> prepared by PREPARE TRANSACTION and left by COMMIT/ROLLBACK PREPARED.
> Instead, a resolver process automatically resolves them in background.
>
> v27-0008-Prepare-foreign-transactions-at-commit-time.patch
>
> With this commit, the transaction prepares foreign transactions marked
> as modified at transaction commit if foreign_twophase_commit is
> ‘required’. Previously the user needs to do PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED to use 2PC but it enables us to use 2PC
> transparently to the user. But the transaction returns OK of commit to
> the client after committing the local transaction and notifying the
> resolver process, without waits. Foreign transactions are
> asynchronously resolved by the resolver process.
>
> v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
>
> With this commit, the transactions started via postgres_fdw are marked
> as modified, which is necessary to use 2PC.
>
> v27-0010-Documentation-update.patch
> v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch
>
> Documentation update and regression tests.
>
> The missing piece compared to the previous version is synchronous
> transaction resolution. In the previous patch, foreign transactions
> were synchronously resolved by a resolver process, but since it's still
> under discussion whether this is a good approach, and I'm considering
> optimizing the logic, it's not included in the current patch set.
>
>

Cfbot reported an error. I've attached the updated version patch set
to make cfbot happy.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sun, Nov 8, 2020 at 2:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Nov 5, 2020 at 12:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Oct 22, 2020 at 10:39 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com
> > > <tsunakawa.takay@fujitsu.com> wrote:
> > > >
> > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > > > So what's your opinion?
> > > >
> > > > My opinion is simple and has not changed.  Let's clarify and refine
> > > > the design first in the following areas (others may have pointed out
> > > > something else too, but I don't remember), before going deeper into
> > > > the code review.
> > > >
> > > > * FDW interface
> > > > New functions so that other FDWs can really implement.  Currently, XA
> > > > seems to be the only model we can rely on to validate the FDW
> > > > interface.
> > > > What FDW function would call what XA function(s)?  What should be the
> > > > arguments for the FDW functions?
> > >
> > > I guess since FDW interfaces may be affected by the feature's
> > > architecture, we can discuss them later.
> > >
> > > > * Performance
> > > > Parallel prepares and commits on the client backend.  The current
> > > > implementation is intolerable and should not be considered
> > > > first-release quality.  I proposed the idea.
> > > > (If you insist you don't want to do anything about this, I have to
> > > > think you're just rushing for the patch commit.  I want to keep
> > > > Postgres's reputation.)
> > >
> > > What is in your mind regarding the implementation of parallel prepare
> > > and commit? Given that some FDW plugins don't support asynchronous
> > > execution I guess we need to use parallel workers or something. That
> > > is, the backend process launches parallel workers to
> > > prepare/commit/rollback foreign transactions in parallel. I don't deny
> > > this approach but it'll definitely make the feature complex and needs
> > > more codes.
> > >
> > > My point is to start small and keep the first version simple. Even
> > > if we need one or more years for this feature, I think that
> > > introducing simple, minimal functionality as the first version
> > > in core still has benefits. We will then have the
> > > opportunity to get real feedback from users and to fix bugs in the
> > > main infrastructure before making it complex. In this sense, the patch
> > > having the backend return without waiting for resolution after the
> > > local commit would be a good start as the first version (i.e., up to
> > > applying the v26-0006 patch). Anyway, the architecture should be
> > > extensible enough for future improvements.
> > >
> > > For the performance improvements, we will be able to support
> > > asynchronous and/or parallel prepare/commit/rollback. Moreover,
> > > having multiple resolver processes on one database would also help
> > > get better throughput. A user who needs much better throughput can
> > > also choose not to wait for resolution after the local commit, like
> > > synchronous_commit = 'local' in replication.
> > >
> > > > As part of this, I'd like to see the 2PC's message flow and disk
> > > > writes (via email and/or on the following wiki.)  That helps evaluate
> > > > the 2PC performance, because it's hard to figure it out in the code of
> > > > a large patch set.  I'm simply imagining what is typically written in
> > > > database textbooks and research papers.  I'm asking this because I saw
> > > > some discussion in this thread that some new WAL records are added.  I
> > > > was worried that transactions have to write WAL records other than
> > > > prepare and commit, unlike textbook implementations.
> > > >
> > > > Atomic Commit of Distributed Transactions
> > > > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions
> > >
> > > Understood. I'll add an explanation about the message flow and disk
> > > writes to the wiki page.
> >
> > Done.
> >
> > >
> > > We also need to consider error handling during the resolution of
> > > foreign transactions.
> > >
> > > >
> > > > > I don't think we need to stipulate query cancellation. In any
> > > > > case, the facts that we don't stipulate anything about query
> > > > > cancellation now and that postgres_fdw might not be cancellable
> > > > > in some situations are not a reason for not supporting query
> > > > > cancellation. If it's a desirable behavior and users want it, we
> > > > > need to put in the effort to support it as much as possible, as
> > > > > we've done in postgres_fdw.  Some FDWs unfortunately might not be
> > > > > able to support it by their own functionality alone, but it would
> > > > > be good if we can achieve that by a combination of PostgreSQL and
> > > > > FDW plugins.
> > > >
> > > > Let me comment on this a bit; this is a bit dangerous idea, I'm
> > > > afraid.  We need to pay attention to the FDW interface and its
> > > > documentation so that FDW developers can implement what we consider
> > > > important -- query cancellation in your discussion.  "postgres_fdw
> > > > is OK, so the interface is good" can create interfaces that other
> > > > FDW developers can't use.  That's what Tomas Vondra pointed out
> > > > several years ago.
> > >
> > > I suspect the story is somewhat different. libpq fortunately supports
> > > asynchronous execution, but when it comes to canceling the foreign
> > > transaction resolution I think basically all FDW plugins are in the
> > > same situation at this time. We can choose whether to make it
> > > cancellable or not. According to the discussion so far, it completely
> > > depends on the architecture of this feature. So my point is whether
> > > it's worth having this functionality for users and whether users want
> > > it, not whether postgres_fdw is ok.
> > >
> >
> > I've thought again about the idea that once the backend fails to
> > resolve a foreign transaction it hands it off to a resolver process.
> > With this idea, the backend process performs the 2nd phase of 2PC
> > only once. If an error happens during resolution, it hands the
> > transaction off to a resolver process and returns an error to the
> > client. We used this idea in previous patches and it has been
> > discussed several times.
> >
> > First of all, this idea doesn't resolve the error-handling problem
> > that the transaction could return an error to the client despite the
> > local transaction having been committed. There is an argument that
> > this behavior could also happen even in a single-server environment,
> > but I guess the situation is slightly different. Basically, what the
> > transaction does after the commit is cleanup. An error could happen
> > during cleanup, but if it does it's likely due to a bug or something
> > wrong inside PostgreSQL or the OS. On the other hand, during and
> > after resolution the transaction does major work such as connecting
> > to a foreign server, sending SQL, getting the result, and writing WAL
> > to remove the entry. These steps are much more likely to raise an
> > error.
> >
> > Also, with this idea, the client needs to check whether the error it
> > got from the server is real, because the local transaction might have
> > been committed. Although this could happen even in a single-server
> > environment, how many users check that in practice? If a server
> > crashes, subsequent transactions end up failing due to a network
> > connection error, but it seems hard to distinguish between such a
> > real error and a fake one.
> >
> > Moreover, it's questionable in terms of extensibility. We would not
> > be able to support keeping waiting for distributed transactions to
> > complete even if an error happens, like synchronous replication does.
> > The user might want to wait in cases where the failure is temporary,
> > such as a temporary network disconnection. Trying resolution only
> > once seems to have the cons of both asynchronous and synchronous
> > resolution.
> >
> > So I'm thinking that with this idea the user will need to change
> > their application so that it checks whether the error they got is
> > real, which is cumbersome for users. Also, it seems to me we need to
> > carefully discuss whether this idea could weaken extensibility.
> >
> >
> > Anyway, according to the discussion, it seems to me that we have
> > reached a consensus so far that the backend process prepares all
> > foreign transactions and a resolver process is necessary to resolve
> > in-doubt transactions in the background. So I've changed the patch
> > set as follows. Applying all of these patches, we can support
> > asynchronous foreign transaction resolution. That is, at transaction
> > commit the backend process prepares all foreign transactions and then
> > commits the local transaction. After that, it returns a commit
> > acknowledgment to the client while leaving the prepared foreign
> > transactions to a resolver process. A resolver process fetches the
> > foreign transactions to resolve and resolves them in the background.
> > Since the 2nd phase of 2PC is performed asynchronously, a transaction
> > that wants to see a previous transaction's result needs to check its
> > status.
> >
> > Here is a brief explanation of each patch:
> >
> > v27-0001-Introduce-transaction-manager-for-foreign-transa.patch
> >
> > This commit adds the basic foreign transaction manager and the
> > CommitForeignTransaction and RollbackForeignTransaction APIs. These
> > APIs support only one-phase commit. With this change, an FDW is able
> > to control its transactions using the foreign transaction manager
> > rather than XactCallback.
> >
> > v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
> >
> > This commit implements both the CommitForeignTransaction and
> > RollbackForeignTransaction APIs in postgres_fdw. Note that since
> > PREPARE TRANSACTION is still not supported, there is nothing new the
> > user can do yet.
> >
> > v27-0003-Recreate-RemoveForeignServerById.patch
> >
> > This commit recreates RemoveForeignServerById that was removed by
> > b1d32d3e3. This is necessary because we need to check if there is a
> > foreign transaction involved with the foreign server that is about to
> > be removed.
> >
> > v27-0004-Add-PrepareForeignTransaction-API.patch
> >
> > This commit adds prepared foreign transaction support including WAL
> > logging and recovery, and the PrepareForeignTransaction API. With this
> > change, the user is able to execute PREPARE TRANSACTION and
> > COMMIT/ROLLBACK PREPARED commands on a transaction that involves
> > foreign servers. Note, however, that COMMIT/ROLLBACK PREPARED ends
> > only the local transaction; it doesn't do anything for foreign
> > transactions. Therefore, the user needs to resolve foreign
> > transactions manually by executing the pg_resolve_foreign_xacts() SQL
> > function, which is also introduced by this commit.
> >
> > v27-0005-postgres_fdw-supports-prepare-API.patch
> >
> > This commit implements the PrepareForeignTransaction API and makes
> > CommitForeignTransaction and RollbackForeignTransaction support
> > two-phase commit.
> >
> > v27-0006-Add-GetPrepareId-API.patch
> >
> > This commit adds the GetPrepareId API.
> >
> > v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch
> >
> > This commit introduces the foreign transaction resolver and launcher
> > processes. With this change, the user doesn't need to manually execute
> > the pg_resolve_foreign_xacts() function to resolve foreign
> > transactions prepared by PREPARE TRANSACTION and left by
> > COMMIT/ROLLBACK PREPARED. Instead, a resolver process automatically
> > resolves them in the background.
> >
> > v27-0008-Prepare-foreign-transactions-at-commit-time.patch
> >
> > With this commit, the transaction prepares foreign transactions marked
> > as modified at transaction commit if foreign_twophase_commit is
> > 'required'. Previously the user needed to issue PREPARE TRANSACTION
> > and COMMIT/ROLLBACK PREPARED to use 2PC, but this change makes 2PC
> > transparent to the user. Note that the transaction returns a commit
> > acknowledgment to the client right after committing the local
> > transaction and notifying the resolver process, without waiting.
> > Foreign transactions are resolved asynchronously by the resolver
> > process.
> >
> > v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
> >
> > With this commit, the transactions started via postgres_fdw are marked
> > as modified, which is necessary to use 2PC.
> >
> > v27-0010-Documentation-update.patch
> > v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch
> >
> > Documentation update and regression tests.
> >
> > The missing piece compared to the previous version is synchronous
> > transaction resolution. In the previous patch, foreign transactions
> > were synchronously resolved by a resolver process, but since it's
> > still under discussion whether this is a good approach, and I'm
> > considering optimizing the logic, it's not included in the current
> > patch set.
> >
> >
>
> Cfbot reported an error. I've attached the updated version patch set
> to make cfbot happy.

Since the previous version conflicts with the current HEAD I've
attached the rebased version patch set.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Nov 25, 2020 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Since the previous version conflicts with the current HEAD I've
> attached the rebased version patch set.

Rebased the patch set again to the current HEAD.

The discussion of this patch is very long so here is a short summary
of the current state:

It's still under discussion which approach is the best for the
distributed transaction commit as a building block of built-in
sharding using foreign data wrappers.

Since we're considering using this feature for built-in sharding, the
design depends on the architecture of built-in sharding. For example,
with the current patch, the PostgreSQL node that received a COMMIT
from the client works as a coordinator and commits the transactions on
all foreign servers involved in the transaction using 2PC. This
approach would be good with a decentralized sharding architecture but
not with a centralized architecture like the GTM node of Postgres-XC
and Postgres-XL, which is a dedicated component responsible for
transaction management. Since we haven't reached a consensus on the
built-in sharding architecture yet, it's still an open question
whether this patch's approach is really good as a building block of
built-in sharding.

On the other hand, this feature is not necessarily dedicated to
built-in sharding. For example, the distributed transaction commit
through FDWs is also important when atomically moving data between two
servers via FDWs. Using a dedicated process or server like a GTM could
be overkill there; having the node that received the COMMIT work as
the coordinator would be better and more straightforward.

There is no noticeable TODO in the functionality covered so far by
this patch set. The patch set adds new FDW APIs to support 2PC,
introduces the global transaction manager, and implements those FDW
APIs in postgres_fdw. It also has regression tests and documentation.
Transactions on foreign servers involved in a distributed transaction
are committed using 2PC. Committing via 2PC is performed
asynchronously and transparently to the user. Therefore, it doesn't
guarantee that transactions on the foreign servers are also committed
when the client gets an acknowledgment of COMMIT. Synchronous foreign
transaction commit via 2PC is not covered by this patch set, as we
still need a discussion on the design.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, Dec 28, 2020 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 25, 2020 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Since the previous version conflicts with the current HEAD I've
> > attached the rebased version patch set.
>
> Rebased the patch set again to the current HEAD.
>
> The discussion of this patch is very long so here is a short summary
> of the current state:
>
> It's still under discussion which approach is the best for the
> distributed transaction commit as a building block of built-in
> sharding using foreign data wrappers.
>
> Since we’re considering that we use this feature for built-in
> sharding, the design depends on the architecture of built-in sharding.
> For example, with the current patch, the PostgreSQL node that received
> a COMMIT from the client works as a coordinator and it commits the
> transactions using 2PC on all foreign servers involved with the
> transaction. This approach would be good with the de-centralized
> sharding architecture but not with centralized architecture like the
> GTM node of Postgres-XC and Postgres-XL that is a dedicated component
> that is responsible for transaction management. Since we don't get a
> consensus on the built-in sharding architecture yet, it's still an
> open question that this patch's approach is really good as a building
> block of the built-in sharding.
>
> On the other hand, this feature is not necessarily dedicated to the
> built-in sharding. For example, the distributed transaction commit
> through FDW is important also when atomically moving data between two
> servers via FDWs. Using a dedicated process or server like a GTM could
> be overkill. Having the node that received a COMMIT work as a
> coordinator would be better and more straightforward.
>
> There is no noticeable TODO in the functionality so far covered by
> this patch set. This patchset adds new FDW APIs to support 2PC,
> introduces the global transaction manager, and implement those FDW
> APIs to postgres_fdw. Also, it has regression tests and documentation.
> Transactions on foreign servers involved with the distributed
> transaction are committed using 2PC. Committing using 2PC is performed
> asynchronously and transparently to the user. Therefore, it doesn’t
> guarantee that transactions on the foreign server are also committed
> when the client gets an acknowledgment of COMMIT. Synchronous foreign
> transaction commit via 2PC is not covered by this patch set, as we
> still need a discussion on the design.
>

I've attached the rebased patches to make cfbot happy.


Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:
Hi,
For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :

However these functions are not neither committed nor aborted at

I think the double negation was not intentional. Should be 'are neither ...'

For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?

+       fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);

For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?

+        * we don't need to anything for this participant because all foreign

'need to' -> 'need to do'

+   else if (TransactionIdDidAbort(xid))
+       return FDWXACT_STATUS_ABORTING;
+
the 'else' can be omitted since the preceding if would return.

+   if (max_prepared_foreign_xacts <= 0)

I wonder when the value for max_prepared_foreign_xacts would be negative (and whether that should be considered an error).

Cheers

On Wed, Jan 6, 2021 at 5:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 28, 2020 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 25, 2020 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Since the previous version conflicts with the current HEAD I've
> > attached the rebased version patch set.
>
> Rebased the patch set again to the current HEAD.
>
> The discussion of this patch is very long so here is a short summary
> of the current state:
>
> It's still under discussion which approach is the best for the
> distributed transaction commit as a building block of built-in
> sharding using foreign data wrappers.
>
> Since we’re considering that we use this feature for built-in
> sharding, the design depends on the architecture of built-in sharding.
> For example, with the current patch, the PostgreSQL node that received
> a COMMIT from the client works as a coordinator and it commits the
> transactions using 2PC on all foreign servers involved with the
> transaction. This approach would be good with the de-centralized
> sharding architecture but not with centralized architecture like the
> GTM node of Postgres-XC and Postgres-XL that is a dedicated component
> that is responsible for transaction management. Since we don't get a
> consensus on the built-in sharding architecture yet, it's still an
> open question that this patch's approach is really good as a building
> block of the built-in sharding.
>
> On the other hand, this feature is not necessarily dedicated to the
> built-in sharding. For example, the distributed transaction commit
> through FDW is important also when atomically moving data between two
> servers via FDWs. Using a dedicated process or server like a GTM could
> be overkill. Having the node that received a COMMIT work as a
> coordinator would be better and more straightforward.
>
> There is no noticeable TODO in the functionality so far covered by
> this patch set. This patchset adds new FDW APIs to support 2PC,
> introduces the global transaction manager, and implement those FDW
> APIs to postgres_fdw. Also, it has regression tests and documentation.
> Transactions on foreign servers involved with the distributed
> transaction are committed using 2PC. Committing using 2PC is performed
> asynchronously and transparently to the user. Therefore, it doesn’t
> guarantee that transactions on the foreign server are also committed
> when the client gets an acknowledgment of COMMIT. Synchronous foreign
> transaction commit via 2PC is not covered by this patch set, as we
> still need a discussion on the design.
>

I've attached the rebased patches to make cfbot happy.


Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,

Thank you for reviewing the patch!

> For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>
> However these functions are not neither committed nor aborted at
>
> I think the double negation was not intentional. Should be 'are neither ...'

Fixed.

>
> For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?

Hmm, you mean that we need MAXALIGN(size) after adding the size of
FdwXactData structs?

Size
FdwXactShmemSize(void)
{
    Size        size;

    /* Size for foreign transaction information array */
    size = offsetof(FdwXactCtlData, fdwxacts);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXact)));
    size = MAXALIGN(size);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXactData)));

    return size;
}

I don't think we need to do that. Other similar code, such as
TwoPhaseShmemSize(), doesn't do that either. Why do you think we need it?
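For readers following along, the sizing arithmetic above can be sketched
in Python (a transcription for illustration only, assuming an 8-byte
MAXIMUM_ALIGNOF; the function names and parameters here are hypothetical
stand-ins for the C code): the intermediate MAXALIGN aligns the start of
the FdwXactData array, while a trailing MAXALIGN is unnecessary because
the shared-memory allocator itself aligns the chunks it hands out, which
is what TwoPhaseShmemSize() also assumes.

```python
# Illustrative transcription of the FdwXactShmemSize() arithmetic.
MAXIMUM_ALIGNOF = 8  # assumed platform maximum alignment

def maxalign(size: int) -> int:
    """Round size up to the next multiple of MAXIMUM_ALIGNOF."""
    return (size + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)

def fdwxact_shmem_size(header: int, n: int, ptr_size: int,
                       entry_size: int) -> int:
    """Header plus pointer array, aligned, then the entry array.

    The maxalign() between the two arrays ensures the second array
    (the FdwXactData structs) starts on an aligned boundary.
    """
    size = header + n * ptr_size   # FdwXactCtlData header + FdwXact pointers
    size = maxalign(size)          # align start of the FdwXactData array
    size += n * entry_size         # the FdwXactData structs themselves
    return size
```

For example, with a 12-byte header, 3 entries, 8-byte pointers, and
40-byte structs, the pointer array ends at offset 36, which is rounded up
to 40 before the struct array is added.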

>
> +       fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>
> For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?

Agreed. Changed to FdwXactInsertEntry().

>
> +        * we don't need to anything for this participant because all foreign
>
> 'need to' -> 'need to do'

Fixed.

>
> +   else if (TransactionIdDidAbort(xid))
> +       return FDWXACT_STATUS_ABORTING;
> +
> the 'else' can be omitted since the preceding if would return.

Fixed.

>
> +   if (max_prepared_foreign_xacts <= 0)
>
> I wonder when the value for max_prepared_foreign_xacts would be
> negative (and whether that should be considered an error).
>

Fixed to (max_prepared_foreign_xacts == 0)

Attached the updated version patch set.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:
Hi,
For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

+   bool        have_notwophase = false;

Maybe name the variable have_no_twophase so that it is easier to read.

+    * Two-phase commit is not required if the number of servers performed

performed -> performing

+                errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
+                errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));

The lines are really long. Please wrap into more lines.



On Wed, Jan 13, 2021 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,

Thank you for reviewing the patch!

> For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>
> However these functions are not neither committed nor aborted at
>
> I think the double negation was not intentional. Should be 'are neither ...'

Fixed.

>
> For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?

Hmm, you mean that we need MAXALIGN(size) after adding the size of
FdwXactData structs?

Size
FdwXactShmemSize(void)
{
    Size        size;

    /* Size for foreign transaction information array */
    size = offsetof(FdwXactCtlData, fdwxacts);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXact)));
    size = MAXALIGN(size);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXactData)));

    return size;
}

I don't think we need to do that. Other similar code, such as
TwoPhaseShmemSize(), doesn't do that either. Why do you think we need it?

>
> +       fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>
> For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?

Agreed. Changed to FdwXactInsertEntry().

>
> +        * we don't need to anything for this participant because all foreign
>
> 'need to' -> 'need to do'

Fixed.

>
> +   else if (TransactionIdDidAbort(xid))
> +       return FDWXACT_STATUS_ABORTING;
> +
> the 'else' can be omitted since the preceding if would return.

Fixed.

>
> +   if (max_prepared_foreign_xacts <= 0)
>
> I wonder when the value for max_prepared_foreign_xacts would be negative (and whether that should be considered an error).
>

Fixed to (max_prepared_foreign_xacts == 0)

Attached the updated version patch set.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:
For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :

+   entry->changing_xact_state = true;
...
+   entry->changing_xact_state = abort_cleanup_failure;

I don't see a return statement between the two assignments. I wonder why entry->changing_xact_state is set to true and then later assigned again.

For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :

bq. This commits introduces to new background processes: foreign

commits introduces to new -> commit introduces two new

+FdwXactExistsXid(TransactionId xid)

Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.

+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group

Please correct year in the next patch set.

+FdwXactLauncherRequestToLaunch(void)

Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.

+/* Report shared memory space needed by FdwXactRsoverShmemInit */
+Size
+FdwXactRslvShmemSize(void)

Are both Rsover and Rslv referring to resolver? It would be better to use the whole word, which reduces confusion.
Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit).

+fdwxact_launch_resolver(Oid dbid)

The above method is not in camel case. It would be better if method names are consistent (in casing).

+                errmsg("out of foreign transaction resolver slots"),
+                errhint("You might need to increase max_foreign_transaction_resolvers.")));

It would be nice to include the value of max_foreign_xact_resolvers.

For fdwxact_resolver_onexit():

+       LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+       fdwxact->locking_backend = InvalidBackendId;
+       LWLockRelease(FdwXactLock);

There is no call inside the for loop that may take time. I wonder if the lock can be acquired before the for loop and released after it.
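The suggestion can be illustrated with a small sketch (a hypothetical
Python analogue using a dict per entry and threading.Lock; not the
patch's actual C code): when the per-item work under the lock is short
and never blocks, one acquire/release around the whole loop avoids the
cost of reacquiring the lock for every item.

```python
# Hoisting a lock out of a loop: both functions leave the entries in
# the same state, but the batched version takes the lock only once.
from threading import Lock

INVALID_BACKEND_ID = -1
fdwxact_lock = Lock()  # stand-in for FdwXactLock

def release_entries_per_item(entries):
    """Original pattern: lock acquired and released once per entry."""
    for e in entries:
        with fdwxact_lock:
            e["locking_backend"] = INVALID_BACKEND_ID

def release_entries_batched(entries):
    """Suggested pattern: one acquisition covering the whole loop."""
    with fdwxact_lock:
        for e in entries:
            e["locking_backend"] = INVALID_BACKEND_ID
```

The trade-off is lock hold time: holding the lock across the loop is
only reasonable if the body is cheap, since it delays other backends
waiting on the same lock.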

+FXRslvLoop(void)

Please use Resolver instead of Rslv

+           FdwXactResolveFdwXacts(held_fdwxacts, nheld);

Fdw and Xact are each repeated twice in the method name. Probably the method name can be made shorter.

Cheers

On Thu, Jan 14, 2021 at 11:04 AM Zhihong Yu <zyu@yugabyte.com> wrote:
Hi,
For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

+   bool        have_notwophase = false;

Maybe name the variable have_no_twophase so that it is easier to read.

+    * Two-phase commit is not required if the number of servers performed

performed -> performing

+                errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
+                errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));

The lines are really long. Please wrap into more lines.



On Wed, Jan 13, 2021 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,

Thank you for reviewing the patch!

> For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>
> However these functions are not neither committed nor aborted at
>
> I think the double negation was not intentional. Should be 'are neither ...'

Fixed.

>
> For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?

Hmm, you mean that we need MAXALIGN(size) after adding the size of
FdwXactData structs?

Size
FdwXactShmemSize(void)
{
    Size        size;

    /* Size for foreign transaction information array */
    size = offsetof(FdwXactCtlData, fdwxacts);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXact)));
    size = MAXALIGN(size);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXactData)));

    return size;
}

I don't think we need to do that. Other similar code, such as
TwoPhaseShmemSize(), doesn't do that either. Why do you think we need it?

>
> +       fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>
> For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?

Agreed. Changed to FdwXactInsertEntry().

>
> +        * we don't need to anything for this participant because all foreign
>
> 'need to' -> 'need to do'

Fixed.

>
> +   else if (TransactionIdDidAbort(xid))
> +       return FDWXACT_STATUS_ABORTING;
> +
> the 'else' can be omitted since the preceding if would return.

Fixed.

>
> +   if (max_prepared_foreign_xacts <= 0)
>
> I wonder when the value for max_prepared_foreign_xacts would be negative (and whether that should be considered an error).
>

Fixed to (max_prepared_foreign_xacts == 0)

Attached the updated version patch set.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jan 15, 2021 at 4:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> +   bool        have_notwophase = false;
>
> Maybe name the variable have_no_twophase so that it is easier to read.

Fixed.

>
> +    * Two-phase commit is not required if the number of servers performed
>
> performed -> performing

Fixed.

>
> +                errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
> +                errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));
>
> The lines are really long. Please wrap into more lines.

Hmm, we can do that but if we do that, it makes grepping by the error
message hard. Please refer to the documentation about the formatting
guideline[1]:

Limit line lengths so that the code is readable in an 80-column
window. (This doesn't mean that you must never go past 80 columns. For
instance, breaking a long error message string in arbitrary places
just to keep the code within 80 columns is probably not a net gain in
readability.)

These changes have been made in the local branch. I'll post the
updated patch set after incorporating all the comments.

Regards,

[1] https://www.postgresql.org/docs/devel/source-format.html

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:
Hi,
For v32-0004-Add-PrepareForeignTransaction-API.patch :

+ * Whenever a foreign transaction is processed, the corresponding FdwXact
+ * entry is update.     To avoid holding the lock during transaction processing
+ * which may take an unpredicatable time the in-memory data of foreign

entry is update -> entry is updated

unpredicatable -> unpredictable

+   int         nlefts = 0;

nlefts -> nremaining

+       elog(DEBUG1, "left %u foreign transactions", nlefts);

The message can be phrased as "%u foreign transactions remaining"

+FdwXactResolveFdwXacts(int *fdwxact_idxs, int nfdwxacts)

Fdw and Xact are repeated. Seems one should suffice. How about naming the method FdwXactResolveTransactions() ?
Similar comment for FdwXactResolveOneFdwXact(FdwXact fdwxact)

For get_fdwxact():

+       /* This entry matches the condition */
+       found = true;
+       break;

Instead of breaking and returning, you can return within the loop directly.

Cheers

On Thu, Jan 14, 2021 at 9:17 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jan 15, 2021 at 4:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> +   bool        have_notwophase = false;
>
> Maybe name the variable have_no_twophase so that it is easier to read.

Fixed.

>
> +    * Two-phase commit is not required if the number of servers performed
>
> performed -> performing

Fixed.

>
> +                errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
> +                errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));
>
> The lines are really long. Please wrap into more lines.

Hmm, we can do that but if we do that, it makes grepping by the error
message hard. Please refer to the documentation about the formatting
guideline[1]:

Limit line lengths so that the code is readable in an 80-column
window. (This doesn't mean that you must never go past 80 columns. For
instance, breaking a long error message string in arbitrary places
just to keep the code within 80 columns is probably not a net gain in
readability.)

These changes have been made in the local branch. I'll post the
updated patch set after incorporating all the comments.

Regards,

[1] https://www.postgresql.org/docs/devel/source-format.html

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>
> +   entry->changing_xact_state = true;
> ...
> +   entry->changing_xact_state = abort_cleanup_failure;
>
> I don't see return statement in between the two assignments. I wonder why entry->changing_xact_state is set to true, and later being assigned again.

Because postgresRollbackForeignTransaction() can get called again in the
case where an error occurred while aborting and cleaning up the
transaction. For example, if an error occurred when executing ABORT
TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
postgresRollbackForeignTransaction() will get called again while
entry->changing_xact_state is still true. Then the entry will be
caught by the following condition and cleaned up:

    /*
     * If connection is before starting transaction or is already unsalvageable,
     * do only the cleanup and don't touch it further.
     */
    if (entry->changing_xact_state)
    {
        pgfdw_cleanup_after_transaction(entry);
        return;
    }

>
> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>
> bq. This commits introduces to new background processes: foreign
>
> commits introduces to new -> commit introduces two new

Fixed.

>
> +FdwXactExistsXid(TransactionId xid)
>
> Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.

But there is already a function named FdwXactExists()?

bool
FdwXactExists(Oid dbid, Oid serverid, Oid userid)

As far as I read other code, we already have such functions that have
the same functionality but have different arguments. For instance,
SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
we can leave it as it is, but would it be better to have something like
FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?

>
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> Please correct year in the next patch set.

Fixed.

>
> +FdwXactLauncherRequestToLaunch(void)
>
> Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.

Agreed. How about FdwXactRequestToLaunchResolver()?

>
> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
> +Size
> +FdwXactRslvShmemSize(void)
>
> Are both Rsover and Rslv referring to resolver ? It would be better to use whole word which reduces confusion.
> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit)

Agreed. I realized that these functions are the launcher's function,
not resolver's. So I'd change to FdwXactLauncherShmemSize() and
FdwXactLauncherShmemInit() respectively.

>
> +fdwxact_launch_resolver(Oid dbid)
>
> The above method is not in camel case. It would be better if method names are consistent (in casing).

Fixed.

>
> +                errmsg("out of foreign transaction resolver slots"),
> +                errhint("You might need to increase max_foreign_transaction_resolvers.")));
>
> It would be nice to include the value of max_foreign_xact_resolvers

I agree it would be nice but looking at other code we don't include
the value in this kind of message.

>
> For fdwxact_resolver_onexit():
>
> +       LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
> +       fdwxact->locking_backend = InvalidBackendId;
> +       LWLockRelease(FdwXactLock);
>
> There is no call to method inside the for loop which may take time. I wonder if the lock can be obtained prior to the for loop and released coming out of the for loop.

Agreed.

>
> +FXRslvLoop(void)
>
> Please use Resolver instead of Rslv

Fixed.

>
> +           FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>
> Fdw and Xact are repeated twice each in the method name. Probably the method name can be made shorter.

Fixed.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:
Hi, Masahiko-san:

bq. How about FdwXactRequestToLaunchResolver()?

Sounds good to me.

bq. But there is already a function named FdwXactExists()

Then we can leave the function name as it is.

Cheers

On Sun, Jan 17, 2021 at 9:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>
> +   entry->changing_xact_state = true;
> ...
> +   entry->changing_xact_state = abort_cleanup_failure;
>
> I don't see return statement in between the two assignments. I wonder why entry->changing_xact_state is set to true, and later being assigned again.

Because postgresRollbackForeignTransaction() can get called again in
case where an error occurred during aborting and cleanup the
transaction. For example, if an error occurred when executing ABORT
TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
postgresRollbackForeignTransaction() will get called again while
entry->changing_xact_state is still true. Then the entry will be
caught by the following condition and cleaned up:

    /*
     * If connection is before starting transaction or is already unsalvageable,
     * do only the cleanup and don't touch it further.
     */
    if (entry->changing_xact_state)
    {
        pgfdw_cleanup_after_transaction(entry);
        return;
    }

>
> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>
> bq. This commits introduces to new background processes: foreign
>
> commits introduces to new -> commit introduces two new

Fixed.

>
> +FdwXactExistsXid(TransactionId xid)
>
> Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.

But there is already a function named FdwXactExists()?

bool
FdwXactExists(Oid dbid, Oid serverid, Oid userid)

As far as I read other code, we already have such functions that have
the same functionality but have different arguments. For instance,
SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
we can leave as it is but is it better to have like
FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?

>
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> Please correct year in the next patch set.

Fixed.

>
> +FdwXactLauncherRequestToLaunch(void)
>
> Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.

Agreed. How about FdwXactRequestToLaunchResolver()?

>
> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
> +Size
> +FdwXactRslvShmemSize(void)
>
> Are both Rsover and Rslv referring to resolver ? It would be better to use whole word which reduces confusion.
> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit)

Agreed. I realized that these functions are the launcher's function,
not resolver's. So I'd change to FdwXactLauncherShmemSize() and
FdwXactLauncherShmemInit() respectively.

>
> +fdwxact_launch_resolver(Oid dbid)
>
> The above method is not in camel case. It would be better if method names are consistent (in casing).

Fixed.

>
> +                errmsg("out of foreign transaction resolver slots"),
> +                errhint("You might need to increase max_foreign_transaction_resolvers.")));
>
> It would be nice to include the value of max_foreign_xact_resolvers

I agree it would be nice but looking at other code we don't include
the value in this kind of messages.

>
> For fdwxact_resolver_onexit():
>
> +       LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
> +       fdwxact->locking_backend = InvalidBackendId;
> +       LWLockRelease(FdwXactLock);
>
> There is no call to method inside the for loop which may take time. I wonder if the lock can be obtained prior to the for loop and released coming out of the for loop.

Agreed.

>
> +FXRslvLoop(void)
>
> Please use Resolver instead of Rslv

Fixed.

>
> +           FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>
> Fdw and Xact are repeated twice each in the method name. Probably the method name can be made shorter.

Fixed.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/01/18 14:54, Masahiko Sawada wrote:
> On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>>
>> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>>
>> +   entry->changing_xact_state = true;
>> ...
>> +   entry->changing_xact_state = abort_cleanup_failure;
>>
>> I don't see return statement in between the two assignments. I wonder why entry->changing_xact_state is set to true, and later being assigned again.
> 
> Because postgresRollbackForeignTransaction() can get called again in
> case where an error occurred during aborting and cleanup the
> transaction. For example, if an error occurred when executing ABORT
> TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
> postgresRollbackForeignTransaction() will get called again while
> entry->changing_xact_state is still true. Then the entry will be
> caught by the following condition and cleaned up:
> 
>      /*
>       * If connection is before starting transaction or is already unsalvageable,
>       * do only the cleanup and don't touch it further.
>       */
>      if (entry->changing_xact_state)
>      {
>          pgfdw_cleanup_after_transaction(entry);
>          return;
>      }
> 
>>
>> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>>
>> bq. This commits introduces to new background processes: foreign
>>
>> commits introduces to new -> commit introduces two new
> 
> Fixed.
> 
>>
>> +FdwXactExistsXid(TransactionId xid)
>>
>> Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.
> 
> But there is already a function named FdwXactExists()?
> 
> bool
> FdwXactExists(Oid dbid, Oid serverid, Oid userid)
> 
> As far as I read other code, we already have such functions that have
> the same functionality but have different arguments. For instance,
> SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
> we can leave as it is but is it better to have like
> FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?
> 
>>
>> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>>
>> Please correct year in the next patch set.
> 
> Fixed.
> 
>>
>> +FdwXactLauncherRequestToLaunch(void)
>>
>> Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.
> 
> Agreed. How about FdwXactRequestToLaunchResolver()?
> 
>>
>> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
>> +Size
>> +FdwXactRslvShmemSize(void)
>>
>> Are both Rsover and Rslv referring to resolver ? It would be better to use whole word which reduces confusion.
>> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit)
> 
> Agreed. I realized that these functions are the launcher's function,
> not resolver's. So I'd change to FdwXactLauncherShmemSize() and
> FdwXactLauncherShmemInit() respectively.
> 
>>
>> +fdwxact_launch_resolver(Oid dbid)
>>
>> The above method is not in camel case. It would be better if method names are consistent (in casing).
> 
> Fixed.
> 
>>
>> +                errmsg("out of foreign transaction resolver slots"),
>> +                errhint("You might need to increase max_foreign_transaction_resolvers.")));
>>
>> It would be nice to include the value of max_foreign_xact_resolvers
> 
> I agree it would be nice but looking at other code we don't include
> the value in this kind of messages.
> 
>>
>> For fdwxact_resolver_onexit():
>>
>> +       LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
>> +       fdwxact->locking_backend = InvalidBackendId;
>> +       LWLockRelease(FdwXactLock);
>>
>> There is no call to method inside the for loop which may take time. I wonder if the lock can be obtained prior to the for loop and released coming out of the for loop.
> 
> Agreed.
> 
>>
>> +FXRslvLoop(void)
>>
>> Please use Resolver instead of Rslv
> 
> Fixed.
> 
>>
>> +           FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>>
>> Fdw and Xact are repeated twice each in the method name. Probably the method name can be made shorter.
> 
> Fixed.

You fixed some issues. But maybe you forgot to attach the latest patches?

I'm reading 0001 and 0002 patches to pick up the changes for postgres_fdw that are worth applying independently from the 2PC feature. If there are such changes, IMO we can apply them in advance, which would make the patches simpler.

+    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+        ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                               frstate->server->servername)));

You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of the call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.

Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.

Of course, if this change is not essential, we can skip doing this in the first version.

-    /*
-     * Regardless of the event type, we can now mark ourselves as out of the
-     * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-     * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-     */
-    xact_got_connection = false;

With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback() using
xact_got_connection?

+    /* Also reset cursor numbering for next transaction */
+    cursor_number = 0;

Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time when a
foreign transaction ends at each connection. This change would be harmless fortunately in practice, but seems not right
theoretically.

This makes me wonder if the new FDW API is not good at handling the case where some operations need to be performed once per transaction end.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
<masao.fujii@oss.nttdata.com> wrote:
>
>
> You fixed some issues. But maybe you forgot to attach the latest patches?

Yes, I've attached the updated patches.

>
> I'm reading 0001 and 0002 patches to pick up the changes for postgres_fdw that worth applying independent from 2PC feature. If there are such changes, IMO we can apply them in advance, and which would make the patches simpler.

Thank you for reviewing the patches!

>
> +       if (PQresultStatus(res) != PGRES_COMMAND_OK)
> +               ereport(ERROR, (errmsg("could not commit transaction on server %s",
> +                                                          frstate->server->servername)));
>
> You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.

Right. It's better to use do_sql_command() instead.

> Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.
>
> Of course, if this change is not essential, we can skip doing this in the first version.

Yes, I think it's not essential for now. We can improve it later if we want.

>
> -       /*
> -        * Regardless of the event type, we can now mark ourselves as out of the
> -        * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
> -        * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
> -        */
> -       xact_got_connection = false;
>
> With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback() using xact_got_connection?

I think xact_got_connection is set to false in
pgfdw_cleanup_after_transaction() that is called at the end of each
foreign transaction (i.e., in postgresCommitForeignTransaction() and
postgresRollbackForeignTransaction()).

But as you're concerned below, it's reset for each foreign transaction
end rather than the parent's transaction end.

>
> +       /* Also reset cursor numbering for next transaction */
> +       cursor_number = 0;
>
> Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time when a foreign transaction ends at each connection. This change would be harmless fortunately in practice, but seems not right theoretically.
>
> This makes me wonder if new FDW API is not good at handling the case where some operations need to be performed once per transaction end.

I think that the problem comes from the fact that FDW needs to use
both SubXactCallback and the new FDW API.

If we want to perform some operations at the end of the top
transaction per FDW, not per foreign transaction, we will either still
need to use XactCallback or need to rethink the FDW API design. But
given that we call commit and rollback FDW API for only foreign
servers that actually started a transaction, I’m not sure if there are
such operations in practice. IIUC there is not at least from the
normal (not-sub) transaction termination perspective.

IIUC xact_got_connection is used to skip iterating over all cached
connections to find open remote (sub) transactions. This is not
necessary anymore at least from the normal transaction termination
perspective. So maybe we can improve it so that it tracks whether any
of the cached connections opened a subtransaction. That is, we set it
true when we created a savepoint on any connections and set it false
at the end of pgfdw_subxact_callback() if we see that xact_depth of
all cached entry is less than or equal to 1 after iterating over all
entries.

Regarding cursor_number, it essentially needs to be unique at least
within a transaction so we can manage it per transaction or per
connection. But the current postgres_fdw rather ensures uniqueness
across all connections. So it seems to me that this can be fixed by
making individual connection have cursor_number and resetting it in
pgfdw_cleanup_after_transaction(). I think this can be in a separate
patch. Terminating subtransactions via an FDW API could also solve this
problem, but I don't think that's a good idea.

What do you think?


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, Jan 16, 2021 at 1:39 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,

Thank you for reviewing the patch!

> For v32-0004-Add-PrepareForeignTransaction-API.patch :
>
> + * Whenever a foreign transaction is processed, the corresponding FdwXact
> + * entry is update.     To avoid holding the lock during transaction processing
> + * which may take an unpredicatable time the in-memory data of foreign
>
> entry is update -> entry is updated
>
> unpredicatable -> unpredictable

Fixed.
>
> +   int         nlefts = 0;
>
> nlefts -> nremaining
>
> +       elog(DEBUG1, "left %u foreign transactions", nlefts);
>
> The message can be phrased as "%u foreign transactions remaining"

Fixed.

>
> +FdwXactResolveFdwXacts(int *fdwxact_idxs, int nfdwxacts)
>
> Fdw and Xact are repeated. Seems one should suffice. How about naming the method FdwXactResolveTransactions() ?
> Similar comment for FdwXactResolveOneFdwXact(FdwXact fdwxact)

Agreed. I changed to ResolveFdwXacts() and ResolveOneFdwXact()
respectively to avoid a long function name.

>
> For get_fdwxact():
>
> +       /* This entry matches the condition */
> +       found = true;
> +       break;
>
> Instead of breaking and returning, you can return within the loop directly.

Fixed.

Those changes are incorporated into the latest version patches[1] I
submitted today.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBYyA5O%2BFPN4Cs9YWiKjq319BvF5fYmKNsFTZfwTcWjQw%40mail.gmail.com

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/01/27 14:08, Masahiko Sawada wrote:
> On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>> You fixed some issues. But maybe you forgot to attach the latest patches?
> 
> Yes, I've attached the updated patches.

Thanks for updating the patch! I tried to review 0001 and 0002 as the self-contained change.

+ * An FDW that implements both commit and rollback APIs can request to register
+ * the foreign transaction by FdwXactRegisterXact() to participate it to a
+ * group of distributed tranasction.  The registered foreign transactions are
+ * identified by OIDs of server and user.

I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction can have the same combination of OIDs of server and user. For example, the following two SELECT queries start different foreign transactions but their user OID is the same. OID of user mapping should be used instead of OID of user?

     CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
     CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
     CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
     CREATE TABLE t(i int);
     CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
     BEGIN;
     SELECT * FROM ft;
     DROP USER MAPPING FOR postgres SERVER loopback ;
     SELECT * FROM ft;
     COMMIT;

+    /* Commit foreign transactions if any */
+    AtEOXact_FdwXact(true);

Don't we need to pass the XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT flag? Probably we don't need to do this if postgres_fdw is the only user of this new API. But if we make this new API a generic one, such flags seem necessary so that some foreign data wrappers might have different behaviors for those flags.

Because of the same reason as above, AtEOXact_FdwXact() should also be called after
CallXactCallbacks(is_parallel_worker? XACT_EVENT_PARALLEL_COMMIT : XACT_EVENT_COMMIT)?

+    /*
+     * Abort foreign transactions if any.  This needs to be done before marking
+     * this transaction as not running since FDW's transaction callbacks might
+     * assume this transaction is still in progress.
+     */
+    AtEOXact_FdwXact(false);

Same as above.

+/*
+ * This function is called at PREPARE TRANSACTION.  Since we don't support
+ * preparing foreign transactions yet, raise an error if the local transaction
+ * has any foreign transaction.
+ */
+void
+AtPrepare_FdwXact(void)
+{
+    if (FdwXactParticipants != NIL)
+        ereport(ERROR,
+                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                 errmsg("cannot PREPARE a transaction that has operated on foreign tables")));
+}

This means that some foreign data wrappers supporting the prepare transaction (though I'm not sure if such wrappers actually exist or not) cannot use the new API? If we want to allow those wrappers to use the new API, AtPrepare_FdwXact() should call the prepare callback and each wrapper should emit an error within the callback if necessary.

+    foreach(lc, FdwXactParticipants)
+    {
+        FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
+
+        if (fdw_part->server->serverid == serverid &&
+            fdw_part->usermapping->userid == userid)

Isn't this inefficient when starting lots of foreign transactions, because we need to scan all the entries in the list every time?

+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+    bool        found;
+    ConnCacheEntry *entry;
+    ConnCacheKey key;
+
+    /* First time through, initialize connection cache hashtable */
+    if (ConnectionHash == NULL)
+    {
+        HASHCTL        ctl;
+
+        ctl.keysize = sizeof(ConnCacheKey);
+        ctl.entrysize = sizeof(ConnCacheEntry);
+        ConnectionHash = hash_create("postgres_fdw connections", 8,
+                                     &ctl,
+                                     HASH_ELEM | HASH_BLOBS);

Currently ConnectionHash is created under TopMemoryContext. With the patch, since GetConnectionCacheEntry() can be called in other places, ConnectionHash may be created under a memory context other than TopMemoryContext? If so, is that safe?

-        if (PQstatus(entry->conn) != CONNECTION_OK ||
-            PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-            entry->changing_xact_state ||
-            entry->invalidated)
...
+    if (PQstatus(entry->conn) != CONNECTION_OK ||
+        PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+        entry->changing_xact_state)

Why did you get rid of the condition "entry->invalidated"?


> 
>>
>> I'm reading 0001 and 0002 patches to pick up the changes for postgres_fdw that worth applying independent from 2PC feature. If there are such changes, IMO we can apply them in advance, and which would make the patches simpler.
> 
> Thank you for reviewing the patches!
> 
>>
>> +       if (PQresultStatus(res) != PGRES_COMMAND_OK)
>> +               ereport(ERROR, (errmsg("could not commit transaction on server %s",
>> +                                                          frstate->server->servername)));
>>
>> You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.
> 
> Right. It's better to use do_sql_command() instead.
> 
>> Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.
>>
>> Of course, if this change is not essential, we can skip doing this in the first version.
> 
> Yes, I think it's not essential for now. We can improve it later if we want.
> 
>>
>> -       /*
>> -        * Regardless of the event type, we can now mark ourselves as out of the
>> -        * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
>> -        * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
>> -        */
>> -       xact_got_connection = false;
>>
>> With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback()
>> using xact_got_connection?
 
> 
> I think xact_got_connection is set to false in
> pgfdw_cleanup_after_transaction() that is called at the end of each
> foreign transaction (i.e., in postgresCommitForeignTransaction() and
> postgresRollbackForeignTransaction()).
> 
> But as you're concerned below, it's reset for each foreign transaction
> end rather than the parent's transaction end.
> 
>>
>> +       /* Also reset cursor numbering for next transaction */
>> +       cursor_number = 0;
>>
>> Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time when
>> a foreign transaction ends at each connection. This change would fortunately be harmless in practice, but seems not
>> right theoretically.
 
>>
>> This makes me wonder if the new FDW API is not good at handling the case where some operations need to be performed
>> once per transaction end.
 
> 
> I think that the problem comes from the fact that FDW needs to use
> both SubXactCallback and new FDW API.
> 
> If we want to perform some operations at the end of the top
> transaction per FDW, not per foreign transaction, we will either still
> need to use XactCallback or need to rethink the FDW API design. But
> given that we call commit and rollback FDW API for only foreign
> servers that actually started a transaction, I’m not sure if there are
> such operations in practice. IIUC there is not at least from the
> normal (not-sub) transaction termination perspective.

One feature in my mind that may not match with this new API is to perform transaction commits on multiple servers in
parallel. That's something like the following. As far as I can recall, another proposed version of the 2PC-on-postgres_fdw
patch included that feature. If we want to implement this to increase the performance of transaction commit in the
future, I'm afraid that the new API will prevent that.

     foreach(foreign transactions)
         send commit command

     foreach(foreign transactions)
         wait for reply of commit

On second thought, the new per-transaction commit/rollback callback is essential when users or the resolver process want
to resolve the specified foreign transaction, but not essential when backends commit/rollback foreign transactions. That
is, even if we add the per-transaction new API for users and the resolver process, backends can still use
CallXactCallbacks() when they commit/rollback foreign transactions. Is this understanding right?
 


> 
> IIUC xact_got_connection is used to skip iterating over all cached
> connections to find open remote (sub) transactions. This is not
> necessary anymore at least from the normal transaction termination
> perspective. So maybe we can improve it so that it tracks whether any
> of the cached connections opened a subtransaction. That is, we set it
> true when we created a savepoint on any connections and set it false
> at the end of pgfdw_subxact_callback() if we see that xact_depth of
> all cached entries is less than or equal to 1 after iterating over all
> entries.

OK.


> Regarding cursor_number, it essentially needs to be unique at least
> within a transaction so we can manage it per transaction or per
> connection. But the current postgres_fdw rather ensure uniqueness
> across all connections. So it seems to me that this can be fixed by
> making individual connection have cursor_number and resetting it in
> pgfdw_cleanup_after_transaction(). I think this can be in a separate
> patch.

Maybe, so let's work on this later, at least after we confirm that
this change is really necessary.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2021/01/27 14:08, Masahiko Sawada wrote:
> > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> > <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>
> >> You fixed some issues. But maybe you forgot to attach the latest patches?
> >
> > Yes, I've attached the updated patches.
>
> Thanks for updating the patch! I tried to review 0001 and 0002 as the self-contained change.
>
> + * An FDW that implements both commit and rollback APIs can request to register
> + * the foreign transaction by FdwXactRegisterXact() to participate it to a
> + * group of distributed tranasction.  The registered foreign transactions are
> + * identified by OIDs of server and user.
>
> I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction can
> have the same combination of OIDs of server and user. For example, the following two SELECT queries start different
> foreign transactions but their user OID is the same. OID of user mapping should be used instead of OID of user?
>
>      CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
>      CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
>      CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
>      CREATE TABLE t(i int);
>      CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
>      BEGIN;
>      SELECT * FROM ft;
>      DROP USER MAPPING FOR postgres SERVER loopback ;
>      SELECT * FROM ft;
>      COMMIT;

Good catch. I've considered using user mapping OID or a pair of user
mapping OID and server OID as a key of foreign transactions but I
think it also has a problem if an FDW caches the connection by pair of
server OID and user OID whereas the core identifies them by user
mapping OID. For instance, mysql_fdw manages connections by pair of
server OID and user OID.

For example, let's consider the following execution:

BEGIN;
SET ROLE user_A;
INSERT INTO ft1 VALUES (1);
SET ROLE user_B;
INSERT INTO ft1 VALUES (1);
COMMIT;

Suppose that an FDW identifies the connections by {server OID, user
OID} and the core GTM identifies the transactions by user mapping OID,
and user_A and user_B use the public user mapping to connect to server_X.
In the FDW, there are two connections identified by {user_A, server_X}
and {user_B, server_X} respectively, and therefore two transactions are
opened, one on each connection, while GTM has only one FdwXact entry
because the two connections refer to the same user mapping OID. As a
result, at the end of the transaction, GTM ends only one foreign
transaction, leaving another one.

Using user mapping OID seems natural to me but I'm concerned that
changing the role in the middle of a transaction is more likely to
happen than dropping the public user mapping, but I'm not sure. We
would need to find a better way.

>
> +       /* Commit foreign transactions if any */
> +       AtEOXact_FdwXact(true);
>
> Don't we need to pass XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT flag? Probably we don't need to do this
> if postgres_fdw is the only user of this new API. But if we make this new API a generic one, such flags seem necessary
> so that some foreign data wrappers might have different behaviors for those flags.
>
> Because of the same reason as above, AtEOXact_FdwXact() should also be called after
> CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT : XACT_EVENT_COMMIT)?

Agreed.

In AtEOXact_FdwXact() we call either CommitForeignTransaction() or
RollbackForeignTransaction() with FDWXACT_FLAG_ONEPHASE flag for each
foreign transaction. So for example in commit case, we will call new
FDW APIs in the following order:

1. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT
flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction.
2. Commit locally.
3. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT
flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction.

In the future when we have a new FDW API to prepare foreign
transaction, the sequence will be:

1. Call PrepareForeignTransaction() for each foreign transaction.
2. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT
flag for each foreign transaction.
3. Commit locally.
4. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT
flag for each foreign transaction.

So we expect an FDW that wants to support 2PC not to commit the foreign
transaction if CommitForeignTransaction() is called with the
XACT_EVENT_PARALLEL_PRE_COMMIT flag and no FDWXACT_FLAG_ONEPHASE flag.

>
> +       /*
> +        * Abort foreign transactions if any.  This needs to be done before marking
> +        * this transaction as not running since FDW's transaction callbacks might
> +        * assume this transaction is still in progress.
> +        */
> +       AtEOXact_FdwXact(false);
>
> Same as above.
>
> +/*
> + * This function is called at PREPARE TRANSACTION.  Since we don't support
> + * preparing foreign transactions yet, raise an error if the local transaction
> + * has any foreign transaction.
> + */
> +void
> +AtPrepare_FdwXact(void)
> +{
> +       if (FdwXactParticipants != NIL)
> +               ereport(ERROR,
> +                               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                                errmsg("cannot PREPARE a transaction that has operated on foreign tables")));
> +}
>
> This means that some foreign data wrappers supporting the prepare transaction (though I'm not sure if such wrappers
> actually exist or not) cannot use the new API? If we want to allow those wrappers to use the new API,
> AtPrepare_FdwXact() should call the prepare callback and each wrapper should emit an error within the callback if
> necessary.

I think if we support the prepare callback and allow FDWs to prepare
foreign transactions, we have to call CommitForeignTransaction() on
COMMIT PREPARED for foreign transactions that are associated with the
local prepared transaction. But how can we know which foreign
transactions those are? Even a client who didn’t do PREPARE TRANSACTION
could do COMMIT PREPARED. We would need to store the information of
which foreign transactions are associated with the local transaction
somewhere. The 0004 patch introduces WAL logging along with prepare
API and we store that information to a WAL record. I think it’s better
at this time to disallow PREPARE TRANSACTION when at least one foreign
transaction is registered via FDW API.

>
> +       foreach(lc, FdwXactParticipants)
> +       {
> +               FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
> +
> +               if (fdw_part->server->serverid == serverid &&
> +                       fdw_part->usermapping->userid == userid)
>
> Isn't this inefficient when starting lots of foreign transactions because we need to scan all the entries in the list
> every time?

Agreed. I'll change it to a hash map.

>
> +static ConnCacheEntry *
> +GetConnectionCacheEntry(Oid umid)
> +{
> +       bool            found;
> +       ConnCacheEntry *entry;
> +       ConnCacheKey key;
> +
> +       /* First time through, initialize connection cache hashtable */
> +       if (ConnectionHash == NULL)
> +       {
> +               HASHCTL         ctl;
> +
> +               ctl.keysize = sizeof(ConnCacheKey);
> +               ctl.entrysize = sizeof(ConnCacheEntry);
> +               ConnectionHash = hash_create("postgres_fdw connections", 8,
> +                                                                        &ctl,
> +                                                                        HASH_ELEM | HASH_BLOBS);
>
> Currently ConnectionHash is created under TopMemoryContext. With the patch, since GetConnectionCacheEntry() can be
> called in other places, ConnectionHash may be created under a memory context other than TopMemoryContext? If so,
> is that safe?

hash_create() creates a hash map under TopMemoryContext unless
HASH_CONTEXT is specified. So I think ConnectionHash is still created
in the same memory context.

>
> -               if (PQstatus(entry->conn) != CONNECTION_OK ||
> -                       PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> -                       entry->changing_xact_state ||
> -                       entry->invalidated)
> ...
> +       if (PQstatus(entry->conn) != CONNECTION_OK ||
> +               PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> +               entry->changing_xact_state)
>
> Why did you get rid of the condition "entry->invalidated"?

My bad. I'll fix it.

> >
> > If we want to perform some operations at the end of the top
> > transaction per FDW, not per foreign transaction, we will either still
> > need to use XactCallback or need to rethink the FDW API design. But
> > given that we call commit and rollback FDW API for only foreign
> > servers that actually started a transaction, I’m not sure if there are
> > such operations in practice. IIUC there is not at least from the
> > normal (not-sub) transaction termination perspective.
>
> One feature in my mind that may not match with this new API is to perform transaction commits on multiple servers in
> parallel. That's something like the following. As far as I can recall, another proposed version of the 2PC-on-postgres_fdw
> patch included that feature. If we want to implement this to increase the performance of transaction commit in the
> future, I'm afraid that the new API will prevent that.
>
>      foreach(foreign transactions)
>          send commit command
>
>      foreach(foreign transactions)
>          wait for reply of commit

What I'm thinking is to pass a flag, say FDWXACT_ASYNC, to
Commit/RollbackForeignTransaction() and add a new API to wait for the
operation to complete, say CompleteForeignTransaction(). If
commit/rollback callback in an FDW is called with FDWXACT_ASYNC flag,
it should send the command and immediately return a handle (e.g.,
PQsocket() in postgres_fdw). The GTM gathers the handles and polls
events on them. To complete the command, the GTM calls
CompleteForeignTransaction() to wait for the command to complete.
Please refer to XA specification for details (especially xa_complete()
and TMASYNC flag). A pseudo-code is something like the following:

    foreach (foreign transactions)
        call CommitForeignTransaction(FDWXACT_ASYNC);
        append the returned fd to the array.

    while (true)
    {
        poll event on fds;
        call CompleteForeignTransaction() for fd owner;
        if (success)
            remove fd from the array;

        if (array is empty)
            break;
    }

>
> On second thought, the new per-transaction commit/rollback callback is essential when users or the resolver process
> want to resolve the specified foreign transaction, but not essential when backends commit/rollback foreign
> transactions. That is, even if we add the per-transaction new API for users and the resolver process, backends can
> still use CallXactCallbacks() when they commit/rollback foreign transactions. Is this understanding right?

I haven’t tried that but I think that's possible if we can know
whether the commit/rollback callback (e.g.,
postgresCommitForeignTransaction() etc. in postgres_fdw) is called via
a SQL function (the pg_resolve_foreign_xact() SQL function) or by the
resolver process. That is, we register the foreign transaction via
FdwXactRegisterXact(), do nothing in
postgresCommit/RollbackForeignTransaction() if these are called by the
backend, and perform COMMIT/ROLLBACK in pgfdw_xact_callback() in an
asynchronous manner. On the other hand, if
postgresCommit/RollbackForeignTransaction() is called via a SQL function
or by the resolver, these functions commit/rollback the transaction.

>
> > Regarding cursor_number, it essentially needs to be unique at least
> > within a transaction so we can manage it per transaction or per
> > connection. But the current postgres_fdw rather ensure uniqueness
> > across all connections. So it seems to me that this can be fixed by
> > making individual connection have cursor_number and resetting it in
> > pgfdw_cleanup_after_transaction(). I think this can be in a separate
> > patch.
>
> Maybe, so let's work on this later, at least after we confirm that
> this change is really necessary.

Okay.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >
> >
> >
> > On 2021/01/27 14:08, Masahiko Sawada wrote:
> > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> > > <masao.fujii@oss.nttdata.com> wrote:
> > >>
> > >>
> > >> You fixed some issues. But maybe you forgot to attach the latest patches?
> > >
> > > Yes, I've attached the updated patches.
> >
> > Thanks for updating the patch! I tried to review 0001 and 0002 as the self-contained change.
> >
> > + * An FDW that implements both commit and rollback APIs can request to register
> > + * the foreign transaction by FdwXactRegisterXact() to participate it to a
> > + * group of distributed tranasction.  The registered foreign transactions are
> > + * identified by OIDs of server and user.
> >
> > I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction
> > can have the same combination of OIDs of server and user. For example, the following two SELECT queries start
> > different foreign transactions but their user OID is the same. OID of user mapping should be used instead of OID of
> > user?
> >
> >      CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
> >      CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
> >      CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
> >      CREATE TABLE t(i int);
> >      CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
> >      BEGIN;
> >      SELECT * FROM ft;
> >      DROP USER MAPPING FOR postgres SERVER loopback ;
> >      SELECT * FROM ft;
> >      COMMIT;
>
> Good catch. I've considered using user mapping OID or a pair of user
> mapping OID and server OID as a key of foreign transactions but I
> think it also has a problem if an FDW caches the connection by pair of
> server OID and user OID whereas the core identifies them by user
> mapping OID. For instance, mysql_fdw manages connections by pair of
> server OID and user OID.
>
> For example, let's consider the following execution:
>
> BEGIN;
> SET ROLE user_A;
> INSERT INTO ft1 VALUES (1);
> SET ROLE user_B;
> INSERT INTO ft1 VALUES (1);
> COMMIT;
>
> Suppose that an FDW identifies the connections by {server OID, user
> OID} and the core GTM identifies the transactions by user mapping OID,
> and user_A and user_B use the public user mapping to connect to server_X.
> In the FDW, there are two connections identified by {user_A, server_X}
> and {user_B, server_X} respectively, and therefore two transactions are
> opened, one on each connection, while GTM has only one FdwXact entry
> because the two connections refer to the same user mapping OID. As a
> result, at the end of the transaction, GTM ends only one foreign
> transaction, leaving another one.
>
> Using user mapping OID seems natural to me but I'm concerned that
> changing the role in the middle of a transaction is more likely to
> happen than dropping the public user mapping, but I'm not sure. We
> would need to find a better way.

After more thought, I'm inclined to think it's better to identify
foreign transactions by user mapping OID. The main reason is, I think
FDWs that manage connection caches by a pair of user OID and server OID
potentially have a problem with the scenario Fujii-san mentioned. If an
FDW has to use another user mapping (i.e., connection information) due
to the currently used user mapping being removed, it would have to
disconnect the previous connection because it has to use the same
connection cache. But at that time it doesn't know whether the
transaction will be committed or aborted.

Also, such an FDW has the same problem that postgres_fdw used to have; a
backend establishes multiple connections with the same connection
information if multiple local users use the public user mapping. Even
from the perspective of foreign transaction management, it makes more
sense that foreign transactions correspond to the connections to
foreign servers, not to the local connection information.

I can see that some FDW implementations such as mysql_fdw and
firebird_fdw identify connections by a pair of server OID and user OID
but I think this is because they consulted the old postgres_fdw code. I
suspect that there is no use case where an FDW needs to identify
connections in that way. If the core GTM identifies them by user
mapping OID, we would force those FDWs to change their way, but I think
that change would be the right improvement.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/03/17 12:03, Masahiko Sawada wrote:
> I've attached the updated version patch set.

Thanks for updating the patches! I'm now restarting the review of 2PC because
I'd like to use this feature in PG15.


I think the following logic of resolving and removing the fdwxact entries
by the transaction resolver needs to be fixed.

1. check if pending fdwxact entries exist

HoldInDoubtFdwXacts() checks if there are entries whose condition is
InvalidBackendId and so on. After that, it collects the indexes into the
fdwxacts array. The fdwXactLock is released at the end of this phase.

2. resolve and remove the entries held in the 1st phase.

ResolveFdwXacts() resolves the status of each fdwxact entry using the
indexes. At the end of resolving, the transaction resolver removes the
entry from the fdwxacts array via remove_fdwact().

The entry is removed as follows. Since this moves entries around in the
array, the indexes collected in the 1st phase are no longer meaningful.

/* Remove the entry from active array */
FdwXactCtl->num_fdwxacts--;
FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];

This seems to lead to resolving unexpected fdwxacts and it can cause the
following assertion failure. That's how I noticed. For example, there is
the case where a backend inserts a new fdwxact entry into the free slot
that the resolver removed an entry from right before, and the resolver
accesses the new entry, which doesn't need to be resolved yet, because it
uses the indexes collected in the 1st phase.

Assert(fdwxact->locking_backend == MyBackendId);



The simple solution is to hold fdwXactLock exclusively all the time from
the beginning of the 1st phase to the end of the 2nd phase. But I worry
that the performance impact would become too big...

I came up with two solutions although there may be better solutions.

A. to remove resolved entries at once after resolution for all held entries is
finished

If so, we don't need to take the exclusive lock for a long time. But this
has other problems: pg_remove_foreign_xact() can still remove entries,
and we need to handle failures of resolution.

I wondered if we can solve the first problem by introducing a new lock
like a "removing lock" so that only the processes which hold the lock can
remove the entries. The performance impact is limited since insertion of
fdwxact entries is not blocked by this lock. And the second problem can
be solved using a try-catch block.


B. to merge 1st and 2nd phase

Now, the resolver resolves the entries together. That's the reason why
it's difficult to remove the entries. So it seems the problem can be
solved by executing checking, resolving and removing per entry. I think
it's better since this is simpler than A. If I'm missing something,
please let me know.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Apr 27, 2021 at 10:03 AM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
>
>
>
> On 2021/03/17 12:03, Masahiko Sawada wrote:
> > I've attached the updated version patch set.
>
> Thanks for updating the patches! I'm now restarting to review of 2PC because
> I'd like to use this feature in PG15.

Thank you for reviewing the patch! Much appreciated.

>
>
> I think the following logic of resolving and removing the fdwxact entries
> by the transaction resolver needs to be fixed.
>
> 1. check if pending fdwxact entries exist
>
> HoldInDoubtFdwXacts() checks if there are entries whose condition is
> InvalidBackendId and so on. After that, it collects the indexes into the
> fdwxacts array. The fdwXactLock is released at the end of this phase.
>
> 2. resolve and remove the entries held in the 1st phase.
>
> ResolveFdwXacts() resolves the status of each fdwxact entry using the
> indexes. At the end of resolving, the transaction resolver removes the
> entry from the fdwxacts array via remove_fdwact().
>
> The entry is removed as follows. Since this moves entries around in the
> array, the indexes collected in the 1st phase are no longer meaningful.
>
> /* Remove the entry from active array */
> FdwXactCtl->num_fdwxacts--;
> FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
>
> This seems to lead to resolving unexpected fdwxacts and it can cause the
> following assertion failure. That's how I noticed. For example, there is
> the case where a backend inserts a new fdwxact entry into the free slot
> that the resolver removed an entry from right before, and the resolver
> accesses the new entry, which doesn't need to be resolved yet, because it
> uses the indexes collected in the 1st phase.
>
> Assert(fdwxact->locking_backend == MyBackendId);

Good point. I agree with your analysis.

>
>
>
> The simple solution is to hold fdwXactLock exclusively all the time from
> the beginning of the 1st phase to the end of the 2nd phase. But I worry
> that the performance impact would become too big...
>
> I came up with two solutions although there may be better solutions.
>
> A. to remove resolved entries at once after resolution for all held entries is
> finished
>
> If so, we don't need to take the exclusive lock for a long time. But this
> has other problems: pg_remove_foreign_xact() can still remove entries,
> and we need to handle failures of resolution.
>
> I wondered if we can solve the first problem by introducing a new lock
> like a "removing lock" so that only the processes which hold the lock can
> remove the entries. The performance impact is limited since insertion of
> fdwxact entries is not blocked by this lock. And the second problem can
> be solved using a try-catch block.
>
>
> B. to merge 1st and 2nd phase
>
> Now, the resolver resolves the entries together. That's the reason why
> it's difficult to remove the entries. So it seems the problem can be
> solved by executing checking, resolving and removing per entry. I think
> it's better since this is simpler than A. If I'm missing something,
> please let me know.

It seems to me that solution B would be simpler and better. I'll try
to fix this issue by using solution B and rebase the patch.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> With this commit, the foreign server modified within the transaction marked as 'modified'.
>
> transaction marked -> transaction is marked

Will fix.

>
> +#define IsForeignTwophaseCommitRequested() \
> +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>
> Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named:
> IsForeignTwophaseCommitRequired.

But even if foreign_twophase_commit is
FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
there is only one modified server, right? It seems the name
IsForeignTwophaseCommitRequested is fine.

>
> +static bool
> +checkForeignTwophaseCommitRequired(bool local_modified)
>
> +       if (!ServerSupportTwophaseCommit(fdw_part))
> +           have_no_twophase = true;
> ...
> +   if (have_no_twophase)
> +       ereport(ERROR,
>
> It seems the error case should be reported within the loop. This way, we don't need to iterate the other
> participant(s).
> Accordingly, nserverswritten should be incremented for the local server prior to the loop. The condition in the loop
> would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
 
> have_no_twophase is no longer needed.

Hmm, I think if we process one 2pc-non-capable server first and then
process another, 2pc-capable server, we should raise an error but
cannot detect that.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:


On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> With this commit, the foreign server modified within the transaction marked as 'modified'.
>
> transaction marked -> transaction is marked

Will fix.

>
> +#define IsForeignTwophaseCommitRequested() \
> +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>
> Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.

But even if foreign_twophase_commit is
FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
there is only one modified server, right? It seems the name
IsForeignTwophaseCommitRequested is fine.

>
> +static bool
> +checkForeignTwophaseCommitRequired(bool local_modified)
>
> +       if (!ServerSupportTwophaseCommit(fdw_part))
> +           have_no_twophase = true;
> ...
> +   if (have_no_twophase)
> +       ereport(ERROR,
>
> It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
> Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
> have_no_twophase is no longer needed.

Hmm, I think if we process one 2pc-non-capable server first and then
process another, 2pc-capable server, we should raise an error but
cannot detect that.

Then the check would stay as what you have in the patch:

  if (!ServerSupportTwophaseCommit(fdw_part))

When a non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
have_no_twophase can be dropped.

Thanks
 

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> > Hi,
>> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>>
>> Thank you for reviewing the patch!
>>
>> >
>> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >
>> > transaction marked -> transaction is marked
>>
>> Will fix.
>>
>> >
>> > +#define IsForeignTwophaseCommitRequested() \
>> > +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >
>> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>>
>> But even if foreign_twophase_commit is
>> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> there is only one modified server, right? It seems the name
>> IsForeignTwophaseCommitRequested is fine.
>>
>> >
>> > +static bool
>> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >
>> > +       if (!ServerSupportTwophaseCommit(fdw_part))
>> > +           have_no_twophase = true;
>> > ...
>> > +   if (have_no_twophase)
>> > +       ereport(ERROR,
>> >
>> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> > Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> > have_no_twophase is no longer needed.
>>
>> Hmm, I think if we process one 2pc-non-capable server first and then
>> process another one 2pc-capable server, we should raise an error but
>> cannot detect that.
>
>
> Then the check would stay as what you have in the patch:
>
>   if (!ServerSupportTwophaseCommit(fdw_part))
>
> When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
> have_no_twophase can be dropped.

But if we processed only one non-2pc-capable server, we would raise an
error but should not in that case.

On second thought, I think we can track how many servers are modified
or not capable of 2PC during registration and unregistration. Then we
can check both whether 2PC is required and whether a non-2pc-capable
server is involved without looking through all participants. Thoughts?
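The counter-based idea above can be sketched in C. This is a hypothetical illustration only: the names below (nmodified, nno_twophase, register_participant, foreign_twophase_commit_required) are invented for the sketch and are not the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Keep running counts at participant registration time instead of
 * scanning all participants at commit.  Illustrative names only.
 */
static int nmodified = 0;       /* modified foreign servers */
static int nno_twophase = 0;    /* of those, servers without 2PC support */

static void
register_participant(bool modified, bool supports_twophase)
{
    if (modified)
    {
        nmodified++;
        if (!supports_twophase)
            nno_twophase++;
    }
}

/*
 * Decide in O(1) whether two-phase commit is required.  *error_out is set
 * when two or more servers were written but at least one of them cannot
 * do 2PC (the real code would ereport(ERROR) there).
 */
static bool
foreign_twophase_commit_required(bool local_modified, bool *error_out)
{
    int nwritten = nmodified + (local_modified ? 1 : 0);

    *error_out = false;
    if (nwritten <= 1)
        return false;           /* a single writer can use plain commit */
    if (nno_twophase > 0)
        *error_out = true;
    return true;
}
```

With this shape, the single non-2pc-capable-writer case commits normally, while mixing it with a second writer raises the error, without iterating the participant list at commit time.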

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:


On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> > Hi,
>> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>>
>> Thank you for reviewing the patch!
>>
>> >
>> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >
>> > transaction marked -> transaction is marked
>>
>> Will fix.
>>
>> >
>> > +#define IsForeignTwophaseCommitRequested() \
>> > +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >
>> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>>
>> But even if foreign_twophase_commit is
>> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> there is only one modified server, right? It seems the name
>> IsForeignTwophaseCommitRequested is fine.
>>
>> >
>> > +static bool
>> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >
>> > +       if (!ServerSupportTwophaseCommit(fdw_part))
>> > +           have_no_twophase = true;
>> > ...
>> > +   if (have_no_twophase)
>> > +       ereport(ERROR,
>> >
>> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> > Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> > have_no_twophase is no longer needed.
>>
>> Hmm, I think if we process one 2pc-non-capable server first and then
>> process another one 2pc-capable server, we should raise an error but
>> cannot detect that.
>
>
> Then the check would stay as what you have in the patch:
>
>   if (!ServerSupportTwophaseCommit(fdw_part))
>
> When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
> have_no_twophase can be dropped.

But if we processed only one non-2pc-capable server, we would raise an
error but should not in that case.

On second thought, I think we can track how many servers are modified
or not capable of 2PC during registration and unregistration. Then we
can check both whether 2PC is required and whether a non-2pc-capable
server is involved without looking through all participants. Thoughts?

That is something worth trying.

Thanks
 

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Mon, May 3, 2021 at 11:11 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> >
>> >
>> > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >> >
>> >> > Hi,
>> >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>> >>
>> >> Thank you for reviewing the patch!
>> >>
>> >> >
>> >> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >> >
>> >> > transaction marked -> transaction is marked
>> >>
>> >> Will fix.
>> >>
>> >> >
>> >> > +#define IsForeignTwophaseCommitRequested() \
>> >> > +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >> >
>> >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>> >>
>> >> But even if foreign_twophase_commit is
>> >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> >> there is only one modified server, right? It seems the name
>> >> IsForeignTwophaseCommitRequested is fine.
>> >>
>> >> >
>> >> > +static bool
>> >> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >> >
>> >> > +       if (!ServerSupportTwophaseCommit(fdw_part))
>> >> > +           have_no_twophase = true;
>> >> > ...
>> >> > +   if (have_no_twophase)
>> >> > +       ereport(ERROR,
>> >> >
>> >> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> >> > Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> >> > have_no_twophase is no longer needed.
>> >>
>> >> Hmm, I think if we process one 2pc-non-capable server first and then
>> >> process another one 2pc-capable server, we should raise an error but
>> >> cannot detect that.
>> >
>> >
>> > Then the check would stay as what you have in the patch:
>> >
>> >   if (!ServerSupportTwophaseCommit(fdw_part))
>> >
>> > When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
>> > have_no_twophase can be dropped.
>>
>> But if we processed only one non-2pc-capable server, we would raise an
>> error but should not in that case.
>>
>> On second thought, I think we can track how many servers are modified
>> or not capable of 2PC during registration and unregistration. Then we
>> can check both whether 2PC is required and whether a non-2pc-capable
>> server is involved without looking through all participants. Thoughts?
>
>
> That is something worth trying.
>

I've attached the updated patches that incorporated comments from
Zhihong and Ikeda-san.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Zhihong Yu
Date:


On Mon, May 10, 2021 at 9:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, May 3, 2021 at 11:11 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> >
>> >
>> > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >> >
>> >> > Hi,
>> >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>> >>
>> >> Thank you for reviewing the patch!
>> >>
>> >> >
>> >> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >> >
>> >> > transaction marked -> transaction is marked
>> >>
>> >> Will fix.
>> >>
>> >> >
>> >> > +#define IsForeignTwophaseCommitRequested() \
>> >> > +    (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >> >
>> >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>> >>
>> >> But even if foreign_twophase_commit is
>> >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> >> there is only one modified server, right? It seems the name
>> >> IsForeignTwophaseCommitRequested is fine.
>> >>
>> >> >
>> >> > +static bool
>> >> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >> >
>> >> > +       if (!ServerSupportTwophaseCommit(fdw_part))
>> >> > +           have_no_twophase = true;
>> >> > ...
>> >> > +   if (have_no_twophase)
>> >> > +       ereport(ERROR,
>> >> >
>> >> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> >> > Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> >> > have_no_twophase is no longer needed.
>> >>
>> >> Hmm, I think if we process one 2pc-non-capable server first and then
>> >> process another one 2pc-capable server, we should raise an error but
>> >> cannot detect that.
>> >
>> >
>> > Then the check would stay as what you have in the patch:
>> >
>> >   if (!ServerSupportTwophaseCommit(fdw_part))
>> >
>> > When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
>> > have_no_twophase can be dropped.
>>
>> But if we processed only one non-2pc-capable server, we would raise an
>> error but should not in that case.
>>
>> On second thought, I think we can track how many servers are modified
>> or not capable of 2PC during registration and unregistration. Then we
>> can check both whether 2PC is required and whether a non-2pc-capable
>> server is involved without looking through all participants. Thoughts?
>
>
> That is something worth trying.
>

I've attached the updated patches that incorporated comments from
Zhihong and Ikeda-san.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Hi,
For v36-0005-Prepare-foreign-transactions-at-commit-time.patch :

With this commit, the foreign server modified within the transaction
marked as 'modified'.

The verb is missing from the above sentence. 'within the transaction marked ' -> within the transaction is marked

+   /* true if modified the data on the server */

modified the data -> data is modified

+   xid = GetTopTransactionIdIfAny();
...
+       if (!TransactionIdIsValid(xid))
+           xid = GetTopTransactionId();

I wonder, when the above if condition is true, would GetTopTransactionId() get a valid xid? It seems the two func calls are the same.
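(As an aside, the two calls do differ in PostgreSQL: GetTopTransactionIdIfAny() returns InvalidTransactionId when no top-level xid has been assigned yet, while GetTopTransactionId() assigns one on demand, so the fallback is not redundant. A simplified standalone model of that behavior; the stubs below only mimic the documented semantics and are not the real implementations:)

```c
#include <assert.h>

/* Simplified stand-ins for PostgreSQL's types and macros. */
typedef unsigned int TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

static TransactionId current_xid = InvalidTransactionId;
static TransactionId next_xid = 100;

/* Returns the top-level xid only if one was already assigned. */
static TransactionId
GetTopTransactionIdIfAny(void)
{
    return current_xid;
}

/* Assigns a top-level xid on demand, then returns it. */
static TransactionId
GetTopTransactionId(void)
{
    if (!TransactionIdIsValid(current_xid))
        current_xid = next_xid++;
    return current_xid;
}
```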

I like the way checkForeignTwophaseCommitRequired() is structured.

Cheers

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:
On 2021/05/11 13:37, Masahiko Sawada wrote:
> I've attached the updated patches that incorporated comments from
> Zhihong and Ikeda-san.

Thanks for updating the patches!


I have other comments including trivial things.


a. about "foreign_transaction_resolver_timeout" parameter

Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs.
Is there any reason for that? Although the following is a minor case, it may
confuse some users.

An example case:

1. A client executes a transaction with 2PC while the resolver is processing
FdwXactResolverProcessInDoubtXacts().

2. The resolution of the 1st transaction must wait until other 2PC
transactions are executed or the timeout expires.

3. If the client checks the 1st result value, it must wait until resolution
is finished for atomic visibility (although this depends on how atomic
visibility is realized). The client may wait up to
"foreign_transaction_resolver_timeout". Users may think it's stale.

This situation can be observed after testing with pgbench: some unresolved
transactions remain after benchmarking.

I assume that this default value follows wal_sender, archiver, and so on.
But I think this parameter is more like "commit_delay". If so, 60 seconds
seems to be a big value.


b. about performance bottleneck (just share my simple benchmark results)

The resolver process can be performance bottleneck easily although I think
some users want this feature even if the performance is not so good.

I tested with a very simple workload on my laptop.

The test conditions are:
* two remote foreign partitions, and one transaction inserts an entry into
each partition.
* local connection only. If NW latency became higher, the performance became
worse.
* pgbench with 8 clients.

The test results are the following. The performance with 2PC is only about
10% of the performance without 2PC.

* with foreign_twophase_commit = required
-> If loaded with more than 10 TPS, the number of unresolved foreign
transactions keeps increasing, and the benchmark stops with the warning
"Increase max_prepared_foreign_transactions".

* with foreign_twophase_commit = disabled
-> 122 TPS in my environment.


c. v36-0001-Introduce-transaction-manager-for-foreign-transa.patch

* typo: s/tranasction/transaction/

* Is it better to move AtEOXact_FdwXact() in AbortTransaction() to before "if
(IsInParallelMode())", to make the order the same as in CommitTransaction()?

* functions name of fdwxact.c

Although this depends on my feeling, xact means transaction. If you feel the
same, the function names of FdwXactRegisterXact and so on seem odd to me.
Would FdwXactRegisterEntry or FdwXactRegisterParticipant be better?

* Are the following better?

- s/to register the foreign transaction by/to register the foreign transaction
participant by/

- s/The registered foreign transactions/The registered participants/

- s/given foreign transaction/given foreign transaction participant/

- s/Foreign transactions involved in the current transaction/Foreign
transaction participants involved in the current transaction/


Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
>
> On 2021/05/11 13:37, Masahiko Sawada wrote:
> > I've attached the updated patches that incorporated comments from
> > Zhihong and Ikeda-san.
>
> Thanks for updating the patches!
>
>
> I have other comments including trivial things.
>
>
> a. about "foreign_transaction_resolver_timeout" parameter
>
> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs.
> Is there any reason? Although the following is minor case, it may confuse some
> users.
>
> Example case is that
>
> 1. a client executes transaction with 2PC when the resolver is processing
> FdwXactResolverProcessInDoubtXacts().
>
> 2. the resolution of 1st transaction must be waited until the other
> transactions for 2pc are executed or timeout.
>
> 3. if the client check the 1st result value, it should wait until resolution
> is finished for atomic visibility (although it depends on the way how to
> realize atomic visibility.) The clients may be waited
> foreign_transaction_resolver_timeout". Users may think it's stale.
>
> Like this situation can be observed after testing with pgbench. Some
> unresolved transaction remains after benchmarking.
>
> I assume that this default value refers to wal_sender, archiver, and so on.
> But, I think this parameter is more like "commit_delay". If so, 60 seconds
> seems to be big value.

IIUC this situation seems like foreign transaction resolution is the
bottleneck and doesn’t catch up with incoming resolution requests. But how
does foreign_transaction_resolver_timeout relate to this situation?
foreign_transaction_resolver_timeout controls when to terminate a resolver
process that doesn't have any foreign transactions to resolve. So if we set
it to several milliseconds, resolver processes are terminated immediately
after each resolution, imposing the cost of launching resolver processes on
the next resolution.

>
>
> b. about performance bottleneck (just share my simple benchmark results)
>
> The resolver process can be performance bottleneck easily although I think
> some users want this feature even if the performance is not so good.
>
> I tested with very simple workload in my laptop.
>
> The test condition is
> * two remote foreign partitions and one transaction inserts an entry in each
> partitions.
> * local connection only. If NW latency became higher, the performance became
> worse.
> * pgbench with 8 clients.
>
> The test results is the following. The performance of 2PC is only 10%
> performance of the one of without 2PC.
>
> >> * with foreign_twophase_commit = required
> -> If load with more than 10TPS, the number of unresolved foreign transactions
> is increasing and stop with the warning "Increase
> max_prepared_foreign_transactions".

What was the value of max_prepared_foreign_transactions?

To speed up the foreign transaction resolution, some ideas have been
discussed. As another idea, how about launching resolvers for each
foreign server? That way, we resolve foreign transactions on each
foreign server in parallel. If foreign transactions are concentrated
on the particular server, we can have multiple resolvers for the one
foreign server. It doesn’t change the fact that all foreign
transaction resolutions are processed by resolver processes.

Apart from that, we also might want to improve foreign transaction
management so that a transaction doesn’t end up with an error if the
foreign transaction resolution doesn’t catch up with incoming
transactions that require 2PC. Maybe we can evict and serialize a
state file when FdwXactCtl->xacts[] is full. I’d like to leave it as a
future improvement.

> * with foreign_twophase_commit = disabled
> -> 122TPS in my environments.

How much is the performance without those 2PC patches and with the
same workload? i.e., how fast is the current postgres_fdw that uses
XactCallback?

>
>
> c. v36-0001-Introduce-transaction-manager-for-foreign-transa.patch
>
> * typo: s/tranasction/transaction/
>
> * Is it better to move AtEOXact_FdwXact() in AbortTransaction() to before "if
> (IsInParallelMode())" because make them in the same order as CommitTransaction()?

I'd prefer to move AtEOXact_FdwXact() in CommitTransaction after "if
(IsInParallelMode())" since other pre-commit works are done after
cleaning parallel contexts. What do you think?

>
> * functions name of fdwxact.c
>
> Although this depends on my feeling, xact means transaction. If this feeling
> same as you, the function names of FdwXactRegisterXact and so on are odd to
> me. FdwXactRegisterEntry or FdwXactRegisterParticipant is better?
>

FdwXactRegisterEntry sounds good to me. Thanks.

> * Are the following better?
>
> - s/to register the foreign transaction by/to register the foreign transaction
> participant by/
>
> - s/The registered foreign transactions/The registered participants/
>
> - s/given foreign transaction/given foreign transaction participant/
>
> - s/Foreign transactions involved in the current transaction/Foreign
> transaction participants involved in the current transaction/

Agreed with the above suggestions.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/05/21 10:39, Masahiko Sawada wrote:
> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>>
>>
>> On 2021/05/11 13:37, Masahiko Sawada wrote:
>>> I've attached the updated patches that incorporated comments from
>>> Zhihong and Ikeda-san.
>>
>> Thanks for updating the patches!
>>
>>
>> I have other comments including trivial things.
>>
>>
>> a. about "foreign_transaction_resolver_timeout" parameter
>>
>> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs.
>> Is there any reason? Although the following is minor case, it may confuse some
>> users.
>>
>> Example case is that
>>
>> 1. a client executes transaction with 2PC when the resolver is processing
>> FdwXactResolverProcessInDoubtXacts().
>>
>> 2. the resolution of 1st transaction must be waited until the other
>> transactions for 2pc are executed or timeout.
>>
>> 3. if the client check the 1st result value, it should wait until resolution
>> is finished for atomic visibility (although it depends on the way how to
>> realize atomic visibility.) The clients may be waited
>> foreign_transaction_resolver_timeout". Users may think it's stale.
>>
>> Like this situation can be observed after testing with pgbench. Some
>> unresolved transaction remains after benchmarking.
>>
>> I assume that this default value refers to wal_sender, archiver, and so on.
>> But, I think this parameter is more like "commit_delay". If so, 60 seconds
>> seems to be big value.
> 
> IIUC this situation seems like the foreign transaction resolution is
> bottle-neck and doesn’t catch up to incoming resolution requests. But
> how foreign_transaction_resolver_timeout relates to this situation?
> foreign_transaction_resolver_timeout controls when to terminate the
> resolver process that doesn't have any foreign transactions to
> resolve. So if we set it several milliseconds, resolver processes are
> terminated immediately after each resolution, imposing the cost of
> launching resolver processes on the next resolution.

Thanks for your comments!

No, this situation is not related to whether foreign transaction resolution
is the bottleneck. This issue may happen when the workload has very few
foreign transactions.

If a new foreign transaction comes while the transaction resolver is
processing resolutions via FdwXactResolverProcessInDoubtXacts(), the foreign
transaction waits until the next resolution cycle starts. If no further
foreign transaction comes, it must wait for the timeout before resolution
starts. That is the situation I mentioned.

Thanks for letting me know the side effect of setting the resolution timeout
to several milliseconds. I agree. But why is termination needed? Is there a
possibility of going stale like walsender?


>>
>>
>> b. about performance bottleneck (just share my simple benchmark results)
>>
>> The resolver process can be performance bottleneck easily although I think
>> some users want this feature even if the performance is not so good.
>>
>> I tested with very simple workload in my laptop.
>>
>> The test condition is
>> * two remote foreign partitions and one transaction inserts an entry in each
>> partitions.
>> * local connection only. If NW latency became higher, the performance became
>> worse.
>> * pgbench with 8 clients.
>>
>> The test results is the following. The performance of 2PC is only 10%
>> performance of the one of without 2PC.
>>
>> * with foreign_twophase_commit = required
>> -> If load with more than 10TPS, the number of unresolved foreign transactions
>> is increasing and stop with the warning "Increase
>> max_prepared_foreign_transactions".
> 
> What was the value of max_prepared_foreign_transactions?

Now, I tested with 200.

If each resolution finished very quickly, I thought it would be enough
because 8 clients x 2 partitions = 16, though... But it's difficult to know
a suitable value.


> To speed up the foreign transaction resolution, some ideas have been
> discussed. As another idea, how about launching resolvers for each
> foreign server? That way, we resolve foreign transactions on each
> foreign server in parallel. If foreign transactions are concentrated
> on the particular server, we can have multiple resolvers for the one
> foreign server. It doesn’t change the fact that all foreign
> transaction resolutions are processed by resolver processes.

Awesome! There seems to be another pro: even if a foreign server is
temporarily busy or stopped due to failover, other foreign servers'
transactions can still be resolved.



> Apart from that, we also might want to improve foreign transaction
> management so that transaction doesn’t end up with an error if the
> foreign transaction resolution doesn’t catch up with incoming
> transactions that require 2PC. Maybe we can evict and serialize a
> state file when FdwXactCtl->xacts[] is full. I’d like to leave it as a
> future improvement.

Oh, great! I didn't come up with the idea.

Although I thought the feature would make it difficult to know whether
foreign transactions are resolved stably, DBAs can check the
"pg_foreign_xacts" view now, and it would be enough to log when foreign
transactions are spilled.


>> * with foreign_twophase_commit = disabled
>> -> 122TPS in my environments.
> 
> How much is the performance without those 2PC patches and with the
> same workload? i.e., how fast is the current postgres_fdw that uses
> XactCallback?

OK, I'll test.


>> c. v36-0001-Introduce-transaction-manager-for-foreign-transa.patch
>>
>> * typo: s/tranasction/transaction/
>>
>> * Is it better to move AtEOXact_FdwXact() in AbortTransaction() to before "if
>> (IsInParallelMode())" because make them in the same order as CommitTransaction()?
> 
> I'd prefer to move AtEOXact_FdwXact() in CommitTransaction after "if
> (IsInParallelMode())" since other pre-commit works are done after
> cleaning parallel contexts. What do you think?

OK, I agree.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, May 21, 2021 at 12:45 PM Masahiro Ikeda
<ikedamsh@oss.nttdata.com> wrote:
>
>
>
> On 2021/05/21 10:39, Masahiko Sawada wrote:
> > On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> >>
> >>
> >> On 2021/05/11 13:37, Masahiko Sawada wrote:
> >>> I've attached the updated patches that incorporated comments from
> >>> Zhihong and Ikeda-san.
> >>
> >> Thanks for updating the patches!
> >>
> >>
> >> I have other comments including trivial things.
> >>
> >>
> >> a. about "foreign_transaction_resolver_timeout" parameter
> >>
> >> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs.
> >> Is there any reason? Although the following is minor case, it may confuse some
> >> users.
> >>
> >> Example case is that
> >>
> >> 1. a client executes transaction with 2PC when the resolver is processing
> >> FdwXactResolverProcessInDoubtXacts().
> >>
> >> 2. the resolution of 1st transaction must be waited until the other
> >> transactions for 2pc are executed or timeout.
> >>
> >> 3. if the client check the 1st result value, it should wait until resolution
> >> is finished for atomic visibility (although it depends on the way how to
> >> realize atomic visibility.) The clients may be waited
> >> foreign_transaction_resolver_timeout". Users may think it's stale.
> >>
> >> Like this situation can be observed after testing with pgbench. Some
> >> unresolved transaction remains after benchmarking.
> >>
> >> I assume that this default value refers to wal_sender, archiver, and so on.
> >> But, I think this parameter is more like "commit_delay". If so, 60 seconds
> >> seems to be big value.
> >
> > IIUC this situation seems like the foreign transaction resolution is
> > bottle-neck and doesn’t catch up to incoming resolution requests. But
> > how foreign_transaction_resolver_timeout relates to this situation?
> > foreign_transaction_resolver_timeout controls when to terminate the
> > resolver process that doesn't have any foreign transactions to
> > resolve. So if we set it several milliseconds, resolver processes are
> > terminated immediately after each resolution, imposing the cost of
> > launching resolver processes on the next resolution.
>
> Thanks for your comments!
>
> No, this situation is not related to the foreign transaction resolution is
> bottle-neck or not. This issue may happen when the workload has very few
> foreign transactions.
>
> If new foreign transaction comes while the transaction resolver is processing
> resolutions via FdwXactResolverProcessInDoubtXacts(), the foreign transaction
> waits until starting next transaction resolution. If next foreign transaction
> doesn't come, the foreign transaction must wait starting resolution until
> timeout. I mentioned this situation.

Thanks for your explanation. I think that in this case we should set
the latch of the resolver after preparing all foreign transactions so
that the resolver processes those transactions without sleeping.

>
> Thanks for letting me know the side effect if setting resolution timeout to
> several milliseconds. I agree. But, why termination is needed? Is there a
> possibility to stale like walsender?

The purpose of this timeout is to terminate resolvers that are idle
for a long time. The resolver processes don't necessarily need to keep
running all the time for every database. On the other hand, launching
a resolver process per commit would be a high cost. So we have
resolver processes keep running at least for
foreign_transaction_resolver_timeout.

>
>
> >>
> >>
> >> b. about performance bottleneck (just share my simple benchmark results)
> >>
> >> The resolver process can be performance bottleneck easily although I think
> >> some users want this feature even if the performance is not so good.
> >>
> >> I tested with very simple workload in my laptop.
> >>
> >> The test condition is
> >> * two remote foreign partitions and one transaction inserts an entry in each
> >> partitions.
> >> * local connection only. If NW latency became higher, the performance became
> >> worse.
> >> * pgbench with 8 clients.
> >>
> >> The test results is the following. The performance of 2PC is only 10%
> >> performance of the one of without 2PC.
> >>
> >> * with foreign_twophase_commit = required
> >> -> If load with more than 10TPS, the number of unresolved foreign transactions
> >> is increasing and stop with the warning "Increase
> >> max_prepared_foreign_transactions".
> >
> > What was the value of max_prepared_foreign_transactions?
>
> Now, I tested with 200.
>
> If each resolution is finished very soon, I thought it's enough because
> 8clients x 2partitions = 16, though... But, it's difficult how to know the
> stable values.

To resolve one distributed transaction, the resolver needs both one round
trip and an fsync of a WAL record for each foreign transaction. Since the
client doesn’t wait for the distributed transaction to be resolved, the
resolver process can easily become the bottleneck given there are 8 clients.

If foreign transactions were resolved synchronously, 16 would suffice.
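As a back-of-envelope sketch of that serial cost model (one round trip plus one WAL fsync per foreign transaction, all on a single resolver), with purely assumed latencies rather than anything measured in this thread:

```c
#include <assert.h>

/*
 * Rough single-resolver throughput model: each distributed transaction
 * costs (round trip + fsync) per foreign transaction, processed serially.
 * The latency figures passed in are assumptions for illustration only.
 */
static double
resolver_tps(double rtt_ms, double fsync_ms, int foreign_xacts_per_txn)
{
    double per_txn_ms = (rtt_ms + fsync_ms) * foreign_xacts_per_txn;

    return 1000.0 / per_txn_ms;
}
```

Even with optimistic 1 ms round trips and 5 ms fsyncs, two foreign transactions per commit cap a single resolver at roughly 83 distributed transactions per second, which illustrates why it can become the bottleneck under 8 concurrent clients.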

>
>
> > To speed up the foreign transaction resolution, some ideas have been
> > discussed. As another idea, how about launching resolvers for each
> > foreign server? That way, we resolve foreign transactions on each
> > foreign server in parallel. If foreign transactions are concentrated
> > on the particular server, we can have multiple resolvers for the one
> > foreign server. It doesn’t change the fact that all foreign
> > transaction resolutions are processed by resolver processes.
>
> Awesome! There seems to be another advantage: even if a foreign server is
> temporarily busy or stopped due to failover, other foreign servers'
> transactions can still be resolved.

Yes. We also might need to be careful about the order of foreign
transaction resolution. I think we need to resolve foreign
transactions in arrival order at least within a foreign server.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/05/21 13:45, Masahiko Sawada wrote:
> On Fri, May 21, 2021 at 12:45 PM Masahiro Ikeda
> <ikedamsh@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2021/05/21 10:39, Masahiko Sawada wrote:
>>> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>>>>
>>>>
>>>> On 2021/05/11 13:37, Masahiko Sawada wrote:
>>>>> I've attached the updated patches that incorporated comments from
>>>>> Zhihong and Ikeda-san.
>>>>
>>>> Thanks for updating the patches!
>>>>
>>>>
>>>> I have other comments including trivial things.
>>>>
>>>>
>>>> a. about "foreign_transaction_resolver_timeout" parameter
>>>>
>>>> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs.
>>>> Is there any reason? Although the following is a minor case, it may confuse some
>>>> users.
>>>>
>>>> An example case is:
>>>>
>>>> 1. a client executes a transaction with 2PC while the resolver is processing
>>>> FdwXactResolverProcessInDoubtXacts().
>>>>
>>>> 2. the resolution of the 1st transaction must wait until other
>>>> 2PC transactions are executed or the timeout expires.
>>>>
>>>> 3. if the client checks the 1st result value, it should wait until resolution
>>>> is finished for atomic visibility (although this depends on how atomic
>>>> visibility is realized). The client may wait up to
>>>> "foreign_transaction_resolver_timeout". Users may think it's stalled.
>>>>
>>>> A situation like this can be observed after testing with pgbench: some
>>>> unresolved transactions remain after benchmarking.
>>>>
>>>> I assume that this default value refers to wal_sender, archiver, and so on.
>>>> But I think this parameter is more like "commit_delay". If so, 60 seconds
>>>> seems too big a value.
>>>
>>> IIUC this situation seems like foreign transaction resolution is the
>>> bottleneck and doesn't catch up with incoming resolution requests. But
>>> how does foreign_transaction_resolver_timeout relate to this situation?
>>> foreign_transaction_resolver_timeout controls when to terminate a
>>> resolver process that doesn't have any foreign transactions to
>>> resolve. So if we set it to several milliseconds, resolver processes are
>>> terminated immediately after each resolution, imposing the cost of
>>> launching resolver processes on the next resolution.
>>
>> Thanks for your comments!
>>
>> No, this situation is not related to whether foreign transaction resolution
>> is the bottleneck or not. This issue may happen when the workload has very few
>> foreign transactions.
>>
>> If a new foreign transaction comes while the transaction resolver is processing
>> resolutions via FdwXactResolverProcessInDoubtXacts(), the foreign transaction
>> waits until the next transaction resolution starts. If no further foreign
>> transaction comes, the foreign transaction must wait for resolution until the
>> timeout. I mentioned this situation.
> 
> Thanks for your explanation. I think that in this case we should set
> the latch of the resolver after preparing all foreign transactions so
> that the resolver processes those transactions without sleeping.

Yes, your idea is much better. Thanks!


>>
>> Thanks for letting me know the side effect of setting the resolution timeout to
>> several milliseconds. I agree. But why is termination needed? Is there a
>> possibility of going stale like walsender?
> 
> The purpose of this timeout is to terminate resolvers that are idle
> for a long time. The resolver processes don't necessarily need to keep
> running all the time for every database. On the other hand, launching
> a resolver process per commit would be a high cost. So we have
> resolver processes keep running at least for
> foreign_transaction_resolver_timeout.
Understood. I think it's reasonable.


>>>>
>>>>
>>>> b. about performance bottleneck (just share my simple benchmark results)
>>>>
>>>> The resolver process can easily become a performance bottleneck, although I think
>>>> some users want this feature even if the performance is not so good.
>>>>
>>>> I tested with very simple workload in my laptop.
>>>>
>>>> The test condition is
>>>> * two remote foreign partitions and one transaction inserts an entry in each
>>>> partitions.
>>>> * local connection only. If NW latency became higher, the performance became
>>>> worse.
>>>> * pgbench with 8 clients.
>>>>
>>>> The test results are the following. The performance with 2PC is only about 10%
>>>> of the performance without 2PC.
>>>>
>>>> * with foreign_twophase_commit = required
>>>> -> Under load above roughly 10 TPS, the number of unresolved foreign transactions
>>>> keeps increasing and the run stops with the warning "Increase
>>>> max_prepared_foreign_transactions".
>>>
>>> What was the value of max_prepared_foreign_transactions?
>>
>> Now, I tested with 200.
>>
>> If each resolution finished very quickly, I thought it would be enough because
>> 8 clients x 2 partitions = 16, though... But it's difficult to know a
>> stable value.
> 
> While resolving one distributed transaction, the resolver needs both
> one round trip and an fsync of a WAL record for each foreign transaction.
> Since the client doesn't wait for the distributed transaction to be
> resolved, the resolver process can easily become a bottleneck given there
> are 8 clients.
> 
> If foreign transactions were resolved synchronously, 16 would suffice.

OK, thanks.


>>
>>
>>> To speed up the foreign transaction resolution, some ideas have been
>>> discussed. As another idea, how about launching resolvers for each
>>> foreign server? That way, we resolve foreign transactions on each
>>> foreign server in parallel. If foreign transactions are concentrated
>>> on the particular server, we can have multiple resolvers for the one
>>> foreign server. It doesn’t change the fact that all foreign
>>> transaction resolutions are processed by resolver processes.
>>
>> Awesome! There seems to be another advantage: even if a foreign server is
>> temporarily busy or stopped due to failover, other foreign servers'
>> transactions can still be resolved.
> 
> Yes. We also might need to be careful about the order of foreign
> transaction resolution. I think we need to resolve foreign
> transactions in arrival order at least within a foreign server.

I agree it's better.

(Although this is my interest...)
Is it necessary? This idea seems to be for atomic visibility, but
2PC can't realize that, as you know. So I wondered about that.

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
>
>
> On 2021/05/21 13:45, Masahiko Sawada wrote:
> >
> > Yes. We also might need to be careful about the order of foreign
> > transaction resolution. I think we need to resolve foreign
> > transactions in arrival order at least within a foreign server.
>
> I agree it's better.
>
> (Although this is my interest...)
> Is it necessary? Although this idea seems to be for atomic visibility,
> 2PC can't realize that as you know. So, I wondered that.

I think it's for fairness. If a foreign transaction that arrived earlier
gets put off repeatedly in favor of foreign transactions that arrived later
because of its index in FdwXactCtl->xacts, that is surprising to users
and not fair. I think it's better to handle foreign transactions in a
FIFO manner (although this problem exists even in the current code).

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/05/25 21:59, Masahiko Sawada wrote:
> On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>>
>> On 2021/05/21 13:45, Masahiko Sawada wrote:
>>>
>>> Yes. We also might need to be careful about the order of foreign
>>> transaction resolution. I think we need to resolve foreign
>>> transactions in arrival order at least within a foreign server.
>>
>> I agree it's better.
>>
>> (Although this is my interest...)
>> Is it necessary? Although this idea seems to be for atomic visibility,
>> 2PC can't realize that as you know. So, I wondered that.
> 
> I think it's for fairness. If a foreign transaction arrived earlier
> gets put off so often for other foreign transactions arrived later due
> to its index in FdwXactCtl->xacts, it’s not understandable for users
> and not fair. I think it’s better to handle foreign transactions in
> FIFO manner (although this problem exists even in the current code).

OK, thanks.


On 2021/05/21 12:45, Masahiro Ikeda wrote:
> On 2021/05/21 10:39, Masahiko Sawada wrote:
>> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:
>> How much is the performance without those 2PC patches and with the
>> same workload? i.e., how fast is the current postgres_fdw that uses
>> XactCallback?
>
> OK, I'll test.

The test results are the following. But I couldn't confirm any performance
improvement from the 2PC patches, though I may need to change the test conditions.

[condition]
* 1 coordinator and 3 foreign servers
* There are two custom scripts, each of which accesses two different foreign
servers per transaction

``` fxact_select.pgbench
BEGIN;
SELECT * FROM part:p1 WHERE id = :id;
SELECT * FROM part:p2 WHERE id = :id;
COMMIT;
```

``` fxact_update.pgbench
BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

[results]

I have tested three times.
The performance difference seems to be within the margin of error.

# 6d0eb38557 with 2pc patches(v36) and foreign_twophase_commit = disable
- fxact_update.pgbench
72.3, 74.9, 77.5  TPS  => avg 74.9 TPS
110.5, 106.8, 103.2  ms => avg 106.8 ms

- fxact_select.pgbench
1767.6, 1737.1, 1717.4 TPS  => avg 1740.7 TPS
4.5, 4.6, 4.7 ms => avg 4.6ms

# 6d0eb38557 without 2pc patches
- fxact_update.pgbench
76.5, 70.6, 69.5 TPS => avg 72.2 TPS
104.5, 113.2, 115.1 ms => avg 111.0 ms

- fxact_select.pgbench
1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS
4.2, 4.6, 4.6 ms => avg 4.5 ms





# About the bottleneck of the resolver process

I investigated the performance bottleneck of the resolver process using perf.
The main bottleneck is the following functions.

1st. 42.8% routine->CommitForeignTransaction()
2nd. 31.5% remove_fdwxact()
3rd. 10.16% CommitTransaction()

The 1st and 3rd problems can be solved by parallelizing resolver processes per
remote server. But I wondered whether the idea that backends also call
"COMMIT/ABORT PREPARED" themselves, and the resolver process only takes charge
of resolving in-doubt foreign transactions, is better. In many cases, I think
that the number of connections is much greater than the number of remote
servers. If so, parallelization per server is not enough.

So, I think the idea that backends execute "COMMIT PREPARED" synchronously is
better. Citus has a 2PC feature in which backends send "COMMIT PREPARED" from
the extension. So, this idea is not bad.

Although resolving asynchronously has a performance benefit, we can't take
advantage of it because the resolver process easily becomes a bottleneck now.


The 2nd, remove_fdwxact(), syncs the WAL record which indicates that the
foreign transaction entry is removed. Is it necessary to sync it immediately?

Removing the sync may lengthen the recovery phase because some
fdwxact entries would need "COMMIT/ABORT PREPARED" again. But I think the
effect is limited.


# About other trivial comments.

* Is it better to call pgstat_send_wal() in the resolver process?

* Is it better to specify, in the description of
"max_foreign_transaction_resolvers", that only one resolver process can be
launched per database?

* Is the removal and insertion of new lines in foreigncmds.c intentional?

* Is it better that "max_prepared_foreign_transactions=%d" is after
"max_prepared_xacts=%d" in xlogdesc.c?

* Is "fdwxact_queue" unnecessary now?

* Is the following " + sizeof(FdwXactResolver)" unnecessary?

#define SizeOfFdwXactResolverCtlData \
    (offsetof(FdwXactResolverCtlData, resolvers) + sizeof(FdwXactResolver))

Although MultiXactStateData treats backendIds as 1-indexed,
the resolvers are 0-indexed. Sorry if my understanding is wrong.

* s/transaciton/transaction/

* s/foreign_xact_resolution_retry_interval since last
resolver/foreign_xact_resolution_retry_interval since last resolver was/

* Don't we need a debug log in the following code in postgres.c, like the
logical replication launcher shutdown?

    else if (IsFdwXactLauncher())
    {
        /*
        * The foreign transaction launcher can be stopped at any time.
        * Use exit status 1 so the background worker is restarted.
        */
        proc_exit(1);
    }

* Is pg_stop_foreign_xact_resolver(PG_FUNCTION_ARGS) not documented?

* Is it better to change "when arrived a requested by backend process." to
"when a request by a backend process arrives."?


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Jun 3, 2021 at 1:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
>
>
> On 2021/05/25 21:59, Masahiko Sawada wrote:
> > On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> >>
> >> On 2021/05/21 13:45, Masahiko Sawada wrote:
> >>>
> >>> Yes. We also might need to be careful about the order of foreign
> >>> transaction resolution. I think we need to resolve foreign
> >>> transactions in arrival order at least within a foreign server.
> >>
> >> I agree it's better.
> >>
> >> (Although this is my interest...)
> >> Is it necessary? Although this idea seems to be for atomic visibility,
> >> 2PC can't realize that as you know. So, I wondered that.
> >
> > I think it's for fairness. If a foreign transaction arrived earlier
> > gets put off so often for other foreign transactions arrived later due
> > to its index in FdwXactCtl->xacts, it’s not understandable for users
> > and not fair. I think it’s better to handle foreign transactions in
> > FIFO manner (although this problem exists even in the current code).
>
> OK, thanks.
>
>
> On 2021/05/21 12:45, Masahiro Ikeda wrote:
> > On 2021/05/21 10:39, Masahiko Sawada wrote:
> >> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
> wrote:
> >> How much is the performance without those 2PC patches and with the
> >> same workload? i.e., how fast is the current postgres_fdw that uses
> >> XactCallback?
> >
> > OK, I'll test.
>
> The test results are the following. But I couldn't confirm any performance
> improvement from the 2PC patches, though I may need to change the test conditions.
>
> [condition]
> * 1 coordinator and 3 foreign servers
> * There are two custom scripts which access different two foreign servers per
> transaction
>
> ``` fxact_select.pgbench
> BEGIN;
> SELECT * FROM part:p1 WHERE id = :id;
> SELECT * FROM part:p2 WHERE id = :id;
> COMMIT;
> ```
>
> ``` fxact_update.pgbench
> BEGIN;
> UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> COMMIT;
> ```
>
> [results]
>
> I have tested three times.
> Performance difference seems to be within the range of errors.
>
> # 6d0eb38557 with 2pc patches(v36) and foreign_twophase_commit = disable
> - fxact_update.pgbench
> 72.3, 74.9, 77.5  TPS  => avg 74.9 TPS
> 110.5, 106.8, 103.2  ms => avg 106.8 ms
>
> - fxact_select.pgbench
> 1767.6, 1737.1, 1717.4 TPS  => avg 1740.7 TPS
> 4.5, 4.6, 4.7 ms => avg 4.6ms
>
> # 6d0eb38557 without 2pc patches
> - fxact_update.pgbench
> 76.5, 70.6, 69.5 TPS => avg 72.2 TPS
> 104.534 + 113.244 + 115.097 => avg 111.0 ms
>
> -fxact_select.pgbench
> 1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS
> 4.2, 4.6, 4.6 ms=>  4.5 ms
>

Thank you for testing!

I think the result shows that managing foreign transactions on the
core side would not be a problem in terms of performance.

>
>
>
>
> # About the bottleneck of the resolver process
>
> I investigated the performance bottleneck of the resolver process using perf.
> The main bottleneck is the following functions.
>
> 1st. 42.8% routine->CommitForeignTransaction()
> 2nd. 31.5% remove_fdwxact()
> 3rd. 10.16% CommitTransaction()
>
> 1st and 3rd problems can be solved by parallelizing resolver processes per
> remote servers. But, I wondered that the idea, which backends call also
> "COMMIT/ABORT PREPARED" and the resolver process only takes changes of
> resolving in-doubt foreign transactions, is better. In many cases, I think
> that the number of connections is much greater than the number of remote
> servers. If so, the parallelization is not enough.
>
> So, I think the idea which backends execute "PREPARED COMMIT" synchronously is
> better. The citus has the 2PC feature and backends send "PREPARED COMMIT" in
> the extension. So, this idea is not bad.

Thank you for pointing it out. This idea has been proposed several
times and discussed. I'd like to summarize the proposed
ideas and their pros and cons before replying to your other comments.

There are 3 ideas. After the backend prepares all foreign transactions
and commits the local transaction,

1. the backend continues attempting to commit all prepared foreign
transactions until all of them are committed.
2. the backend attempts to commit all prepared foreign transactions
once. If an error happens, leave them for the resolver.
3. the backend asks the resolver launched per foreign server to
commit the prepared foreign transactions (and the backend waits or doesn't
wait for commit completion depending on the setting).

With ideas 1 and 2, since the backend itself commits all foreign
transactions, the resolver process cannot be a bottleneck, and the
code can probably get simpler as backends don't need to communicate
with resolver processes.

However, those have two problems we need to deal with:

First, users could get an error if an error happens while the backend is
committing prepared foreign transactions, even though the local transaction
is already committed and some foreign transactions could also be
committed, which confuses users. There were two opinions on this problem:
FDW developers should be responsible for writing FDW code such that
no error happens during committing foreign transactions, and
users can accept that confusion since an error could happen after
writing the commit WAL even today without this 2PC feature. For the
former point, I'm not sure it's always doable since even palloc()
could raise an error and it seems hard to require all FDW developers
to understand all possible paths that raise an error. And for the
latter point, that's true, but I think those cases are
should-not-happen cases (i.e., rare cases) whereas the likelihood of
an error during committing prepared transactions is not low (e.g., due
to a network connectivity problem). I think we need to assume that it is
not a rare case.

The second problem is whether we can cancel committing foreign
transactions by pg_cancel_backend() (or pressing Ctrl-C). If the
backend process commits prepared foreign transactions, it's FDW
developers' responsibility to write code that is interruptible. I'm
not sure that's feasible for drivers for other databases.

Idea 3 was proposed to deal with those problems. By having separate
processes (resolver processes) commit prepared foreign
transactions, we and FDW developers don't need to worry about those
two problems.

However, as Ikeda-san shared in the performance results, idea 3 is likely
to have a performance problem since resolver processes can easily become
a bottleneck. Moreover, with the current patch, since we commit foreign
prepared transactions asynchronously, if many concurrent clients use
2PC and max_foreign_prepared_transactions is reached, transactions end up
with an error.

Through the long discussion on this thread, I've thought we got a
consensus on idea 3, but ideas 1 and 2 are sometimes proposed again for
dealing with the performance problem. Ideas 1 and 2 are also good and
attractive, but I think we need to deal with the two problems first if
we go with one of those ideas. To be honest, I'm really not sure it's
good if we make those things the FDW developers' responsibility.

As long as we commit foreign prepared transactions asynchronously and
there is a max_foreign_prepared_transactions limit, it's possible that
committing those transactions cannot keep up. Maybe the same is
true for a case where the client heavily uses 2PC and asynchronously
commits prepared transactions: if committing prepared transactions
doesn't keep up with preparing transactions, the system reaches
max_prepared_transactions.

With the current patch, we commit prepared foreign transactions
asynchronously. But maybe we need to compare the performance of ideas
1 (and 2) to idea 3 with synchronous foreign transaction resolution.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
"ikedamsh@oss.nttdata.com"
Date:


2021/06/04 12:28、Masahiko Sawada <sawada.mshk@gmail.com>のメール:

On Thu, Jun 3, 2021 at 1:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:



On 2021/05/25 21:59, Masahiko Sawada wrote:
On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:

On 2021/05/21 13:45, Masahiko Sawada wrote:

Yes. We also might need to be careful about the order of foreign
transaction resolution. I think we need to resolve foreign transactions in arrival order at least within a foreign server.

I agree it's better.

(Although this is my interest...)
Is it necessary? Although this idea seems to be for atomic visibility,
2PC can't realize that as you know. So, I wondered that.

I think it's for fairness. If a foreign transaction arrived earlier
gets put off so often for other foreign transactions arrived later due
to its index in FdwXactCtl->xacts, it’s not understandable for users
and not fair. I think it’s better to handle foreign transactions in
FIFO manner (although this problem exists even in the current code).

OK, thanks.


On 2021/05/21 12:45, Masahiro Ikeda wrote:
On 2021/05/21 10:39, Masahiko Sawada wrote:
On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com>
wrote:
How much is the performance without those 2PC patches and with the
same workload? i.e., how fast is the current postgres_fdw that uses
XactCallback?

OK, I'll test.

The test results are the following. But I couldn't confirm any performance
improvement from the 2PC patches, though I may need to change the test conditions.

[condition]
* 1 coordinator and 3 foreign servers
* There are two custom scripts which access different two foreign servers per
transaction

``` fxact_select.pgbench
BEGIN;
SELECT * FROM part:p1 WHERE id = :id;
SELECT * FROM part:p2 WHERE id = :id;
COMMIT;
```

``` fxact_update.pgbench
BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

[results]

I have tested three times.
Performance difference seems to be within the range of errors.

# 6d0eb38557 with 2pc patches(v36) and foreign_twophase_commit = disable
- fxact_update.pgbench
72.3, 74.9, 77.5  TPS  => avg 74.9 TPS
110.5, 106.8, 103.2  ms => avg 106.8 ms

- fxact_select.pgbench
1767.6, 1737.1, 1717.4 TPS  => avg 1740.7 TPS
4.5, 4.6, 4.7 ms => avg 4.6ms

# 6d0eb38557 without 2pc patches
- fxact_update.pgbench
76.5, 70.6, 69.5 TPS => avg 72.2 TPS
104.534 + 113.244 + 115.097 => avg 111.0 ms

-fxact_select.pgbench
1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS
4.2, 4.6, 4.6 ms=>  4.5 ms


Thank you for testing!

I think the result shows that managing foreign transactions on the
core side would not be a problem in terms of performance.





# About the bottleneck of the resolver process

I investigated the performance bottleneck of the resolver process using perf.
The main bottleneck is the following functions.

1st. 42.8% routine->CommitForeignTransaction()
2nd. 31.5% remove_fdwxact()
3rd. 10.16% CommitTransaction()

1st and 3rd problems can be solved by parallelizing resolver processes per
remote servers. But, I wondered that the idea, which backends call also
"COMMIT/ABORT PREPARED" and the resolver process only takes changes of
resolving in-doubt foreign transactions, is better. In many cases, I think
that the number of connections is much greater than the number of remote
servers. If so, the parallelization is not enough.

So, I think the idea which backends execute "PREPARED COMMIT" synchronously is
better. The citus has the 2PC feature and backends send "PREPARED COMMIT" in
the extension. So, this idea is not bad.

Thank you for pointing it out. This idea has been proposed several
times and there were discussions. I'd like to summarize the proposed
ideas and those pros and cons before replying to your other comments.

There are 3 ideas. After backend both prepares all foreign transaction
and commit the local transaction,

1. the backend continues attempting to commit all prepared foreign
transactions until all of them are committed.
2. the backend attempts to commit all prepared foreign transactions
once. If an error happens, leave them for the resolver.
3. the backend asks the resolver that launched per foreign server to
commit the prepared foreign transactions (and backend waits or doesn't
wait for the commit completion depending on the setting).

With ideas 1 and 2, since the backend itself commits all foreign
transactions the resolver process cannot be a bottleneck, and probably
the code can get more simple as backends don't need to communicate
with resolver processes.

However, those have two problems we need to deal with:

Thanks for sharing the summary. I understand there are problems related to
the FDW implementation.

First, users could get an error if an error happens during the backend
committing prepared foreign transaction but the local transaction is
already committed and some foreign transactions could also be
committed, confusing users. There were two opinions to this problem:
FDW developers should be responsible for writing FDW code such that
any error doesn't happen during committing foreign transactions, and
users can accept that confusion since an error could happen after
writing the commit WAL even today without this 2PC feature. For the
former point, I'm not sure it's always doable since even palloc()
could raise an error and it seems hard to require all FDW developers
to understand all possible paths of raising an error. And for the
latter point, that's true but I think those cases are
should-not-happen cases (i.g., rare cases) whereas the likelihood of
an error during committing prepared transactions is not low (e.g., by
network connectivity problem). I think we need to assume that that is
not a rare case.

Hmm… Sorry, I don't have any good ideas now.

If anything, I'm on the second side, where users accept the confusion, though
it's necessary to let users know whether the error happened before or after the
local commit, because in the former case users will execute the same query again.


The second problem is whether we can cancel committing foreign
transactions by pg_cancel_backend() (or pressing Ctl-c). If the
backend process commits prepared foreign transactions, it's FDW
developers' responsibility to write code that is interruptible. I’m
not sure it’s feasible for drivers for other databases.

Sorry, my understanding is not clear.

After all prepares are done, the foreign transactions will be committed.
So, does this mean that the FDW must leave the unresolved transaction to the
transaction resolver and show a message like "Since the transaction is already
committed, the transaction will be resolved in the background"?


Idea 3 is proposed to deal with those problems. By having separate
processes, resolver processes, committing prepared foreign
transactions, we and FDW developers don't need to worry about those
two problems.

However as Ikeda-san shared the performance results, idea 3 is likely
to have a performance problem since resolver processes can easily be
bottle-neck. Moreover, with the current patch, since we asynchronously
commit foreign prepared transactions, if many concurrent clients use
2PC, reaching max_foreign_prepared_transactions,  transactions end up
with an error.

Through the long discussion on this thread, I've been thought we got a
consensus on idea 3 but sometimes ideas 1 and 2 are proposed again for
dealing with the performance problem. Idea 1 and 2 are also good and
attractive, but I think we need to deal with the two problems first if
we go with one of those ideas. To be honest, I'm really not sure it's
good if we make those things FDW developers responsibility.

As long as we commit foreign prepared transactions asynchronously and
there is max_foreign_prepared_transactions limit, it's possible that
committing those transactions could not keep up. Maybe the same is
true for a case where the client heavily uses 2PC and asynchronously
commits prepared transactions. If committing prepared transactions
doesn't keep up with preparing transactions, the system reaches
max_prepared_transactions.

With the current patch, we commit prepared foreign transactions
asynchronously. But maybe we need to compare the performance of ideas
1 (and 2) to idea 3 with synchronous foreign transaction resolution.

OK, I understand the consensus is the 3rd one. I agree with it since I don't
have any solutions for the problems related to the 1st and 2nd. If I find any,
I'll share them with you.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION

RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> 1. the backend continues attempting to commit all prepared foreign
> transactions until all of them are committed.
> 2. the backend attempts to commit all prepared foreign transactions
> once. If an error happens, leave them for the resolver.
> 3. the backend asks the resolver that launched per foreign server to
> commit the prepared foreign transactions (and backend waits or doesn't
> wait for the commit completion depending on the setting).
> 
> With ideas 1 and 2, since the backend itself commits all foreign
> transactions the resolver process cannot be a bottleneck, and probably
> the code can get more simple as backends don't need to communicate
> with resolver processes.
> 
> However, those have two problems we need to deal with:
> 

> First, users could get an error if an error happens during the backend
> committing prepared foreign transaction but the local transaction is
> already committed and some foreign transactions could also be
> committed, confusing users. There were two opinions to this problem:
> FDW developers should be responsible for writing FDW code such that
> any error doesn't happen during committing foreign transactions, and
> users can accept that confusion since an error could happen after
> writing the commit WAL even today without this 2PC feature. 

Why does the user have to get an error?  Once the local transaction has been
prepared, which means all remote ones also have been prepared, the whole
transaction is determined to commit.  So, the user doesn't have to receive an
error as long as the local node is alive.
 


> For the
> former point, I'm not sure it's always doable since even palloc()
> could raise an error and it seems hard to require all FDW developers
> to understand all possible paths of raising an error.

No, this is a matter of discipline to ensure consistency, just in case we really have to return an error to the user.


> And for the
> latter point, that's true but I think those cases are
> should-not-happen cases (i.e., rare cases) whereas the likelihood of
> an error during committing prepared transactions is not low (e.g., by
> network connectivity problem). I think we need to assume that that is
> not a rare case.

How do non-2PC and 2PC cases differ in the rarity of the error?


> The second problem is whether we can cancel committing foreign
> transactions by pg_cancel_backend() (or pressing Ctrl-C). If the
> backend process commits prepared foreign transactions, it's FDW
> developers' responsibility to write code that is interruptible. I’m
> not sure it’s feasible for drivers for other databases.

That's true not only for prepare and commit but also for other queries.  Why do we have to treat prepare and commit
specially?


> Through the long discussion on this thread, I've thought we got a
> consensus on idea 3 but sometimes ideas 1 and 2 are proposed again for

I don't remember seeing any consensus yet?

> With the current patch, we commit prepared foreign transactions
> asynchronously. But maybe we need to compare the performance of ideas
> 1 (and 2) to idea 3 with synchronous foreign transaction resolution.

+1


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com
<ikedamsh@oss.nttdata.com> wrote:
>
>
>
> 2021/06/04 12:28, email from Masahiko Sawada <sawada.mshk@gmail.com>:
>
>
> Thank you for pointing it out. This idea has been proposed several
> times and there were discussions. I'd like to summarize the proposed
> ideas and their pros and cons before replying to your other comments.
>
> There are 3 ideas. After the backend both prepares all foreign transactions
> and commits the local transaction,
>
> 1. the backend continues attempting to commit all prepared foreign
> transactions until all of them are committed.
> 2. the backend attempts to commit all prepared foreign transactions
> once. If an error happens, leave them for the resolver.
> 3. the backend asks the resolver that launched per foreign server to
> commit the prepared foreign transactions (and backend waits or doesn't
> wait for the commit completion depending on the setting).
>
> With ideas 1 and 2, since the backend itself commits all foreign
> transactions the resolver process cannot be a bottleneck, and probably
> the code can get more simple as backends don't need to communicate
> with resolver processes.
>
> However, those have two problems we need to deal with:
>
>
> Thanks for sharing the summary. I understood there are problems related to
> the FDW implementation.
>
> First, users could get an error if an error happens during the backend
> committing prepared foreign transaction but the local transaction is
> already committed and some foreign transactions could also be
> committed, confusing users. There were two opinions to this problem:
> FDW developers should be responsible for writing FDW code such that
> any error doesn't happen during committing foreign transactions, and
> users can accept that confusion since an error could happen after
> writing the commit WAL even today without this 2PC feature. For the
> former point, I'm not sure it's always doable since even palloc()
> could raise an error and it seems hard to require all FDW developers
> to understand all possible paths of raising an error. And for the
> latter point, that's true but I think those cases are
> should-not-happen cases (i.g., rare cases) whereas the likelihood of
> an error during committing prepared transactions is not low (e.g., by
> network connectivity problem). I think we need to assume that that is
> not a rare case.
>
>
> Hmm… Sorry, I don’t have any good ideas now.
>
> If anything, I'm on the second side, where users accept the confusion, though
> letting users know whether the error happened before or after the local commit is
> necessary, because in the former case users will execute the same query again.

Yeah, users will need to remember the XID of the last executed
transaction and check if it has been committed by pg_xact_status().
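To make that concrete, here is a rough Python sketch of the client-side recovery logic; the query_xact_status callback stands in for running `SELECT pg_xact_status('<xid>'::xid8)` over a real connection, and the function names are illustrative, not a real client API.

```python
# Sketch of the client-side recovery described above: remember the XID of
# the last transaction and, when COMMIT raises an ambiguous error, check
# pg_xact_status() before deciding whether to re-run. query_xact_status()
# is a stand-in for querying the server; in PostgreSQL it returns
# 'committed', 'aborted', or 'in progress'.

def recover_after_commit_error(last_xid, query_xact_status):
    """Decide what the client should do after an ambiguous commit error."""
    status = query_xact_status(last_xid)
    if status == 'committed':
        return 'done'    # error happened after commit; do not re-execute
    if status == 'aborted':
        return 'retry'   # transaction never took effect; safe to re-run
    return 'wait'        # still in progress; poll again later
```

The point is that the decision to re-execute must be driven by the durable transaction status, not by whether COMMIT returned an error.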

>
>
> The second problem is whether we can cancel committing foreign
> transactions by pg_cancel_backend() (or pressing Ctl-c). If the
> backend process commits prepared foreign transactions, it's FDW
> developers' responsibility to write code that is interruptible. I’m
> not sure it’s feasible for drivers for other databases.
>
>
> Sorry, my understanding is not clear.
>
> After all prepares are done, the foreign transactions will be committed.
> So, does this mean that FDW must leave the unresolved transaction to the transaction
> resolver and show some messages like “Since the transaction is already committed,
> the transaction will be resolved in background" ?

I think this would happen after the backend cancels COMMIT PREPARED.
To be able to cancel an in-progress query the backend needs to accept
the interruption and send the cancel request. postgres_fdw can do that
since libpq supports sending a query and waiting for the result but
I’m not sure about other drivers.
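Roughly, the interruptible pattern postgres_fdw can follow (libpq's PQsendQuery() plus polling, rather than a blocking PQexec()) looks like this sketch; the events and timings below are illustrative stand-ins, not actual FDW code.

```python
# Sketch of an interruptible wait for COMMIT PREPARED: issue the command
# asynchronously, then wait for the result in short slices, checking for a
# pending cancel (e.g. from pg_cancel_backend()) between slices. The libpq
# analogue is PQsendQuery() + PQconsumeInput()/PQisBusy().

import threading

def wait_interruptibly(result_ready, cancel_requested, poll=0.01, max_polls=1000):
    """Wait for a remote result, but give up promptly if cancelled."""
    for _ in range(max_polls):
        if result_ready.wait(timeout=poll):   # result arrived
            return 'committed'
        if cancel_requested.is_set():         # cancel request pending
            return 'cancelled'                # leave the fate to the resolver
    return 'timeout'
```

A driver that only offers a blocking "execute and wait" call cannot interleave the cancel check this way, which is the feasibility concern above.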

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> 1. the backend continues attempting to commit all prepared foreign
> > transactions until all of them are committed.
> > 2. the backend attempts to commit all prepared foreign transactions
> > once. If an error happens, leave them for the resolver.
> > 3. the backend asks the resolver that launched per foreign server to
> > commit the prepared foreign transactions (and backend waits or doesn't
> > wait for the commit completion depending on the setting).
> >
> > With ideas 1 and 2, since the backend itself commits all foreign
> > transactions the resolver process cannot be a bottleneck, and probably
> > the code can get more simple as backends don't need to communicate
> > with resolver processes.
> >
> > However, those have two problems we need to deal with:
> >
>
> > First, users could get an error if an error happens during the backend
> > committing prepared foreign transaction but the local transaction is
> > already committed and some foreign transactions could also be
> > committed, confusing users. There were two opinions to this problem:
> > FDW developers should be responsible for writing FDW code such that
> > any error doesn't happen during committing foreign transactions, and
> > users can accept that confusion since an error could happen after
> > writing the commit WAL even today without this 2PC feature.
>
> Why does the user have to get an error?  Once the local transaction has been prepared, which means all remote ones
> also have been prepared, the whole transaction is determined to commit.  So, the user doesn't have to receive an error
> as long as the local node is alive.

I think we should neither ignore the error thrown by FDW code nor
lower the error level (e.g., ERROR to WARNING).

>
> > And for the
> > latter point, that's true but I think those cases are
> > should-not-happen cases (i.g., rare cases) whereas the likelihood of
> > an error during committing prepared transactions is not low (e.g., by
> > network connectivity problem). I think we need to assume that that is
> > not a rare case.
>
> How do non-2PC and 2PC cases differ in the rarity of the error?

I think the main difference would be that in 2PC case there will be
network communications possibly with multiple servers after the local
commit.
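A simulated sketch of that window: after the local commit the backend still has one network round trip per remote server, and each round trip can fail. The servers and exceptions below are made up for illustration.

```python
# Why 2PC widens the error window: once the local transaction commits, the
# outcome is fixed, but the backend must still run COMMIT PREPARED on every
# remote server over the network. Any of those calls can fail, leaving
# prepared transactions that must be resolved later.

def commit_sequence(remotes, commit_prepared):
    # phase 1: PREPARE TRANSACTION on every remote (not shown)
    # phase 2: local commit -- from here on the outcome is COMMIT
    local_committed = True
    failed = []
    for server in remotes:            # network I/O after the local commit
        try:
            commit_prepared(server)
        except ConnectionError:
            failed.append(server)     # must be resolved later
    return local_committed, failed
```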

>
>
> > The second problem is whether we can cancel committing foreign
> > transactions by pg_cancel_backend() (or pressing Ctl-c). If the
> > backend process commits prepared foreign transactions, it's FDW
> > developers' responsibility to write code that is interruptible. I’m
> > not sure it’s feasible for drivers for other databases.
>
> That's true not only for prepare and commit but also for other queries.  Why do we have to treat prepare and commit
specially?

Good point. This would not be a blocker for ideas 1 and 2 but is a
side benefit of idea 3.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > Why does the user have to get an error?  Once the local transaction has been
> prepared, which means all remote ones also have been prepared, the whole
> transaction is determined to commit.  So, the user doesn't have to receive an
> error as long as the local node is alive.
> 
> I think we should neither ignore the error thrown by FDW code nor
> lower the error level (e.g., ERROR to WARNING).

Why?  (Forgive me for asking relentlessly... by imagining me as a cute 7-year-old boy/girl asking "Why Dad?")


> > How do non-2PC and 2PC cases differ in the rarity of the error?
> 
> I think the main difference would be that in 2PC case there will be
> network communications possibly with multiple servers after the local
> commit.

Then, it's the same failure mode.  That is, the same failure could occur for both cases.  That doesn't require us to
differentiate between them.  Let's ignore this point from now on.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jun 4, 2021 at 5:59 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > > Why does the user have to get an error?  Once the local transaction has been
> > prepared, which means all remote ones also have been prepared, the whole
> > transaction is determined to commit.  So, the user doesn't have to receive an
> > error as long as the local node is alive.
> >
> > I think we should neither ignore the error thrown by FDW code nor
> > lower the error level (e.g., ERROR to WARNING).
>
> Why?  (Forgive me for asking relentlessly... by imagining me as a cute 7-year-old boy/girl asking "Why Dad?")

I think we should not reinterpret the severity of the error and lower
it. Especially, in this case, any kind of errors can be thrown. It
could be such a serious error that FDW developer wants to report to
the client. Do we lower even PANIC to a lower severity such as
WARNING? That's definitely a bad idea. And if we don't lower PANIC while
lowering ERROR (and FATAL) to WARNING, why would we regard only the latter
as non-errors?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jun 4, 2021 at 5:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com
> <ikedamsh@oss.nttdata.com> wrote:
> >
> >
> >
> > 2021/06/04 12:28, email from Masahiko Sawada <sawada.mshk@gmail.com>:
> >
> >
> > Thank you for pointing it out. This idea has been proposed several
> > times and there were discussions. I'd like to summarize the proposed
> > ideas and those pros and cons before replying to your other comments.
> >
> > There are 3 ideas. After backend both prepares all foreign transaction
> > and commit the local transaction,
> >
> > 1. the backend continues attempting to commit all prepared foreign
> > transactions until all of them are committed.
> > 2. the backend attempts to commit all prepared foreign transactions
> > once. If an error happens, leave them for the resolver.
> > 3. the backend asks the resolver that launched per foreign server to
> > commit the prepared foreign transactions (and backend waits or doesn't
> > wait for the commit completion depending on the setting).
> >
> > With ideas 1 and 2, since the backend itself commits all foreign
> > transactions the resolver process cannot be a bottleneck, and probably
> > the code can get more simple as backends don't need to communicate
> > with resolver processes.
> >
> > However, those have two problems we need to deal with:
> >
> >
> > Thanks for sharing the summarize. I understood there are problems related to
> > FDW implementation.
> >
> > First, users could get an error if an error happens during the backend
> > committing prepared foreign transaction but the local transaction is
> > already committed and some foreign transactions could also be
> > committed, confusing users. There were two opinions to this problem:
> > FDW developers should be responsible for writing FDW code such that
> > any error doesn't happen during committing foreign transactions, and
> > users can accept that confusion since an error could happen after
> > writing the commit WAL even today without this 2PC feature. For the
> > former point, I'm not sure it's always doable since even palloc()
> > could raise an error and it seems hard to require all FDW developers
> > to understand all possible paths of raising an error. And for the
> > latter point, that's true but I think those cases are
> > should-not-happen cases (i.g., rare cases) whereas the likelihood of
> > an error during committing prepared transactions is not low (e.g., by
> > network connectivity problem). I think we need to assume that that is
> > not a rare case.
> >
> >
> > Hmm… Sorry, I don’t have any good ideas now.
> >
> > If anything, I’m on second side which users accept the confusion though
> > let users know a error happens before local commit is done or not is necessary
> > because if the former case, users will execute the same query again.
>
> Yeah, users will need to remember the XID of the last executed
> transaction and check if it has been committed by pg_xact_status().

As the second idea, can we send something like a hint along with the
error (or send a new type of error) that indicates the error happened
after the transaction commit so that the client can decide whether or
not to ignore the error? That way, we can deal with the confusion led
by an error raised after the local commit by the existing post-commit
cleanup routines (and post-commit xact callbacks) as well as by FDW’s
commit prepared routine.
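To illustrate, a hypothetical client-side handler for such a post-commit-flagged error might look like the following; the exception class and its flag are invented purely to sketch the protocol, not an existing API.

```python
# Sketch of the "error with a post-commit indicator" idea: the error the
# server sends carries a flag saying the local commit already happened, and
# the client uses it to decide whether the error is ignorable. Both the
# exception class and the flag name are hypothetical.

class PostCommitError(Exception):
    def __init__(self, msg, committed_locally):
        super().__init__(msg)
        self.committed_locally = committed_locally

def client_should_retry(err):
    # If the transaction is already durably committed, the error only
    # concerns cleanup (the resolver will finish it); re-running the
    # transaction would duplicate its effects.
    return not getattr(err, 'committed_locally', False)
```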

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
"ikedamsh@oss.nttdata.com"
Date:

> 2021/06/04 17:16, email from Masahiko Sawada <sawada.mshk@gmail.com>:
>
> On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com
> <ikedamsh@oss.nttdata.com> wrote:
>>
>>
>>
>> 2021/06/04 12:28, email from Masahiko Sawada <sawada.mshk@gmail.com>:
>>
>>
>> Thank you for pointing it out. This idea has been proposed several
>> times and there were discussions. I'd like to summarize the proposed
>> ideas and those pros and cons before replying to your other comments.
>>
>> There are 3 ideas. After backend both prepares all foreign transaction
>> and commit the local transaction,
>>
>> 1. the backend continues attempting to commit all prepared foreign
>> transactions until all of them are committed.
>> 2. the backend attempts to commit all prepared foreign transactions
>> once. If an error happens, leave them for the resolver.
>> 3. the backend asks the resolver that launched per foreign server to
>> commit the prepared foreign transactions (and backend waits or doesn't
>> wait for the commit completion depending on the setting).
>>
>> With ideas 1 and 2, since the backend itself commits all foreign
>> transactions the resolver process cannot be a bottleneck, and probably
>> the code can get more simple as backends don't need to communicate
>> with resolver processes.
>>
>> However, those have two problems we need to deal with:
>>
>>
>> Thanks for sharing the summarize. I understood there are problems related to
>> FDW implementation.
>>
>> First, users could get an error if an error happens during the backend
>> committing prepared foreign transaction but the local transaction is
>> already committed and some foreign transactions could also be
>> committed, confusing users. There were two opinions to this problem:
>> FDW developers should be responsible for writing FDW code such that
>> any error doesn't happen during committing foreign transactions, and
>> users can accept that confusion since an error could happen after
>> writing the commit WAL even today without this 2PC feature. For the
>> former point, I'm not sure it's always doable since even palloc()
>> could raise an error and it seems hard to require all FDW developers
>> to understand all possible paths of raising an error. And for the
>> latter point, that's true but I think those cases are
>> should-not-happen cases (i.g., rare cases) whereas the likelihood of
>> an error during committing prepared transactions is not low (e.g., by
>> network connectivity problem). I think we need to assume that that is
>> not a rare case.
>>
>>
>> Hmm… Sorry, I don’t have any good ideas now.
>>
>> If anything, I’m on second side which users accept the confusion though
>> let users know a error happens before local commit is done or not is necessary
>> because if the former case, users will execute the same query again.
>
> Yeah, users will need to remember the XID of the last executed
> transaction and check if it has been committed by pg_xact_status().
>
>>
>>
>> The second problem is whether we can cancel committing foreign
>> transactions by pg_cancel_backend() (or pressing Ctl-c). If the
>> backend process commits prepared foreign transactions, it's FDW
>> developers' responsibility to write code that is interruptible. I’m
>> not sure it’s feasible for drivers for other databases.
>>
>>
>> Sorry, my understanding is not clear.
>>
>> After all prepares are done, the foreign transactions will be committed.
>> So, does this mean that FDW must leave the unresolved transaction to the transaction
>> resolver and show some messages like “Since the transaction is already committed,
>> the transaction will be resolved in background" ?
>
> I think this would happen after the backend cancels COMMIT PREPARED.
> To be able to cancel an in-progress query the backend needs to accept
> the interruption and send the cancel request. postgres_fdw can do that
> since libpq supports sending a query and waiting for the result but
> I’m not sure about other drivers.

Thanks, I understood that handling this issue is not in the scope of the 2PC feature,
as Tsunakawa-san and you said.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Transactions involving multiple postgres foreign servers, take 2

From
"ikedamsh@oss.nttdata.com"
Date:

> 2021/06/04 21:38, email from Masahiko Sawada <sawada.mshk@gmail.com>:
>
> On Fri, Jun 4, 2021 at 5:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com
>> <ikedamsh@oss.nttdata.com> wrote:
>>>
>>>
>>>
>>> 2021/06/04 12:28, email from Masahiko Sawada <sawada.mshk@gmail.com>:
>>>
>>>
>>> Thank you for pointing it out. This idea has been proposed several
>>> times and there were discussions. I'd like to summarize the proposed
>>> ideas and those pros and cons before replying to your other comments.
>>>
>>> There are 3 ideas. After backend both prepares all foreign transaction
>>> and commit the local transaction,
>>>
>>> 1. the backend continues attempting to commit all prepared foreign
>>> transactions until all of them are committed.
>>> 2. the backend attempts to commit all prepared foreign transactions
>>> once. If an error happens, leave them for the resolver.
>>> 3. the backend asks the resolver that launched per foreign server to
>>> commit the prepared foreign transactions (and backend waits or doesn't
>>> wait for the commit completion depending on the setting).
>>>
>>> With ideas 1 and 2, since the backend itself commits all foreign
>>> transactions the resolver process cannot be a bottleneck, and probably
>>> the code can get more simple as backends don't need to communicate
>>> with resolver processes.
>>>
>>> However, those have two problems we need to deal with:
>>>
>>>
>>> Thanks for sharing the summarize. I understood there are problems related to
>>> FDW implementation.
>>>
>>> First, users could get an error if an error happens during the backend
>>> committing prepared foreign transaction but the local transaction is
>>> already committed and some foreign transactions could also be
>>> committed, confusing users. There were two opinions to this problem:
>>> FDW developers should be responsible for writing FDW code such that
>>> any error doesn't happen during committing foreign transactions, and
>>> users can accept that confusion since an error could happen after
>>> writing the commit WAL even today without this 2PC feature. For the
>>> former point, I'm not sure it's always doable since even palloc()
>>> could raise an error and it seems hard to require all FDW developers
>>> to understand all possible paths of raising an error. And for the
>>> latter point, that's true but I think those cases are
>>> should-not-happen cases (i.g., rare cases) whereas the likelihood of
>>> an error during committing prepared transactions is not low (e.g., by
>>> network connectivity problem). I think we need to assume that that is
>>> not a rare case.
>>>
>>>
>>> Hmm… Sorry, I don’t have any good ideas now.
>>>
>>> If anything, I’m on second side which users accept the confusion though
>>> let users know a error happens before local commit is done or not is necessary
>>> because if the former case, users will execute the same query again.
>>
>> Yeah, users will need to remember the XID of the last executed
>> transaction and check if it has been committed by pg_xact_status().
>
> As the second idea, can we send something like a hint along with the
> error (or send a new type of error) that indicates the error happened
> after the transaction commit so that the client can decide whether or
> not to ignore the error? That way, we can deal with the confusion led
> by an error raised after the local commit by the existing post-commit
> cleanup routines (and post-commit xact callbacks) as well as by FDW’s
> commit prepared routine.


I think your second idea is better because it's easier for users to know what
error happened and there is nothing users need to do. But since the focus of a
"hint" is how to fix the problem, would "context" be more appropriate?

FWIW, I took a quick look at elog.c and found there is "error_context_stack".
So, why don't you add a context after the local commit that shows something like
"the transaction's fate is decided as COMMIT (or ROLLBACK), so even if an error
happens, the transaction will be resolved in the background"?


Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION




RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> I think we should not reinterpret the severity of the error and lower
> it. Especially, in this case, any kind of errors can be thrown. It
> could be such a serious error that FDW developer wants to report to
> the client. Do we lower even PANIC to a lower severity such as
> WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas
> lowering ERROR (and FATAL) to WARNING, why do we regard only them as
> non-error?

Why does the client have to know about an error on a remote server, when the global transaction itself is destined to
commit?

FYI, the tx_commit() in the X/Open TX interface and the UserTransaction.commit() in JTA don't return such an error,
IIRC. Do TX_FAIL and SystemException serve such a purpose?  I don't feel like that.
 


[Tuxedo manual (Japanese)]
https://docs.oracle.com/cd/F25597_01/document/products/tuxedo/tux80j/atmi/rf3c91.htm


[JTA]
public interface javax.transaction.UserTransaction 
public void commit()
 throws RollbackException, HeuristicMixedException, 
HeuristicRollbackException, SecurityException, 
IllegalStateException, SystemException 

Throws: RollbackException 
Thrown to indicate that the transaction has been rolled back rather than committed. 

Throws: HeuristicMixedException 
Thrown to indicate that a heuristic decision was made and that some relevant updates have been 
committed while others have been rolled back. 

Throws: HeuristicRollbackException 
Thrown to indicate that a heuristic decision was made and that all relevant updates have been rolled 
back. 

Throws: SecurityException 
Thrown to indicate that the thread is not allowed to commit the transaction. 

Throws: IllegalStateException 
Thrown if the current thread is not associated with a transaction. 

Throws: SystemException 
Thrown if the transaction manager encounters an unexpected error condition. 
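Reading the exception list above, the mapping from 2PC outcomes to JTA's commit() results could be sketched as follows; this mapping is my reading of the quoted descriptions, not taken verbatim from the spec.

```python
# Hedged sketch: which JTA exception corresponds to which 2PC outcome,
# based on the javadoc text quoted above. Arguments describe the global
# transaction's state: whether all participants voted "prepared", and
# whether any participant ended up committed / rolled back.

def jta_outcome(prepare_ok, any_committed, any_rolled_back):
    if not prepare_ok:
        return 'RollbackException'           # voting failed -> all rolled back
    if any_committed and any_rolled_back:
        return 'HeuristicMixedException'     # heuristic: partly committed
    if any_rolled_back:
        return 'HeuristicRollbackException'  # heuristic: all rolled back
    return 'commit'                          # normal successful commit
```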


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > I think we should not reinterpret the severity of the error and lower
> > it. Especially, in this case, any kind of errors can be thrown. It
> > could be such a serious error that FDW developer wants to report to
> > the client. Do we lower even PANIC to a lower severity such as
> > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas
> > lowering ERROR (and FATAL) to WARNING, why do we regard only them as
> > non-error?
>
> Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to
commit?

It's not necessarily on a remote server. It could be a problem with
the local server.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
(I have caught up here. Sorry in advance for possibly pointless
discussion by me...)

At Tue, 8 Jun 2021 00:47:08 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in 
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > I think we should not reinterpret the severity of the error and lower
> > it. Especially, in this case, any kind of errors can be thrown. It
> > could be such a serious error that FDW developer wants to report to
> > the client. Do we lower even PANIC to a lower severity such as
> > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas
> > lowering ERROR (and FATAL) to WARNING, why do we regard only them as
> > non-error?
> 
> Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to
> commit?

I think the discussion is based on the behavior that whatever process is
responsible for finishing the 2PC commit continues retrying remote
commits until all of them succeed.

Maybe in most cases the errors during committing remote prepared
transactions could be retryable, but as Sawada-san says I'm also not sure
that's always the case.  On the other hand, it could be said that we have
no other way than retrying the remote commits if we want to get over, say,
momentary network failures automatically.  It is somewhat similar to
WAL restoration, which keeps complaining about restore command failures
without exiting.
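That retry-until-success behavior could be sketched like this; the backoff values and the transient/fatal split are illustrative assumptions, not from any patch in this thread.

```python
# Sketch of the retry loop described above: keep retrying a remote commit
# on transient failures (much like recovery re-running its restore
# command), and only surface errors judged non-retryable.

import time

def retry_remote_commit(do_commit, is_transient, sleep=time.sleep, max_backoff=8.0):
    backoff = 0.5
    while True:
        try:
            do_commit()
            return True
        except Exception as e:
            if not is_transient(e):
                raise                 # non-retryable: hand off / report
            sleep(backoff)            # wait before retrying
            backoff = min(backoff * 2, max_backoff)
```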

> FYI, the tx_commit() in the X/Open TX interface and the UserTransaction.commit() in JTA don't return such an error,
> IIRC. Do TX_FAIL and SystemException serve such a purpose?  I don't feel like that.

I'm not sure how JTA works in detail, but doesn't
UserTransaction.commit() throw HeuristicMixedException when some of the
relevant updates have been committed but others have not? Isn't that the
same state as the case where some of the remote servers failed on
the remote commit while others succeeded?  (I guess that
UserTransaction.commit() would throw RollbackException if
the remote prepare had failed for any of the remotes.)


> [Tuxedo manual (Japanese)]
> https://docs.oracle.com/cd/F25597_01/document/products/tuxedo/tux80j/atmi/rf3c91.htm
> 
> 
> [JTA]
> public interface javax.transaction.UserTransaction 
> public void commit()
>  throws RollbackException, HeuristicMixedException, 
> HeuristicRollbackException, SecurityException, 
> IllegalStateException, SystemException 
> 
> Throws: RollbackException 
> Thrown to indicate that the transaction has been rolled back rather than committed. 
> 
> Throws: HeuristicMixedException 
> Thrown to indicate that a heuristic decision was made and that some relevant updates have been 
> committed while others have been rolled back. 
> 
> Throws: HeuristicRollbackException 
> Thrown to indicate that a heuristic decision was made and that all relevant updates have been rolled 
> back. 
> 
> Throws: SecurityException 
> Thrown to indicate that the thread is not allowed to commit the transaction. 
> 
> Throws: IllegalStateException 
> Thrown if the current thread is not associated with a transaction. 
> 
> Throws: SystemException 
> Thrown if the transaction manager encounters an unexpected error condition. 
> 
> 
> Regards
> Takayuki Tsunakawa

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 8 Jun 2021 16:32:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in 
> On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Masahiko Sawada <sawada.mshk@gmail.com>
> > > I think we should not reinterpret the severity of the error and lower
> > > it. Especially, in this case, any kind of errors can be thrown. It
> > > could be such a serious error that FDW developer wants to report to
> > > the client. Do we lower even PANIC to a lower severity such as
> > > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas
> > > lowering ERROR (and FATAL) to WARNING, why do we regard only them as
> > > non-error?
> >
> > Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to
> commit?
> 
> It's not necessarily on a remote server. It could be a problem with
> the local server.

Isn't it a discussion about the errors from postgres_fdw?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > Why does the client have to know the error on a remote server, whereas the
> global transaction itself is destined to commit?
> 
> It's not necessarily on a remote server. It could be a problem with
> the local server.

Then, in what kind of scenario are we talking about the difficulty, and how is it difficult to handle, when we adopt
either method 1 or 2?  (I'd just like to have the same clear picture.)  For example,

1. All FDWs prepared successfully.
2. The local transaction prepared successfully, too.
3. Some FDWs committed successfully.
4. One FDW failed to send the commit request because the remote server went down.
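That four-step scenario, simulated with idea 2's behavior (leave the failed commit to the resolver); the server names and the failure are made up for illustration.

```python
# Simulation of the scenario above: all prepares succeed (steps 1-2), the
# backend then commits each remote once, and one remote is down (step 4).
# With idea 2 the backend records the leftover for the resolver instead of
# retrying in the foreground.

def run_scenario():
    remotes = ['fdw1', 'fdw2', 'fdw3']
    down = {'fdw3'}                 # step 4: this server went down
    committed, leftover = [], []
    for r in remotes:               # steps 3-4: one commit attempt each
        if r in down:
            leftover.append(r)      # idea 2: hand off to the resolver
        else:
            committed.append(r)
    return committed, leftover
```

The open question in the thread is what, if anything, the client should be told at this point, since the global outcome is already COMMIT.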


Regards
Takayuki Tsunakawa


RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> I think the discussion is based the behavior that any process that is
> responsible for finishing the 2pc-commit continue retrying remote
> commits until all of the remote-commits succeed.

Thank you for coming back.  We're talking about the first attempt to prepare and commit in each transaction, not the retry case.
 


> > Throws: HeuristicMixedException
> > Thrown to indicate that a heuristic decision was made and that some
> relevant updates have been
> > committed while others have been rolled back.

> I'm not sure about how JTA works in detail, but doesn't
> UserTransaction.commit() return HeuristicMixedException when some of
> relevant updates have been committed but others not? Isn't it the same
> state with the case where some of the remote servers failed on
> remote-commit while others succeeded?

No.  Taking the description literally and considering the relevant XA specification, it's not about the remote commit failure.  The remote server is not allowed to fail the commit once it has reported successful prepare, which is the contract of 2PC.  HeuristicMixedException is about the manual resolution, typically by the DBA, using the DBMS-specific tool or the standard commit()/rollback() API.
 


> (I guess that
> UserTransaction.commit() would throw RollbackException if
> remote-prepare has been failed for any of the remotes.)

Correct.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > > Why does the client have to know the error on a remote server, whereas the
> > global transaction itself is destined to commit?
> >
> > It's not necessarily on a remote server. It could be a problem with
> > the local server.
>
> Then, in what kind of scenario are we talking about the difficulty, and how is it difficult to handle, when we adopt either method 1 or 2?  (I'd just like to have the same clear picture.)

IMO, even though the FDW's commit/rollback transaction code could be
simple in some cases, I think we need to assume that any kind of error
(or even FATAL or PANIC) could be thrown from the FDW code. It could
be an error due to a temporary network problem, remote server down,
the driver's unexpected error, or out of memory, etc. Errors that happen
after the local transaction commit don't affect the global
transaction decision, as you mentioned. But the process or system
could be in a bad state. Also, users might expect the process to exit
on error by setting exit_on_error = on. Your idea sounds like we
have to ignore any errors happening after the local commit if they
don't affect the transaction outcome. That's too scary to me, and I think
it's a bad idea to blindly ignore all possible errors under such
conditions. That could make things worse and will likely be a
foot-gun. It would be good if we could prove that it's safe to ignore
those errors, but at least for me it's not clear how we can.

This situation is true even today; an error could happen after
committing the transaction. But I personally don’t want to add the
code that increases the likelihood.

Just to be clear, with your idea, we will ignore only ERROR or also
FATAL and PANIC? And if an error happens during committing one of the
prepared transactions on the foreign server, will we proceed with
committing other transactions or return OK to the client?


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > Then, in what kind of scenario are we talking about the difficulty, and how is
> it difficult to handle, when we adopt either the method 1 or 2?  (I'd just like to
> have the same clear picture.)
> 
> IMO, even though the FDW's commit/rollback transaction code could be
> simple in some cases, I think we need to assume that any kind of error
> (or even FATAL or PANIC) could be thrown from the FDW code. It could
> be an error due to a temporary network problem, remote server down,
> the driver's unexpected error, or out of memory, etc. Errors that happen
> after the local transaction commit don't affect the global
> transaction decision, as you mentioned. But the process or system
> could be in a bad state. Also, users might expect the process to exit
> on error by setting exit_on_error = on. Your idea sounds like we
> have to ignore any errors happening after the local commit if they
> don't affect the transaction outcome. That's too scary to me, and I think
> it's a bad idea to blindly ignore all possible errors under such
> conditions. That could make things worse and will likely be a
> foot-gun. It would be good if we could prove that it's safe to ignore
> those errors, but at least for me it's not clear how we can.
> 
> This situation is true even today; an error could happen after
> committing the transaction. But I personally don’t want to add the
> code that increases the likelihood.

I'm not talking about the code simplicity here (actually, I haven't reviewed the code around prepare and commit in the patch yet...)  Also, I don't understand well what you're trying to insist and what realistic situations you have in mind by citing exit_on_error, FATAL, PANIC and so on.  I just asked (in a different part) why the client has to know the error.

Just to be clear, I'm not saying that we should hide the error completely behind the scenes.  For example, you can allow the FDW to emit a WARNING if the DBMS-specific client driver returns an error when committing.  Further, if you want to allow the FDW to throw an ERROR when committing, the transaction manager in core can catch it by PG_TRY(), so that it can report back successful commit of the global transaction to the client while it leaves the handling of the failed commit of the FDW to the resolver.  (I don't think we want to use PG_TRY() during transaction commit for performance reasons, though.)
 

For the sake of argument, let's say we want to report the error of the committing FDW to the client.  In that case, we can use SQLSTATE 02xxx (Warning) and attach the error message.
 


> Just to be clear, with your idea, we will ignore only ERROR or also
> FATAL and PANIC? And if an error happens during committing one of the
> prepared transactions on the foreign server, will we proceed with
> committing other transactions or return OK to the client?

Neither FATAL nor PANIC can be ignored.  On FATAL, which means the termination of a particular session, the committing of the remote transaction should be taken over by the resolver.  Not to mention PANIC; we can't do anything then.  Otherwise, we proceed with committing the other FDWs, hand off the task of committing the failed FDW to the resolver, and report success to the client.  If you're not convinced, I'd like to ask you to investigate the code of some Java EE app server, say GlassFish, and share with us how it handles an error during commit.
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Wed, Jun 9, 2021 at 4:10 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <sawada.mshk@gmail.com>
> > On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com
> > <tsunakawa.takay@fujitsu.com> wrote:
> > > Then, in what kind of scenario are we talking about the difficulty, and how is
> > it difficult to handle, when we adopt either the method 1 or 2?  (I'd just like to
> > have the same clear picture.)
> >
> > IMO, even though the FDW's commit/rollback transaction code could be
> > simple in some cases, I think we need to assume that any kind of error
> > (or even FATAL or PANIC) could be thrown from the FDW code. It could
> > be an error due to a temporary network problem, remote server down,
> > the driver's unexpected error, or out of memory, etc. Errors that happen
> > after the local transaction commit don't affect the global
> > transaction decision, as you mentioned. But the process or system
> > could be in a bad state. Also, users might expect the process to exit
> > on error by setting exit_on_error = on. Your idea sounds like we
> > have to ignore any errors happening after the local commit if they
> > don't affect the transaction outcome. That's too scary to me, and I think
> > it's a bad idea to blindly ignore all possible errors under such
> > conditions. That could make things worse and will likely be a
> > foot-gun. It would be good if we could prove that it's safe to ignore
> > those errors, but at least for me it's not clear how we can.
> >
> > This situation is true even today; an error could happen after
> > committing the transaction. But I personally don’t want to add the
> > code that increases the likelihood.
>
> I'm not talking about the code simplicity here (actually, I haven't reviewed the code around prepare and commit in the patch yet...)  Also, I don't understand well what you're trying to insist and what realistic situations you have in mind by citing exit_on_error, FATAL, PANIC and so on.  I just asked (in a different part) why the client has to know the error.
>
> Just to be clear, I'm not saying that we should hide the error completely behind the scenes.  For example, you can allow the FDW to emit a WARNING if the DBMS-specific client driver returns an error when committing.  Further, if you want to allow the FDW to throw an ERROR when committing, the transaction manager in core can catch it by PG_TRY(), so that it can report back successful commit of the global transaction to the client while it leaves the handling of the failed commit of the FDW to the resolver.  (I don't think we want to use PG_TRY() during transaction commit for performance reasons, though.)
>
> For the sake of argument, let's say we want to report the error of the committing FDW to the client.  In that case, we can use SQLSTATE 02xxx (Warning) and attach the error message.
>

Maybe it's better to start a new thread to discuss this topic. If your
idea is good, we can lower all errors that happen after writing the
commit record to warnings, reducing the cases where the client gets
confused by receiving an error after the commit.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Masahiko Sawada <sawada.mshk@gmail.com>
> Maybe it's better to start a new thread to discuss this topic. If your
> idea is good, we can lower all errors that happen after writing the
> commit record to warnings, reducing the cases where the client gets
> confused by receiving an error after the commit.

No.  It's an important part because it determines the 2PC behavior and performance.  This discussion had started from the concern about performance before Ikeda-san reported pathological results.  Don't rush forward, hoping someone will commit the current patch.  I'm afraid you just don't want to change your design and code.  Let's face the real issue.
 

As I said before, and as Ikeda-san's performance benchmark results show, I have to say the design isn't done sufficiently.  I talked with Fujii-san the other day about this patch.  The patch is already huge and it's difficult to decode how the patch works, e.g., what kind of new WAL records it emits, how many disk writes it adds, how errors are handled, whether/how it's different from the textbook or other existing designs, etc.  What happened to my request to add such a design description to the following page, so that reviewers can consider the design before spending much time on looking at the code?  What's the situation of the new FDW API that should naturally accommodate other FDW implementations?

Atomic Commit of Distributed Transactions
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

Design should come first.  I don't think it's a sincere attitude to require reviewers to spend a long time reading the design out of huge code.
 


Regards
Takayuki Tsunakawa

Re: Transactions involving multiple postgres foreign servers, take 2

From
Kyotaro Horiguchi
Date:
At Tue, 8 Jun 2021 08:45:24 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in 
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > I think the discussion is based on the behavior that any process that is
> > responsible for finishing the 2pc-commit continue retrying remote
> > commits until all of the remote-commits succeed.
> 
> Thank you for coming back.  We're talking about the first attempt to prepare and commit in each transaction, not the retry case.
 

If we accept that each elementary commit (via an FDW connection) can
fail, there's no way the root 2pc-commit can be guaranteed to succeed.
How can we ignore the fdw-error in that case?

> > > Throws: HeuristicMixedException
> > > Thrown to indicate that a heuristic decision was made and that some
> > relevant updates have been
> > > committed while others have been rolled back.
> 
> > I'm not sure about how JTA works in detail, but doesn't
> > UserTransaction.commit() return HeuristicMixedException when some of
> > relevant updates have been committed but others not? Isn't it the same
> > state with the case where some of the remote servers failed on
> > remote-commit while others succeeded?
> 
> No.  Taking the description literally and considering the relevant XA specification, it's not about the remote commit failure.  The remote server is not allowed to fail the commit once it has reported successful prepare, which is the contract of 2PC.  HeuristicMixedException is about the manual resolution, typically by the DBA, using the DBMS-specific tool or the standard commit()/rollback() API.
 

Mmm. The above seems as if saying that 2pc-commit does not interact
with remotes.  The interface contract does not cover everything that
happens in the real world. If a remote commit fails, that is just an
issue outside of the 2pc world.  In reality a remote commit may fail for
all sorts of reasons.

https://www.ibm.com/docs/ja/db2-for-zos/11?topic=support-example-distributed-transaction-that-uses-jta-methods

>      }      catch (javax.transaction.xa.XAException xae)
>      { // Distributed transaction failed, so roll it back.
>        // Report XAException on prepare/commit.

This suggests that both XAResource.prepare() and commit() can throw an
exception.

> > (I guess that
> > UserTransaction.commit() would throw RollbackException if
> > remote-prepare has been failed for any of the remotes.)
> 
> Correct.

So UserTransaction.commit() does not throw the same exception if
a remote commit fails.  Isn't HeuristicMixedException the exception
thrown in that case?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> If we accept each elementary-commit (via FDW connection) to fail, the
> parent(?) there's no way the root 2pc-commit can succeed.  How can we
> ignore the fdw-error in that case?

No, we don't ignore the error during FDW commit.  As mentioned at the end of this mail, the question is how the FDW reports the error to the caller (the transaction manager in Postgres core), and how we should handle it.

As below, Glassfish catches the resource manager's error during commit, retries the commit if the error is transient or a communication failure, and hands off the processing of the failed commit to the recovery manager.  (I used all of my energy today; I'd be grateful if someone could figure out whether Glassfish reports the error to the application.)


[XATerminatorImpl.java]
    public void commit(Xid xid, boolean onePhase) throws XAException {
...
                } else {
                    coord.commit();
                }


[TopCoordinator.java]
        // Commit all participants.  If a fatal error occurs during
        // this method, then the process must be ended with a fatal error.
...
            try {
                participants.distributeCommit();
            } catch (Throwable exc) {


[RegisteredResources.java]
    void distributeCommit() throws HeuristicMixed, HeuristicHazard, NotPrepared {
...
        // Browse through the participants, committing them. The following is
        // intended to be done asynchronously as a group of operations.
...
                // Tell the resource to commit.
                // Catch any exceptions here; keep going until
                // no exception is left.
...
                            // If the exception is neither TRANSIENT or
                            // COMM_FAILURE, it is unexpected, so display a
                            // message and give up with this Resource.
...
                            // For TRANSIENT or COMM_FAILURE, wait
                            // for a while, then retry the commit.
...
                            // If the retry limit has been exceeded,
                            // end the process with a fatal error.
...
        if (!transactionCompleted) {
            if (coord != null)
                RecoveryManager.addToIncompleTx(coord, true);
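The retry-then-hand-off pattern excerpted above can be sketched as follows. This is a simplified illustration in plain Java; the `Participant` interface, `distributeCommit` signature, and retry policy are our own inventions for the sketch, not Glassfish's actual API:

```java
import java.util.ArrayList;
import java.util.List;

public class DistributeCommit {
    // Stand-in for one prepared participant (e.g. one remote server).
    interface Participant {
        void commit() throws Exception;
    }

    // Commit every participant, catching failures per participant and
    // retrying up to a limit. Participants that still fail are returned to
    // the caller, which hands them to a recovery/resolver component instead
    // of aborting the already-decided global transaction.
    static List<Participant> distributeCommit(List<Participant> all, int retryLimit) {
        List<Participant> unresolved = new ArrayList<>();
        for (Participant p : all) {
            boolean done = false;
            for (int attempt = 0; attempt <= retryLimit && !done; attempt++) {
                try {
                    p.commit();
                    done = true;
                } catch (Exception e) {
                    // Transient or communication failure: wait and retry
                    // (backoff omitted in this sketch).
                }
            }
            if (!done) {
                unresolved.add(p);  // retry limit exceeded: recovery manager's job
            }
        }
        return unresolved;
    }

    public static void main(String[] args) {
        List<Participant> ps = new ArrayList<>();
        ps.add(() -> System.out.println("participant 1 committed"));
        ps.add(() -> { throw new Exception("remote server down"); });
        List<Participant> unresolved = distributeCommit(ps, 2);
        System.out.println("unresolved: " + unresolved.size()); // prints "unresolved: 1"
    }
}
```

Note the key design point the thread keeps returning to: a failed participant commit does not roll anything back, because the global decision was already made at prepare time; it only changes *who* finishes the commit.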


> > No.  Taking the description literally and considering the relevant XA
> specification, it's not about the remote commit failure.  The remote server is
> not allowed to fail the commit once it has reported successful prepare, which is
> the contract of 2PC.  HeuristicMixedException is about the manual resolution,
> typically by the DBA, using the DBMS-specific tool or the standard
> commit()/rollback() API.
>
> Mmm. The above seems as if saying that 2pc-comit does not interact
> with remotes.  The interface contract does not cover everything that
> happens in the real world. If remote-commit fails, that is just an
> issue outside of the 2pc world.  In reality remote-commit may fail for
> all reasons.

The following part of the XA specification is relevant.  We're considering modeling the FDW 2PC interface on XA, because it seems like the only standard interface and thus one that other FDWs could naturally take advantage of, aren't we?  Then we need to take care of such things as this.  The interface design is not easy.  So, proper design and its review should come first, before going deeper into the huge code patch.

2.3.3 Heuristic Branch Completion
--------------------------------------------------
Some RMs may employ heuristic decision-making: an RM that has prepared to
commit a transaction branch may decide to commit or roll back its work independently
of the TM. It could then unlock shared resources. This may leave them in an
inconsistent state. When the TM ultimately directs an RM to complete the branch, the
RM may respond that it has already done so. The RM reports whether it committed
the branch, rolled it back, or completed it with mixed results (committed some work
and rolled back other work).

An RM that reports heuristic completion to the TM must not discard its knowledge of
the transaction branch. The TM calls the RM once more to authorise it to forget the
branch. This requirement means that the RM must notify the TM of all heuristic
decisions, even those that match the decision the TM requested. The referenced
OSI DTP specifications (model) and (service) define heuristics more precisely.
--------------------------------------------------


> https://www.ibm.com/docs/ja/db2-for-zos/11?topic=support-example-distributed-transaction-that-uses-jta-methods
> This suggests that both XAResource.prepare() and commit() can throw an
> exception.

Yes, XAResource.commit() can throw an exception:

void commit(Xid xid, boolean onePhase) throws XAException

Throws: XAException
An error has occurred. Possible XAExceptions are XA_HEURHAZ, XA_HEURCOM,
XA_HEURRB, XA_HEURMIX, XAER_RMERR, XAER_RMFAIL, XAER_NOTA,
XAER_INVAL, or XAER_PROTO.

This is equivalent to xa_commit() in the XA specification.  xa_commit() can return error codes that have the same names as above.
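As a rough illustration of how a caller might act on those codes, here is a sketch using the standard javax.transaction.xa.XAException constants shipped with the JDK. The classification policy (which codes mean "retry", "escalate", etc.) is our assumption for illustration, not something prescribed by the thread or the XA specification:

```java
import javax.transaction.xa.XAException;

public class CommitOutcome {
    // Sketch: map an XAException error code from XAResource.commit() to a
    // coarse action for the transaction manager. The grouping below is an
    // assumed policy, chosen only to illustrate the discussion.
    static String classify(int errorCode) {
        switch (errorCode) {
            case XAException.XA_RETRY:      // commit may be retried later
            case XAException.XAER_RMFAIL:   // RM unavailable, e.g. server down
                return "retry";
            case XAException.XA_HEURCOM:    // branch already heuristically committed
                return "forget";            // outcome matches; authorize RM to forget
            case XAException.XA_HEURRB:     // heuristically rolled back
            case XAException.XA_HEURMIX:    // partially committed, partially rolled back
            case XAException.XA_HEURHAZ:    // outcome unknown (hazard)
                return "escalate";          // hand to resolver / DBA for manual resolution
            default:                        // XAER_RMERR, XAER_NOTA, XAER_INVAL, XAER_PROTO
                return "fatal";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(XAException.XAER_RMFAIL)); // prints "retry"
    }
}
```

This mirrors the Glassfish behavior cited earlier: transient/communication failures are retried, while heuristic outcomes must be surfaced for out-of-band resolution rather than silently dropped.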

The question we're trying to answer here is:

* How should such an error be handled?
Glassfish (and possibly other Java EE servers) catches the error, continues to commit the rest of the participants, and handles the failed resource manager's commit in the background.  In Postgres, if we allow FDWs to do ereport(ERROR), how can we do similar things?

* Should we report the error to the client?  If yes, should it be reported as a failure of commit, or as an informational message (WARNING) of a successful commit?  Why does the client want to know the error, when the global transaction's commit has already been promised?


Regards
Takayuki Tsunakawa




Re: Transactions involving multiple postgres foreign servers, take 2

From
Robert Haas
Date:
On Fri, Jun 4, 2021 at 4:04 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> Why does the user have to get an error?  Once the local transaction has been prepared, which means all remote ones also have been prepared, the whole transaction is determined to commit.  So, the user doesn't have to receive an error as long as the local node is alive.

That is completely unrealistic. As Sawada-san has pointed out
repeatedly, there are tons of things that can go wrong even after the
remote side has prepared the transaction. Preparing a transaction only
promises that the remote side will let you commit the transaction upon
request. It doesn't guarantee that you'll be able to make the request.
Like Sawada-san says, network problems, out of memory issues, or many
other things could stop that from happening. Someone could come along
in another session and run "ROLLBACK PREPARED" on the remote side, and
now the "COMMIT PREPARED" will never succeed no matter how many times
you try it. At least, not unless someone goes and creates a new
prepared transaction with the same 2PC identifier, but then you won't
be committing the correct transaction anyway. Or someone could take
the remote server and drop it in a volcano. How do you propose that we
avoid giving the user an error after the remote server has been
dropped into a volcano, even though the local node is still alive?

Also, leaving aside theoretical arguments, I think it's not
realistically possible for an FDW author to write code to commit a
prepared transaction that will be safe in the context of running late
in PrepareTransaction(), after we've already done
RecordTransactionCommit(). Such code can't avoid throwing errors
because it can't avoid performing operations and allocating memory.
It's already been mentioned that, if an ERROR is thrown, it would be
reported to the user in place of the COMMIT acknowledgement that they
are expecting. Now, it has also been suggested that we could downgrade
the ERROR to a WARNING and still report the COMMIT. That doesn't sound
easy to do, because when the ERROR happens, control is going to jump
to AbortTransaction(). But even if you could hack it so it works like
that, it doesn't really solve the problem. What about all of the other
servers where the prepared transaction also needs to be committed? In
the design of PostgreSQL, in all circumstances, the way you recover
from an error is to abort the transaction. That is what brings the
system back to a clean state. You can't simply ignore the requirement
to abort the transaction and keep doing more work. It will never be
reliable, and Tom will instantaneously demand that any code works like
that be reverted -- and for good reason.

I am not sure that it's 100% impossible to find a way to solve this
problem without just having the resolver do all the work, but I think
it's going to be extremely difficult. We tried to figure out some
vaguely similar things while working on undo, and it really didn't go
very well. The later stages of CommitTransaction() and
AbortTransaction() are places where very few kinds of code are safe to
execute, and finding a way to patch around that problem is not simple
either. If the resolver performance is poor, perhaps we could try to
find a way to improve it. I don't know. But I don't think it does any
good to say, well, no errors can occur after the remote transaction is
prepared. That's clearly incorrect.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Robert Haas <robertmhaas@gmail.com>
> That is completely unrealistic. As Sawada-san has pointed out
> repeatedly, there are tons of things that can go wrong even after the
> remote side has prepared the transaction. Preparing a transaction only
> promises that the remote side will let you commit the transaction upon
> request. It doesn't guarantee that you'll be able to make the request.
> Like Sawada-san says, network problems, out of memory issues, or many
> other things could stop that from happening. Someone could come along
> in another session and run "ROLLBACK PREPARED" on the remote side, and
> now the "COMMIT PREPARED" will never succeed no matter how many times
> you try it. At least, not unless someone goes and creates a new
> prepared transaction with the same 2PC identifier, but then you won't
> be committing the correct transaction anyway. Or someone could take
> the remote server and drop it in a volcano. How do you propose that we
> avoid giving the user an error after the remote server has been
> dropped into a volcano, even though the local node is still alive?

I understand that.  As I cited yesterday and possibly before, that's why xa_commit() returns various return codes.  So, I have never suggested that FDWs should not report an error and always report success for the commit request.  They should be allowed to report an error.
 

The question I have been asking is how.  With that said, we should only have two options: one is the return value of the FDW commit routine, and the other is via ereport(ERROR).  I suggested the possibility of the former, because if the FDW does ereport(ERROR), Postgres core (the transaction manager) may have difficulty in handling the rest of the participants.


> Also, leaving aside theoretical arguments, I think it's not
> realistically possible for an FDW author to write code to commit a
> prepared transaction that will be safe in the context of running late
> in PrepareTransaction(), after we've already done
> RecordTransactionCommit(). Such code can't avoid throwing errors
> because it can't avoid performing operations and allocating memory.

I'm not completely sure about this.  I thought (and said) that the only thing the FDW does would be to send a commit request through an existing connection.  So, I think it's not a severe restriction to require FDWs to do ereport(ERROR) during commits (of the second phase of 2PC.)
 


> It's already been mentioned that, if an ERROR is thrown, it would be
> reported to the user in place of the COMMIT acknowledgement that they
> are expecting. Now, it has also been suggested that we could downgrade
> the ERROR to a WARNING and still report the COMMIT. That doesn't sound
> easy to do, because when the ERROR happens, control is going to jump
> to AbortTransaction(). But even if you could hack it so it works like
> that, it doesn't really solve the problem. What about all of the other
> servers where the prepared transaction also needs to be committed? In
> the design of PostgreSQL, in all circumstances, the way you recover
> from an error is to abort the transaction. That is what brings the
> system back to a clean state. You can't simply ignore the requirement
> to abort the transaction and keep doing more work. It will never be
> reliable, and Tom will instantaneously demand that any code works like
> that be reverted -- and for good reason.

(I took "abort" as the same as "rollback" here.)  Once we've sent commit requests to some participants, we can't abort
thetransaction.  If one FDW returned an error halfway, we need to send commit requests to the rest of participants.
 

It's a design question, as I repeatedly said, whether and how we should report the error of some participants to the client.  For instance, how should we report the errors of multiple participants?  Concatenate those error messages?
 

Anyway, we should design the interface first, giving much thought and respecting the ideas of predecessors (TX/XA, MS DTC, JTA/JTS).  Otherwise, we may end up with "We implemented it like this, so the interface is like this and it can only behave like this, although you may find it strange..."  That might be a situation similar to what your comment "in the design of PostgreSQL, in all circumstances, the way you recover from an error is to abort the transaction" suggests -- Postgres doesn't have statement-level rollback.
 


> I am not sure that it's 100% impossible to find a way to solve this
> problem without just having the resolver do all the work, but I think
> it's going to be extremely difficult. We tried to figure out some
> vaguely similar things while working on undo, and it really didn't go
> very well. The later stages of CommitTransaction() and
> AbortTransaction() are places where very few kinds of code are safe to
> execute, and finding a way to patch around that problem is not simple
> either. If the resolver performance is poor, perhaps we could try to
> find a way to improve it. I don't know. But I don't think it does any
> good to say, well, no errors can occur after the remote transaction is
> prepared. That's clearly incorrect.

I don't think the resolver-based approach would bring us far enough.  It's fundamentally a bottleneck.  Such a background process should only handle commits whose requests failed to be sent due to server down.
 

My requests are only twofold and haven't changed for long: design the FDW interface that implementors can naturally follow, and design to ensure performance.
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Robert Haas
Date:
On Thu, Jun 10, 2021 at 9:58 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> I understand that.  As I cited yesterday and possibly before, that's why xa_commit() returns various return codes.  So, I have never suggested that FDWs should not report an error and always report success for the commit request.  They should be allowed to report an error.

In the text to which I was responding it seemed like you were saying
the opposite. Perhaps I misunderstood.

> The question I have been asking is how.  With that said, we should only have two options: one is the return value of the FDW commit routine, and the other is via ereport(ERROR).  I suggested the possibility of the former, because if the FDW does ereport(ERROR), Postgres core (the transaction manager) may have difficulty in handling the rest of the participants.

I don't think that is going to work. It is very difficult to write
code that doesn't ever ERROR in PostgreSQL. It is not impossible if
the operation is trivial enough, but I think you're greatly
underestimating the complexity of committing the remote transaction.
If somebody had designed PostgreSQL so that every function returns a
return code and every time you call some other function you check that
return code and pass any error up to your own caller, then there would
be no problem here. But in fact the design was that at the first sign
of trouble you throw an ERROR. It's not easy to depart from that
programming model in just one place.

> > Also, leaving aside theoretical arguments, I think it's not
> > realistically possible for an FDW author to write code to commit a
> > prepared transaction that will be safe in the context of running late
> > in PrepareTransaction(), after we've already done
> > RecordTransactionCommit(). Such code can't avoid throwing errors
> > because it can't avoid performing operations and allocating memory.
>
> I'm not completely sure about this.  I thought (and said) that the only thing the FDW does would be to send a commit
request through an existing connection.  So, I think it's not a severe restriction to require FDWs not to do ereport(ERROR)
during commits (of the second phase of 2PC.) 

To send a commit request through an existing connection, you have to
send some bytes over the network using a send() or write() system
call. That can fail. Then you have to read the response back over the
network using recv() or read(). That can also fail. You also need to
parse the result that you get from the remote side, which can also
fail, because you could get back garbage for some reason. And
depending on the details, you might first need to construct the
message you're going to send, which might be able to fail too. Also,
the data might be encrypted using SSL, so you might have to decrypt
it, which can also fail, and you might need to encrypt data before
sending it, which can fail. In fact, if you're using OpenSSL,
trying to call SSL_read() or SSL_write() can both read and write data
from the socket, even multiple times, so you have extra opportunities
to fail.
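To make those failure points concrete, here is a minimal sketch, using plain POSIX sockets rather than the actual postgres_fdw code; the enum names and the one-line wire protocol are invented for illustration. Each step -- the send(), the recv(), and the parse of the reply -- can fail independently:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Possible outcomes of one commit attempt (names invented here). */
typedef enum
{
    COMMIT_OK,
    COMMIT_SEND_FAILED,
    COMMIT_RECV_FAILED,
    COMMIT_BAD_RESPONSE
} CommitResult;

/*
 * Send "COMMIT\n" on an already-open socket and read a one-line reply.
 * Every step can fail on its own: the network write, the network read,
 * and the parsing of whatever bytes came back.
 */
static CommitResult
send_commit_request(int sockfd)
{
    char        buf[64];
    ssize_t     n;

    if (send(sockfd, "COMMIT\n", 7, 0) != 7)
        return COMMIT_SEND_FAILED;  /* network write failed */

    n = recv(sockfd, buf, sizeof(buf) - 1, 0);
    if (n <= 0)
        return COMMIT_RECV_FAILED;  /* network read failed, or EOF */

    buf[n] = '\0';
    if (strncmp(buf, "OK", 2) != 0)
        return COMMIT_BAD_RESPONSE; /* got back garbage */

    return COMMIT_OK;
}
```

With SSL in the picture each of these steps gains further failure modes, as described above, but the shape of the problem is already visible with the bare system calls.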

> (I took "abort" as the same as "rollback" here.)  Once we've sent commit requests to some participants, we can't
abort the transaction.  If one FDW returned an error halfway, we need to send commit requests to the rest of the
participants.

I understand that it's not possible to abort the local transaction
after it's been committed, but that doesn't mean that we're
going to be able to send the commit requests to the rest of the
participants. We want to be able to do that, certainly, but there's no
guarantee that it's actually possible. Again, the remote servers may
be dropped into a volcano, or less seriously, we may not be able to
access them. Also, someone may kill off our session.

> It's a design question, as I repeatedly said, whether and how we should report the error of some participants to the
client. For instance, how should we report the errors of multiple participants?  Concatenate those error messages? 

Sure, I agree that there are some questions about how to report errors.

> Anyway, we should design the interface first, giving much thought and respecting the ideas of predecessors (TX/XA, MS
DTC, JTA/JTS).  Otherwise, we may end up like "We implemented like this, so the interface is like this and it can only
behave like this, although you may find it strange..."  That might be a situation similar to what your comment "the
design of PostgreSQL, in all circumstances, the way you recover from an error is to abort the transaction" suggests --
Postgres doesn't have statement-level rollback. 

I think that's a valid concern, but we also have to have a plan that
is realistic. Some things are indeed not possible in PostgreSQL's
design. Also, some of these problems are things everyone has to
somehow confront. There's no database doing 2PC that can't have a
situation where one of the machines disappears unexpectedly due to
some natural disaster or administrator interference. It might be the
case that our inability to do certain things safely during transaction
commit puts us out of compliance with the spec, but it can't be the
case that some other system has no possible failures during
transaction commit. The problem of the network potentially being
disconnected between one packet and the next exists in every system.

> I don't think the resolver-based approach would bring us far enough.  It's fundamentally a bottleneck.  Such a
background process should only handle commits whose requests failed to be sent due to server down. 

Why is it fundamentally a bottleneck? It seems to me in some cases it
could scale better than any other approach. If we have to commit on
100 shards in only one process we can only do those commits one at a
time. If we can use resolver processes we could do all 100 at once if
the user can afford to run that many resolvers, which should be way
faster. It is true that if the resolver does not have a connection
open and must open one, that might be slow, but presumably after that
it can keep the connection open and reuse it for subsequent
distributed transactions. I don't really see why that should be
particularly slow.
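As an illustration of why fanning the work out can scale, here is a minimal sketch using plain pthreads; the commits are simulated by setting flags, and none of the names come from the actual patch. A real resolver would send COMMIT PREPARED over its cached connection instead:

```c
#include <pthread.h>

#define NSHARDS 100

/* One flag per shard; set when that shard's commit has been "resolved". */
static int committed[NSHARDS];

/* One resolver per shard: stand-in for sending COMMIT PREPARED remotely. */
static void *
resolver_main(void *arg)
{
    int         shard = *(int *) arg;

    committed[shard] = 1;
    return NULL;
}

/*
 * Fan the 100 commits out to one resolver thread per shard instead of
 * performing them serially in one backend.  Returns how many committed.
 */
static int
resolve_all_parallel(void)
{
    pthread_t   threads[NSHARDS];
    int         ids[NSHARDS];
    int         done = 0;

    for (int i = 0; i < NSHARDS; i++)
    {
        ids[i] = i;
        pthread_create(&threads[i], NULL, resolver_main, &ids[i]);
    }
    for (int i = 0; i < NSHARDS; i++)
    {
        pthread_join(threads[i], NULL);
        done += committed[i];
    }
    return done;
}
```

In the simulated version the parallelism buys nothing, but once each "commit" carries a real network round trip, the serial cost of 100 round trips collapses to roughly the cost of the slowest one.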

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/05/11 13:37, Masahiko Sawada wrote:
> I've attached the updated patches that incorporated comments from
> Zhihong and Ikeda-san.

Thanks for updating the patches!

I'm still reading these patches, but I'd like to share some review comments
that I found so far.

(1)
+/* Remove the foreign transaction from FdwXactParticipants */
+void
+FdwXactUnregisterXact(UserMapping *usermapping)
+{
+    Assert(IsTransactionState());
+    RemoveFdwXactEntry(usermapping->umid);
+}

Currently there is no user of FdwXactUnregisterXact().
This function should be removed?


(2)
When I ran the regression test, I got the following failure.

========= Contents of ./src/test/modules/test_fdwxact/regression.diffs
diff -U3 /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out
/home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out
--- /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out    2021-06-10
02:19:43.808622747+0000
 
+++ /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out    2021-06-10
02:29:53.452410462+0000
 
@@ -174,7 +174,7 @@
  SELECT count(*) FROM pg_foreign_xacts;
   count
  -------
-     1
+     4
  (1 row)


(3)
+                 errmsg("could not read foreign transaction state from xlog at %X/%X",
+                        (uint32) (lsn >> 32),
+                        (uint32) lsn)));

LSN_FORMAT_ARGS() should be used?
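For reference, LSN_FORMAT_ARGS() just expands a 64-bit LSN into the two 32-bit halves that the %X/%X format string expects; a self-contained sketch of the same idea (with the type and helper reduced for illustration):

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/*
 * Same idea as PostgreSQL's LSN_FORMAT_ARGS(): expand one 64-bit LSN
 * into the two 32-bit halves for a "%X/%X" format string, instead of
 * open-coding the shift and cast at every call site.
 */
#define LSN_FORMAT_ARGS(lsn) ((uint32_t) ((lsn) >> 32)), ((uint32_t) (lsn))

static void
format_lsn(char *buf, size_t buflen, XLogRecPtr lsn)
{
    snprintf(buf, buflen, "%X/%X", LSN_FORMAT_ARGS(lsn));
}
```

Using the macro keeps every errmsg() call consistent and avoids mistakes like swapping the halves or dropping a cast.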


(4)
+extern void RecreateFdwXactFile(TransactionId xid, Oid umid, void *content,
+                                int len);

Since RecreateFdwXactFile() is used only in fdwxact.c,
the above "extern" is not necessary?


(5)
+2. Pre-Commit phase (1st phase of two-phase commit)
+we record the corresponding WAL indicating that the foreign server is involved
+with the current transaction before doing PREPARE on all foreign transactions.
+Thus, in case we lose connectivity to the foreign server or crash ourselves,
+we will remember that we might have prepared the transaction on the foreign
+server, and try to resolve it when connectivity is restored or after crash
+recovery.

So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for
XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there
for WAL record to be replicated to the standby if sync replication is enabled?
Otherwise, when the failover happens, new primary (past-standby)
might not have enough XLOG_FDWXACT_INSERT WAL records and
might fail to find some in-doubt foreign transactions.


(6)
XLogFlush() is called for each foreign transaction. So if there are many
foreign transactions, XLogFlush() is called too frequently. Which might
cause unnecessary performance overhead? Instead, for example,
we should call XLogFlush() only once in FdwXactPrepareForeignTransactions()
after inserting all WAL records for all foreign transactions?
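A minimal sketch of that batching idea, with counters standing in for the real WAL insert and flush and with invented function names: insert all the records first, then flush once up to the LSN of the last record.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static int  flush_calls = 0;    /* counts simulated XLogFlush() calls */

/* Stand-in for XLogFlush(): one fsync of WAL up to the given LSN. */
static void
xlog_flush(XLogRecPtr upto)
{
    (void) upto;
    flush_calls++;
}

/*
 * Insert one WAL record per foreign transaction, but flush only once,
 * after the last insert, instead of flushing after every record.
 * nxacts fsyncs collapse into a single one.  Returns flush_calls.
 */
static int
prepare_foreign_transactions(int nxacts)
{
    XLogRecPtr  last_lsn = 0;

    for (int i = 0; i < nxacts; i++)
        last_lsn = (XLogRecPtr) (i + 1);    /* stand-in for XLogInsert() */

    xlog_flush(last_lsn);                   /* single flush for the batch */
    return flush_calls;
}
```

Since a flush to a later LSN also durably covers every earlier record, flushing once up to the last insert's LSN gives the same durability guarantee as flushing after each insert.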


(7)
      /* Open connection; report that we'll create a prepared statement. */
      fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
+    MarkConnectionModified(user);

MarkConnectionModified() should be called also when TRUNCATE on
a foreign table is executed?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Robert Haas <robertmhaas@gmail.com>
> On Thu, Jun 10, 2021 at 9:58 PM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > The question I have been asking is how.  With that said, we should only have
> two options; one is the return value of the FDW commit routine, and the other is
> via ereport(ERROR).  I suggested the possibility of the former, because if the
> FDW does ereport(ERROR), Postgres core (transaction manager) may have
> difficulty in handling the rest of the participants.
> 
> I don't think that is going to work. It is very difficult to write
> code that doesn't ever ERROR in PostgreSQL. It is not impossible if
> the operation is trivial enough, but I think you're greatly
> underestimating the complexity of committing the remote transaction.
> If somebody had designed PostgreSQL so that every function returns a
> return code and every time you call some other function you check that
> return code and pass any error up to your own caller, then there would
> be no problem here. But in fact the design was that at the first sign
> of trouble you throw an ERROR. It's not easy to depart from that
> programming model in just one place.

> > I'm not completely sure about this.  I thought (and said) that the only thing
> the FDW does would be to send a commit request through an existing
> connection.  So, I think it's not a severe restriction to require FDWs not to do
> ereport(ERROR) during commits (of the second phase of 2PC.)
> 
> To send a commit request through an existing connection, you have to
> send some bytes over the network using a send() or write() system
> call. That can fail. Then you have to read the response back over the
> network using recv() or read(). That can also fail. You also need to
> parse the result that you get from the remote side, which can also
> fail, because you could get back garbage for some reason. And
> depending on the details, you might first need to construct the
> message you're going to send, which might be able to fail too. Also,
> the data might be encrypted using SSL, so you might have to decrypt
> it, which can also fail, and you might need to encrypt data before
> sending it, which can fail. In fact, if you're using OpenSSL,
> trying to call SSL_read() or SSL_write() can both read and write data
> from the socket, even multiple times, so you have extra opportunities
> to fail.

I know sending a commit request may get an error from various underlying functions, but we're talking about the client
side, not Postgres's server side that could unexpectedly ereport(ERROR) somewhere.  So, the new FDW commit routine
won't lose control and can return an error code as its return value.  For instance, the FDW commit routine for DBMS-X
would typically be:
 

int
DBMSXCommit(...)
{
    int ret;

    /* extract info from the argument to pass to xa_commit() */

    /*
     * This is the actual commit function, which is exposed to the app
     * server (e.g. Tuxedo) through the xa_commit() interface.
     */
    ret = DBMSX_xa_commit(...);

    /* map xa_commit() return values to the corresponding return values of the FDW commit routine */
    switch (ret)
    {
        case XA_RMERR:
            ret = ...;
            break;
        ...
    }

    return ret;
}


> I think that's a valid concern, but we also have to have a plan that
> is realistic. Some things are indeed not possible in PostgreSQL's
> design. Also, some of these problems are things everyone has to
> somehow confront. There's no database doing 2PC that can't have a
> situation where one of the machines disappears unexpectedly due to
> some natural disaster or administrator interference. It might be the
> case that our inability to do certain things safely during transaction
> commit puts us out of compliance with the spec, but it can't be the
> case that some other system has no possible failures during
> transaction commit. The problem of the network potentially being
> disconnected between one packet and the next exists in every system.

So, we need to design how commit behaves from the user's perspective.  That's the functional design.  We should figure
outwhat's the desirable response of commit first, and then see if we can implement it or have to compromise in some
way. I think we can reference the X/Open TX standard and/or JTS (Java Transaction Service) specification (I haven't had
achance to read them yet, though.)  Just in case we can't find the requested commit behavior in the volcano case from
thosespecifications, ... (I'm hesitant to say this because it may be hard,) it's desirable to follow representative
productssuch as Tuxedo and GlassFish (the reference implementation of Java EE specs.)
 


> > I don't think the resolver-based approach would bring us far enough.  It's
> fundamentally a bottleneck.  Such a background process should only handle
> commits whose requests failed to be sent due to server down.
> 
> Why is it fundamentally a bottleneck? It seems to me in some cases it
> could scale better than any other approach. If we have to commit on
> 100 shards in only one process we can only do those commits one at a
> time. If we can use resolver processes we could do all 100 at once if
> the user can afford to run that many resolvers, which should be way
> faster. It is true that if the resolver does not have a connection
> open and must open one, that might be slow, but presumably after that
> it can keep the connection open and reuse it for subsequent
> distributed transactions. I don't really see why that should be
> particularly slow.

Concurrent transactions are serialized at the resolver.  I heard that the current patch handles 2PC like this: the TM
(transaction manager in Postgres core) requests prepare to the resolver, the resolver sends prepare to the remote server
and waits for the reply, the TM gets back control from the resolver, the TM requests commit to the resolver, the resolver sends
commit to the remote server and waits for the reply, and the TM gets back control.  The resolver handles one transaction at a
time.

In regard to the case where one session has to commit on multiple remote servers, we're talking about the asynchronous
interface just like what the XA standard provides.
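A minimal sketch of that asynchronous pattern, using plain POSIX sockets with an invented one-line wire protocol rather than the XA API itself: send PREPARE to every participant first, then collect all the replies, so the network round trips overlap instead of being paid one after another.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Pipelined prepare phase: phase A fires the PREPARE request at every
 * participant without waiting; phase B reaps the replies.  Returns the
 * number of participants that answered "OK", or -1 on a send failure.
 */
static int
prepare_all(const int *fds, int nfds)
{
    char        buf[16];
    int         ok = 0;

    for (int i = 0; i < nfds; i++)  /* phase A: send all requests */
    {
        if (send(fds[i], "PREPARE\n", 8, 0) != 8)
            return -1;
    }

    for (int i = 0; i < nfds; i++)  /* phase B: collect all replies */
    {
        ssize_t n = recv(fds[i], buf, sizeof(buf) - 1, 0);

        if (n > 0)
        {
            buf[n] = '\0';
            if (strncmp(buf, "OK", 2) == 0)
                ok++;
        }
    }
    return ok;
}
```

With N participants and a round-trip time of t, the serial version costs roughly N*t, while the pipelined version costs roughly t plus the servers' processing time, without needing a separate resolver process at all.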
 


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Robert Haas
Date:
On Sun, Jun 13, 2021 at 10:04 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> I know sending a commit request may get an error from various underlying functions, but we're talking about the
client side, not Postgres's server side that could unexpectedly ereport(ERROR) somewhere.  So, the new FDW commit
routine won't lose control and can return an error code as its return value.  For instance, the FDW commit routine for
DBMS-X would typically be: 
>
> int
> DBMSXCommit(...)
> {
>         int ret;
>
>         /* extract info from the argument to pass to xa_commit() */
>
>         ret = DBMSX_xa_commit(...);
>         /* This is the actual commit function which is exposed to the app server (e.g. Tuxedo) through the
xa_commit() interface */ 
>
>         /* map xa_commit() return values to the corresponding return values of the FDW commit routine */
>         switch (ret)
>         {
>                 case XA_RMERR:
>                         ret = ...;
>                         break;
>                 ...
>         }
>
>         return ret;
> }

Well, we're talking about running this commit routine from within
CommitTransaction(), right? So I think it is in fact running in the
server. And if that's so, then you have to worry about how to make it
respond to interrupts. You can't just call some functions
DBMSX_xa_commit() and wait indefinitely for it to return. Look at
pgfdw_get_result() for an example of what real code that does this looks
like.

> So, we need to design how commit behaves from the user's perspective.  That's the functional design.  We should
figure out what's the desirable response of commit first, and then see if we can implement it or have to compromise in
some way.  I think we can reference the X/Open TX standard and/or JTS (Java Transaction Service) specification (I
haven't had a chance to read them yet, though.)  Just in case we can't find the requested commit behavior in the volcano
case from those specifications, ... (I'm hesitant to say this because it may be hard,) it's desirable to follow
representative products such as Tuxedo and GlassFish (the reference implementation of Java EE specs.) 

Honestly, I am not quite sure what any specification has to say about
this. We're talking about what happens when a user does something with
a foreign table and then type COMMIT. That's all about providing a set
of behaviors that are consistent with how PostgreSQL works in other
situations. You can't negotiate away the requirement to handle errors
in a way that works with PostgreSQL's infrastructure, or the
requirement that any length operation handle interrupts properly, by
appealing to a specification.

> Concurrent transactions are serialized at the resolver.  I heard that the current patch handles 2PC like this: the TM
(transaction manager in Postgres core) requests prepare to the resolver, the resolver sends prepare to the remote server
and waits for the reply, the TM gets back control from the resolver, the TM requests commit to the resolver, the resolver sends
commit to the remote server and waits for the reply, and the TM gets back control.  The resolver handles one transaction at a
time.

That sounds more like a limitation of the present implementation than
a fundamental problem. We shouldn't reject the idea of having a
resolver process handle this just because the initial implementation
might be slow. If there's no fundamental problem with the idea,
parallelism and concurrency can be improved in separate patches at a
later time. It's much more important at this stage to reject ideas
that are not theoretically sound.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Robert Haas <robertmhaas@gmail.com>
> Well, we're talking about running this commit routine from within
> CommitTransaction(), right? So I think it is in fact running in the
> server. And if that's so, then you have to worry about how to make it
> respond to interrupts. You can't just call some functions
> DBMSX_xa_commit() and wait for infinite time for it to return. Look at
> pgfdw_get_result() for an example of what real code to do this looks
> like.

Postgres can do that, but other implementations cannot necessarily do it, I'm afraid.  But before that, the FDW
interface documentation doesn't describe anything about how to handle interrupts.  Actually, odbc_fdw and possibly other
FDWs don't respond to interrupts.
 


> Honestly, I am not quite sure what any specification has to say about
> this. We're talking about what happens when a user does something with
> a foreign table and then type COMMIT. That's all about providing a set
> of behaviors that are consistent with how PostgreSQL works in other
> situations. You can't negotiate away the requirement to handle errors
> in a way that works with PostgreSQL's infrastructure, or the
> requirement that any length operation handle interrupts properly, by
> appealing to a specification.

What we're talking about here is mainly whether commit should return success or failure when some participants failed to
commit in the second phase of 2PC.  That's new to Postgres, isn't it?  Anyway, we should respect existing relevant
specifications and (well-known) implementations before we conclude that we have to devise our own behavior.
 


> That sounds more like a limitation of the present implementation than
> a fundamental problem. We shouldn't reject the idea of having a
> resolver process handle this just because the initial implementation
> might be slow. If there's no fundamental problem with the idea,
> parallelism and concurrency can be improved in separate patches at a
> later time. It's much more important at this stage to reject ideas
> that are not theoretically sound.

We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that
involves the resolver from the beginning of 2PC processing.  Honestly, I don't understand why such a "one prepare, one
commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue the best performance and even
try to refrain from adding an if statement in a hot path.  As I showed and Ikeda-san said, other implementations have
each client session send prepare and commit requests.  That's a natural way to achieve reasonable concurrency and
performance.


Regards
Takayuki Tsunakawa


Re: Transactions involving multiple postgres foreign servers, take 2

From
Robert Haas
Date:
On Tue, Jun 15, 2021 at 5:51 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
> Postgres can do that, but other implementations cannot necessarily do it, I'm afraid.  But before that, the FDW
interface documentation doesn't describe anything about how to handle interrupts.  Actually, odbc_fdw and possibly other
FDWs don't respond to interrupts. 

Well, I'd consider that a bug.

> What we're talking about here is mainly whether commit should return success or failure when some participants failed to
commit in the second phase of 2PC.  That's new to Postgres, isn't it?  Anyway, we should respect existing relevant
specifications and (well-known) implementations before we conclude that we have to devise our own behavior. 

Sure ... but we can only decide to do things that the implementation
can support, and running code that might fail after we've committed
locally isn't one of them.

> We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that
involves the resolver from the beginning of 2PC processing.  Honestly, I don't understand why such a "one prepare, one
commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue the best performance and even
try to refrain from adding an if statement in a hot path.  As I showed and Ikeda-san said, other implementations have
each client session send prepare and commit requests.  That's a natural way to achieve reasonable concurrency and
performance.

I think your comparison here is quite unfair. We work hard to add
overhead in hot paths where it might cost, but the FDW case involves a
network round-trip anyway, so the cost of an if-statement would surely
be insignificant. I feel like you want to assume without any evidence
that a local resolver can never be quick enough, even though the cost
of IPC between local processes shouldn't be that high compared to a
network round trip. But you also want to suppose that we can run code
that might fail late in the commit process even though there is lots
of evidence that this will cause problems, starting with the code
comments that clearly say so.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: Transactions involving multiple postgres foreign servers, take 2

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Robert Haas <robertmhaas@gmail.com>
> On Tue, Jun 15, 2021 at 5:51 AM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> > Postgres can do that, but other implementations can not necessaily do it, I'm
> afraid.  But before that, the FDW interface documentation doesn't describe
> anything about how to handle interrupts.  Actually, odbc_fdw and possibly
> other FDWs don't respond to interrupts.
> 
> Well, I'd consider that a bug.

I kind of hesitate to call it a bug...  Unlike libpq, JDBC (for jdbc_fdw) doesn't have an asynchronous interface, and
the Oracle and PostgreSQL ODBC drivers don't support an asynchronous interface.  Even with libpq, COMMIT (and other SQL
commands) is not always cancellable, e.g., when the (NFS) storage server hangs while writing WAL.
 


> > What we're talking here is mainly whether commit should return success or
> failure when some participants failed to commit in the second phase of 2PC.
> That's new to Postgres, isn't it?  Anyway, we should respect existing relevant
> specifications and (well-known) implementations before we conclude that we
> have to devise our own behavior.
> 
> Sure ... but we can only decide to do things that the implementation
> can support, and running code that might fail after we've committed
> locally isn't one of them.

Yes, I understand that Postgres may not be able to conform to specifications or well-known implementations in all
aspects.  I'm just suggesting to take the stance "We carefully considered established industry specifications that we
can base on, did our best to design the desirable behavior learned from them, but couldn't implement a few parts",
rather than "We did what we like and can do."
 


> I think your comparison here is quite unfair. We work hard to add
> overhead in hot paths where it might cost, but the FDW case involves a
> network round-trip anyway, so the cost of an if-statement would surely
> be insignificant. I feel like you want to assume without any evidence
> that a local resolver can never be quick enough, even though the cost
> of IPC between local processes shouldn't be that high compared to a
> network round trip. But you also want to suppose that we can run code
> that might fail late in the commit process even though there is lots
> of evidence that this will cause problems, starting with the code
> comments that clearly say so.

There may be better examples.  What I wanted to say is just that I believe it's not PG developers' standard to allow
serial prepare and commit.  Let's make it clear what makes it difficult to do 2PC from each client session in normal
operation without going through the resolver.
 


Regards
Takayuki Tsunakawa


RE: Transactions involving multiple postgres foreign servers, take 2

From
"k.jamison@fujitsu.com"
Date:
Hi Sawada-san,

I also tried to play a bit with the latest patches similar to Ikeda-san,
and with foreign 2PC parameter enabled/required.

> > >> b. about performance bottleneck (just share my simple benchmark
> > >> results)
> > >>
> > >> The resolver process can be performance bottleneck easily although
> > >> I think some users want this feature even if the performance is not so
> good.
> > >>
> > >> I tested with very simple workload in my laptop.
> > >>
> > >> The test condition is
> > >> * two remote foreign partitions and one transaction inserts an
> > >> entry in each partitions.
> > >> * local connection only. If NW latency became higher, the
> > >> performance became worse.
> > >> * pgbench with 8 clients.
> > >>
> > >> The test results is the following. The performance of 2PC is only
> > >> 10% performance of the one of without 2PC.
> > >>
> > >> * with foreign_twophase_commit = requried
> > >> -> If load with more than 10TPS, the number of unresolved foreign
> > >> -> transactions
> > >> is increasing and stop with the warning "Increase
> > >> max_prepared_foreign_transactions".
> > >
> > > What was the value of max_prepared_foreign_transactions?
> >
> > Now, I tested with 200.
> >
> > If each resolution is finished very soon, I thought it's enough
> > because 8clients x 2partitions = 16, though... But, it's difficult how
> > to know the stable values.
> 
> During resolving one distributed transaction, the resolver needs both one
> round trip and fsync-ing WAL record for each foreign transaction.
> Since the client doesn’t wait for the distributed transaction to be resolved,
> the resolver process can be easily bottle-neck given there are 8 clients.
> 
> If foreign transaction resolution was resolved synchronously, 16 would
> suffice.


I tested the V36 patches on my 16-core machines.
I setup two foreign servers (F1, F2) .
F1 has addressbook table.
F2 has pgbench tables (scale factor = 1).
There is also 1 coordinator (coor) server where I created user mapping to access the foreign servers.
I executed the benchmark measurement on coordinator.
My custom scripts are setup in a way that queries from coordinator
would have to access the two foreign servers.

Coordinator:
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 1
foreign_twophase_commit = required

Other external servers 1 & 2 (F1 & F2):
max_prepared_transactions = 100


[select.sql]
\set int random(1, 100000)
BEGIN;
SELECT ad.name, ad.age, ac.abalance
FROM addressbook ad, pgbench_accounts ac
WHERE ad.id = :int AND ad.id = ac.aid;
COMMIT;

I then executed:
pgbench -r -c 2 -j 2 -T 60 -f select.sql coor

While there were no problems with 1-2 clients, I started having problems
when running the benchmark with more than 3 clients.

pgbench -r -c 4 -j 4 -T 60 -f select.sql coor

I got the following error on coordinator:

[95396] ERROR:  could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
[95396] STATEMENT:  COMMIT;
WARNING:  there is no transaction in progress
pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR:  could not prepare transaction on server F2 with
ID fx_151455979_1216200_16422
 

Here's the log on foreign server 2 <F2> matching the above error:
<F2> LOG:  statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422'
<F2> ERROR:  maximum number of prepared transactions reached
<F2> HINT:  Increase max_prepared_transactions (currently 100).
<F2> STATEMENT:  PREPARE TRANSACTION 'fx_151455979_1216200_16422'

So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200.
Then I got the error:

[146926] ERROR:  maximum number of foreign transactions reached
[146926] HINT:  Increase max_prepared_foreign_transactions: "200".

So I increased the max_prepared_foreign_transactions to "300",
and got the same error of need to increase the max_prepared_transactions of foreign servers.

I just can't find the right tuning values for this.
It seems that we always run out of memory in FdwXactState insert_fdwxact 
with multiple concurrent connections during PREPARE TRANSACTION.
This one I only encountered for SELECT benchmark. 
Although I've got no problems with multiple connections for my custom scripts for
UPDATE and INSERT benchmarks when I tested up to 30 clients.

Would the following possibly solve this bottleneck problem?

> > > To speed up the foreign transaction resolution, some ideas have been
> > > discussed. As another idea, how about launching resolvers for each
> > > foreign server? That way, we resolve foreign transactions on each
> > > foreign server in parallel. If foreign transactions are concentrated
> > > on the particular server, we can have multiple resolvers for the one
> > > foreign server. It doesn’t change the fact that all foreign
> > > transaction resolutions are processed by resolver processes.
> >
> > Awesome! There seems to be another pros that even if a foreign server
> > is temporarily busy or stopped due to fail over, other foreign
> > server's transactions can be resolved.
> 
> Yes. We also might need to be careful about the order of foreign transaction
> resolution. I think we need to resolve foreign transactions in arrival order at
> least within a foreign server.

Regards,
Kirk Jamison


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2021/05/11 13:37, Masahiko Sawada wrote:
> > I've attached the updated patches that incorporated comments from
> > Zhihong and Ikeda-san.
>
> Thanks for updating the patches!
>
> I'm still reading these patches, but I'd like to share some review comments
> that I found so far.

Thank you for the comments!

>
> (1)
> +/* Remove the foreign transaction from FdwXactParticipants */
> +void
> +FdwXactUnregisterXact(UserMapping *usermapping)
> +{
> +       Assert(IsTransactionState());
> +       RemoveFdwXactEntry(usermapping->umid);
> +}
>
> Currently there is no user of FdwXactUnregisterXact().
> This function should be removed?

I think that this function can be used by other FDW implementations
to unregister a foreign transaction entry, although there is no use case
in postgres_fdw. This function corresponds to xa_unreg in the XA
specification.

>
>
> (2)
> When I ran the regression test, I got the following failure.
>
> ========= Contents of ./src/test/modules/test_fdwxact/regression.diffs
> diff -U3 /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out
> --- /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out     2021-06-10 02:19:43.808622747+0000
> +++ /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out      2021-06-10 02:29:53.452410462+0000
> @@ -174,7 +174,7 @@
>   SELECT count(*) FROM pg_foreign_xacts;
>    count
>   -------
> -     1
> +     4
>   (1 row)

Will fix.

>
>
> (3)
> +                                errmsg("could not read foreign transaction state from xlog at %X/%X",
> +                                               (uint32) (lsn >> 32),
> +                                               (uint32) lsn)));
>
> LSN_FORMAT_ARGS() should be used?

Agreed.

>
>
> (4)
> +extern void RecreateFdwXactFile(TransactionId xid, Oid umid, void *content,
> +                                                               int len);
>
> Since RecreateFdwXactFile() is used only in fdwxact.c,
> the above "extern" is not necessary?

Right.

>
>
> (5)
> +2. Pre-Commit phase (1st phase of two-phase commit)
> +we record the corresponding WAL indicating that the foreign server is involved
> +with the current transaction before doing PREPARE on all foreign transactions.
> +Thus, in case we lose connectivity to the foreign server or crash ourselves,
> +we will remember that we might have prepared a transaction on the foreign
> +server, and try to resolve it when connectivity is restored or after crash
> +recovery.
>
> So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for
> XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there
> for WAL record to be replicated to the standby if sync replication is enabled?
> Otherwise, when the failover happens, new primary (past-standby)
> might not have enough XLOG_FDWXACT_INSERT WAL records and
> might fail to find some in-doubt foreign transactions.

But even if we wait for the record to be replicated, this problem
isn't completely resolved, right? If the server crashes before the
standby receives the record and the failover happens, then the new
master doesn't have the record. I wonder if we need to have another
FDW API in order to get the list of prepared transactions from the
foreign server (FDW). For example in postgres_fdw case, it gets the
list of prepared transactions on the foreign server by executing a
query. It seems to me that this corresponds to xa_recover in the XA
specification.
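
To make the idea concrete, here is a minimal sketch (the function and variable names are invented, not the patch's actual API) of what a coordinator could do with an xa_recover-style callback after recovery: ask the foreign server for its list of prepared transaction IDs (in postgres_fdw's case, e.g. by querying the server) and classify the in-doubt transactions against the locally logged entries.

```python
# Hypothetical sketch, not the patch's API: reconciling in-doubt foreign
# transactions after a failover, assuming each FDW can report the IDs of
# transactions currently prepared on its server (xa_recover-style).

def reconcile_in_doubt(local_fdwxact_entries, remote_prepared_ids):
    """Classify prepared transactions on one foreign server.

    local_fdwxact_entries: set of gids the coordinator logged in WAL
    remote_prepared_ids:   set of gids actually prepared on the foreign server
    """
    # Prepared remotely and known locally: resolve per the local commit/abort decision.
    to_resolve = local_fdwxact_entries & remote_prepared_ids
    # Known locally but never prepared remotely (crash before PREPARE): just forget.
    to_forget = local_fdwxact_entries - remote_prepared_ids
    # Prepared remotely but unknown locally: only detectable via the
    # xa_recover-style API, e.g. when the coordinator's WAL record was lost.
    orphaned = remote_prepared_ids - local_fdwxact_entries
    return to_resolve, to_forget, orphaned
```

The third set is exactly the case a missing XLOG_FDWXACT_INSERT record would otherwise leave invisible to the new primary.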

> (6)
> XLogFlush() is called for each foreign transaction. So if there are many
> foreign transactions, XLogFlush() is called too frequently. Which might
> cause unnecessary performance overhead? Instead, for example,
> we should call XLogFlush() only at once in FdwXactPrepareForeignTransactions()
> after inserting all WAL records for all foreign transactions?

Agreed.
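
As a rough illustration of comment (6), here is a toy model (invented names, no real WAL machinery) contrasting one flush per inserted record with a single flush after the whole batch; only the number of flush calls is modeled.

```python
# Toy model of batching the WAL flush in FdwXactPrepareForeignTransactions().

class WalLog:
    def __init__(self):
        self.records = []
        self.flush_calls = 0
    def insert(self, rec):
        self.records.append(rec)
    def flush(self):
        self.flush_calls += 1

def prepare_per_record_flush(wal, fxacts):
    for fx in fxacts:
        wal.insert(("FDWXACT_INSERT", fx))
        wal.flush()            # current behavior: one flush per record

def prepare_batched_flush(wal, fxacts):
    for fx in fxacts:
        wal.insert(("FDWXACT_INSERT", fx))
    wal.flush()                # suggested behavior: one flush for the batch
```

With N foreign transactions the batched variant issues one flush instead of N, which is where the overhead reduction comes from.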

>
>
> (7)
>         /* Open connection; report that we'll create a prepared statement. */
>         fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
> +       MarkConnectionModified(user);
>
> MarkConnectionModified() should be called also when TRUNCATE on
> a foreign table is executed?

Good catch. Will fix.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Jun 24, 2021 at 9:46 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> Hi Sawada-san,
>
> I also tried to play a bit with the latest patches similar to Ikeda-san,
> and with foreign 2PC parameter enabled/required.

Thank you for testing the patch!

>
> > > >> b. about performance bottleneck (just share my simple benchmark
> > > >> results)
> > > >>
> > > >> The resolver process can be performance bottleneck easily although
> > > >> I think some users want this feature even if the performance is not so
> > good.
> > > >>
> > > >> I tested with very simple workload in my laptop.
> > > >>
> > > >> The test condition is
> > > >> * two remote foreign partitions and one transaction inserts an
> > > >> entry in each partitions.
> > > >> * local connection only. If NW latency became higher, the
> > > >> performance became worse.
> > > >> * pgbench with 8 clients.
> > > >>
> > > >> The test results is the following. The performance of 2PC is only
> > > >> 10% performance of the one of without 2PC.
> > > >>
> > > >> * with foreign_twophase_commit = required
> > > >> -> If loaded with more than 10 TPS, the number of unresolved foreign
> > > >> transactions keeps increasing, and the run stops with the warning
> > > >> "Increase max_prepared_foreign_transactions".
> > > >
> > > > What was the value of max_prepared_foreign_transactions?
> > >
> > > Now, I tested with 200.
> > >
> > > If each resolution is finished very soon, I thought it's enough
> > > because 8clients x 2partitions = 16, though... But, it's difficult how
> > > to know the stable values.
> >
> > During resolving one distributed transaction, the resolver needs both one
> > round trip and fsync-ing WAL record for each foreign transaction.
> > Since the client doesn’t wait for the distributed transaction to be resolved,
> > the resolver process can be easily bottle-neck given there are 8 clients.
> >
> > If foreign transaction resolution was resolved synchronously, 16 would
> > suffice.
>
>
> I tested the V36 patches on my 16-core machines.
> I setup two foreign servers (F1, F2) .
> F1 has addressbook table.
> F2 has pgbench tables (scale factor = 1).
> There is also 1 coordinator (coor) server where I created user mapping to access the foreign servers.
> I executed the benchmark measurement on coordinator.
> My custom scripts are setup in a way that queries from coordinator
> would have to access the two foreign servers.
>
> Coordinator:
> max_prepared_foreign_transactions = 200
> max_foreign_transaction_resolvers = 1
> foreign_twophase_commit = required
>
> Other external servers 1 & 2 (F1 & F2):
> max_prepared_transactions = 100
>
>
> [select.sql]
> \set int random(1, 100000)
> BEGIN;
> SELECT ad.name, ad.age, ac.abalance
> FROM addressbook ad, pgbench_accounts ac
> WHERE ad.id = :int AND ad.id = ac.aid;
> COMMIT;
>
> I then executed:
> pgbench -r -c 2 -j 2 -T 60 -f select.sql coor
>
> While there were no problems with 1-2 clients, I started having problems
> when running the benchmark with more than 3 clients.
>
> pgbench -r -c 4 -j 4 -T 60 -f select.sql coor
>
> I got the following error on coordinator:
>
> [95396] ERROR:  could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
> [95396] STATEMENT:  COMMIT;
> WARNING:  there is no transaction in progress
> pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR:  could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
>
> Here's the log on foreign server 2 <F2> matching the above error:
> <F2> LOG:  statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422'
> <F2> ERROR:  maximum number of prepared transactions reached
> <F2> HINT:  Increase max_prepared_transactions (currently 100).
> <F2> STATEMENT:  PREPARE TRANSACTION 'fx_151455979_1216200_16422'
>
> So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200.
> Then I got the error:
>
> [146926] ERROR:  maximum number of foreign transactions reached
> [146926] HINT:  Increase max_prepared_foreign_transactions: "200".
>
> So I increased the max_prepared_foreign_transactions to "300",
> and got the same error of need to increase the max_prepared_transactions of foreign servers.
>
> I just can't find the right tuning values for this.
> It seems that we always run out of memory in FdwXactState insert_fdwxact
> with multiple concurrent connections during PREPARE TRANSACTION.
> This one I only encountered for SELECT benchmark.
> Although I've got no problems with multiple connections for my custom scripts for
> UPDATE and INSERT benchmarks when I tested up to 30 clients.
>
> Would the following possibly solve this bottleneck problem?

With the following idea, the performance will get better, but the
problem will not be completely solved. The results shared by you and
Ikeda-san come from the fact that with the patch we asynchronously
commit the foreign prepared transactions (i.e., asynchronously
perform the second phase of 2PC), not from the architecture itself. As I
mentioned before, I intentionally removed the synchronous committing
of foreign prepared transactions from the patch set since we still
need to have a discussion of that part. Therefore, with this version of
the patch, the backend returns OK to the client right after the local
transaction commits, neither committing foreign prepared
transactions by itself nor waiting for those to be committed by the
resolver process. As long as the backend doesn’t wait for foreign
prepared transactions to be committed and there is a limit on the
number of foreign prepared transactions that can be held, it can reach the
upper bound if committing foreign prepared transactions cannot keep
up.
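
To make the backlog effect visible, here is a toy simulation (all numbers and names invented, no relation to the real shared-memory code): backends enqueue prepared foreign transactions without waiting, a resolver drains them at a fixed rate, and once the bounded slot array is full further commits fail, mirroring the "maximum number of foreign transactions reached" error.

```python
# Toy model of the asynchronous-resolution backlog described above.

def simulate_backlog(commit_per_tick, resolve_per_tick, max_slots, ticks):
    """Return the steady-state backlog, or an error string if the slots fill up."""
    pending = 0
    for _ in range(ticks):
        pending = max(0, pending - resolve_per_tick)  # resolver drains first
        if pending + commit_per_tick > max_slots:
            return "error: maximum number of foreign transactions reached"
        pending += commit_per_tick                    # backends don't wait
    return pending
```

When the resolver keeps up, the backlog stays bounded regardless of max_slots; when it does not, the error is reached eventually no matter how large max_prepared_foreign_transactions is set, which matches the tuning difficulty Jamison-san observed.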

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Thu, Jun 24, 2021 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >
> >
> >
> > (5)
> > +2. Pre-Commit phase (1st phase of two-phase commit)
> > +we record the corresponding WAL indicating that the foreign server is involved
> > +with the current transaction before doing PREPARE on all foreign transactions.
> > +Thus, in case we lose connectivity to the foreign server or crash ourselves,
> > +we will remember that we might have prepared a transaction on the foreign
> > +server, and try to resolve it when connectivity is restored or after crash
> > +recovery.
> >
> > So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for
> > XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there
> > for WAL record to be replicated to the standby if sync replication is enabled?
> > Otherwise, when the failover happens, new primary (past-standby)
> > might not have enough XLOG_FDWXACT_INSERT WAL records and
> > might fail to find some in-doubt foreign transactions.
>
> But even if we wait for the record to be replicated, this problem
> isn't completely resolved, right?

Ah, I misunderstood the order of writing WAL records and preparing
foreign transactions. You're right. Combining your suggestion below,
perhaps we need to write all the WAL records, call XLogFlush(), wait for
those records to be replicated, and then prepare all foreign transactions.
Even if the server crashes while preparing a foreign transaction and the
failover happens, the new master will have all the foreign transaction
entries. Some of them might not actually be prepared on the foreign
servers, but that should not be a problem.
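
The ordering can be sketched like this (step names are illustrative, not the patch's actual function names); the point is that every remote PREPARE happens only after the batch of WAL records is flushed and replicated.

```python
# Sketch of the agreed ordering: log all participants, flush once, wait for
# sync replication, then issue PREPARE on each foreign server.

def prepare_foreign_transactions(participants, log):
    for p in participants:
        log.append(("wal_insert", p))       # one XLOG_FDWXACT_INSERT per participant
    log.append(("wal_flush",))              # a single XLogFlush for the batch
    log.append(("wait_sync_replication",))  # standby now has every entry
    for p in participants:
        log.append(("remote_prepare", p))   # a crash here leaves entries the
                                            # new primary can still resolve
    return log
```

A crash after the flush but before some remote PREPAREs only leaves extra entries to clean up, never an unknown in-doubt transaction.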

> > (6)
> > XLogFlush() is called for each foreign transaction. So if there are many
> > foreign transactions, XLogFlush() is called too frequently. Which might
> > cause unnecessary performance overhead? Instead, for example,
> > we should call XLogFlush() only at once in FdwXactPrepareForeignTransactions()
> > after inserting all WAL records for all foreign transactions?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/06/24 22:27, Masahiko Sawada wrote:
> On Thu, Jun 24, 2021 at 9:46 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
>>
>> Hi Sawada-san,
>>
>> I also tried to play a bit with the latest patches similar to Ikeda-san,
>> and with foreign 2PC parameter enabled/required.
> 
> Thank you for testing the patch!
> 
>>
>>>>>> b. about performance bottleneck (just share my simple benchmark
>>>>>> results)
>>>>>>
>>>>>> The resolver process can be performance bottleneck easily although
>>>>>> I think some users want this feature even if the performance is not so
>>> good.
>>>>>>
>>>>>> I tested with very simple workload in my laptop.
>>>>>>
>>>>>> The test condition is
>>>>>> * two remote foreign partitions and one transaction inserts an
>>>>>> entry in each partitions.
>>>>>> * local connection only. If NW latency became higher, the
>>>>>> performance became worse.
>>>>>> * pgbench with 8 clients.
>>>>>>
>>>>>> The test results is the following. The performance of 2PC is only
>>>>>> 10% performance of the one of without 2PC.
>>>>>>
>>>>>> * with foreign_twophase_commit = required
>>>>>> -> If loaded with more than 10 TPS, the number of unresolved foreign
>>>>>> transactions keeps increasing, and the run stops with the warning
>>>>>> "Increase max_prepared_foreign_transactions".
>>>>>
>>>>> What was the value of max_prepared_foreign_transactions?
>>>>
>>>> Now, I tested with 200.
>>>>
>>>> If each resolution is finished very soon, I thought it's enough
>>>> because 8clients x 2partitions = 16, though... But, it's difficult how
>>>> to know the stable values.
>>>
>>> During resolving one distributed transaction, the resolver needs both one
>>> round trip and fsync-ing WAL record for each foreign transaction.
>>> Since the client doesn’t wait for the distributed transaction to be resolved,
>>> the resolver process can be easily bottle-neck given there are 8 clients.
>>>
>>> If foreign transaction resolution was resolved synchronously, 16 would
>>> suffice.
>>
>>
>> I tested the V36 patches on my 16-core machines.
>> I setup two foreign servers (F1, F2) .
>> F1 has addressbook table.
>> F2 has pgbench tables (scale factor = 1).
>> There is also 1 coordinator (coor) server where I created user mapping to access the foreign servers.
>> I executed the benchmark measurement on coordinator.
>> My custom scripts are setup in a way that queries from coordinator
>> would have to access the two foreign servers.
>>
>> Coordinator:
>> max_prepared_foreign_transactions = 200
>> max_foreign_transaction_resolvers = 1
>> foreign_twophase_commit = required
>>
>> Other external servers 1 & 2 (F1 & F2):
>> max_prepared_transactions = 100
>>
>>
>> [select.sql]
>> \set int random(1, 100000)
>> BEGIN;
>> SELECT ad.name, ad.age, ac.abalance
>> FROM addressbook ad, pgbench_accounts ac
>> WHERE ad.id = :int AND ad.id = ac.aid;
>> COMMIT;
>>
>> I then executed:
>> pgbench -r -c 2 -j 2 -T 60 -f select.sql coor
>>
>> While there were no problems with 1-2 clients, I started having problems
>> when running the benchmark with more than 3 clients.
>>
>> pgbench -r -c 4 -j 4 -T 60 -f select.sql coor
>>
>> I got the following error on coordinator:
>>
>> [95396] ERROR:  could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
>> [95396] STATEMENT:  COMMIT;
>> WARNING:  there is no transaction in progress
>> pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR:  could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
>>
>> Here's the log on foreign server 2 <F2> matching the above error:
>> <F2> LOG:  statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422'
>> <F2> ERROR:  maximum number of prepared transactions reached
>> <F2> HINT:  Increase max_prepared_transactions (currently 100).
>> <F2> STATEMENT:  PREPARE TRANSACTION 'fx_151455979_1216200_16422'
>>
>> So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200.
>> Then I got the error:
>>
>> [146926] ERROR:  maximum number of foreign transactions reached
>> [146926] HINT:  Increase max_prepared_foreign_transactions: "200".
>>
>> So I increased the max_prepared_foreign_transactions to "300",
>> and got the same error of need to increase the max_prepared_transactions of foreign servers.
>>
>> I just can't find the right tuning values for this.
>> It seems that we always run out of memory in FdwXactState insert_fdwxact
>> with multiple concurrent connections during PREPARE TRANSACTION.
>> This one I only encountered for SELECT benchmark.
>> Although I've got no problems with multiple connections for my custom scripts for
>> UPDATE and INSERT benchmarks when I tested up to 30 clients.
>>
>> Would the following possibly solve this bottleneck problem?
> 
> With the following idea, the performance will get better but will not
> be completely solved. Because those results shared by you and
> Ikeda-san come from the fact that with the patch we asynchronously
> commit the foreign prepared transaction (i.e., asynchronously
> performing the second phase of 2PC), but not the architecture. As I
> mentioned before, I intentionally removed the synchronous committing
> foreign prepared transaction part from the patch set since we still
> need to have a discussion of that part. Therefore, with this version
> patch, the backend returns OK to the client right after the local
> transaction commits with neither committing foreign prepared
> transactions by itself nor waiting for those to be committed by the
> resolver process.  As long as the backend doesn’t wait for foreign
> prepared transactions to be committed and there is a limit of the
> number of foreign prepared transactions to be held, it could reach the
> upper bound if committing foreign prepared transactions cannot keep
> up.

Hi Jamison-san, sawada-san,

Thanks for testing!

FWIW, I tested using pgbench with the "--rate=" option to check whether the server
can execute transactions with stable throughput. As Sawada-san said,
the latest patch resolves the second phase of 2PC asynchronously, so
it's difficult to keep the throughput stable without the "--rate=" option.

I was also unsure what to do when the error happened, because increasing
"max_prepared_foreign_transactions" doesn't help. Since overloading alone can
trigger the error, would it be better to mention that case in the HINT message?


BTW, if Sawada-san has already developed a way to run the resolver processes in
parallel, why don't you measure the performance improvement? Although Robert-san,
Tsunakawa-san and others are discussing which architecture is best, one
discussion point is that there is a performance risk in adopting the asynchronous
approach. If we have promising solutions, I think we can move the discussion
forward.

In my understanding, there are three improvement ideas. The first is to make
the resolver processes run in parallel. The second is to send "COMMIT/ABORT
PREPARED" to remote servers in bulk. The third is to stop syncing the WAL in
remove_fdwxact() after resolving is done, which I addressed in the mail sent
on June 3rd, 13:56. Since the third idea has not yet been discussed, I may
be misunderstanding something.

-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/06/24 22:11, Masahiko Sawada wrote:
> On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> On 2021/05/11 13:37, Masahiko Sawada wrote:
>> So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for
>> XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there
>> for WAL record to be replicated to the standby if sync replication is enabled?
>> Otherwise, when the failover happens, new primary (past-standby)
>> might not have enough XLOG_FDWXACT_INSERT WAL records and
>> might fail to find some in-doubt foreign transactions.
> 
> But even if we wait for the record to be replicated, this problem
> isn't completely resolved, right? If the server crashes before the
> standby receives the record and the failover happens then the new
> master doesn't have the record. I wonder if we need to have another
> FDW API in order to get the list of prepared transactions from the
> foreign server (FDW). For example in postgres_fdw case, it gets the
> list of prepared transactions on the foreign server by executing a
> query. It seems to me that this corresponds to xa_recover in the XA
> specification.

FWIW, Citus is implemented as Sawada-san described above [1].

Since each WAL record for PREPARE is flushed in the latest patch, the latency
becomes too high, especially under synchronous replication. For example, a
transaction involving three foreign servers must wait to sync "three" WAL
records for PREPARE plus "one" WAL record for the local commit, one remote
server at a time, sequentially. So I think Sawada-san's idea is a good way to
improve the latency, although it increases the FDW developers' work.

[1]
SIGMOD 2021 525 Citus: Distributed PostgreSQL for Data Intensive Applications
From 12:27 says that how to solve unresolved prepared xacts.
https://www.youtube.com/watch?v=AlF4C60FdlQ&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> Hi Jamison-san, sawada-san,
>
> Thanks for testing!
>
> FWIW, I tested using pgbench with the "--rate=" option to check whether the server
> can execute transactions with stable throughput. As Sawada-san said,
> the latest patch resolves the second phase of 2PC asynchronously, so
> it's difficult to keep the throughput stable without the "--rate=" option.
>
> I was also unsure what to do when the error happened, because increasing
> "max_prepared_foreign_transactions" doesn't help. Since overloading alone can
> trigger the error, would it be better to mention that case in the HINT message?
>
> BTW, if sawada-san already develop to run the resolver processes in parallel,
> why don't you measure performance improvement? Although Robert-san,
> Tunakawa-san and so on are discussing what architecture is best, one
> discussion point is that there is a performance risk if adopting asynchronous
> approach. If we have promising solutions, I think we can make the discussion
> forward.

Yeah, if we can asynchronously resolve the distributed transactions
without worrying about the max_prepared_foreign_transactions error, it
would be good. But we will need synchronous resolution at some point,
and I think we at least need to discuss it at this stage.

I've attached the new version patch that incorporates the comments
from Fujii-san and Ikeda-san I have got so far. We launch a resolver
process per foreign server, committing prepared foreign transactions
on foreign servers in parallel. To get better performance based on
the current architecture, we could have multiple resolver processes per
foreign server, but that seems not easy to tune in practice. Perhaps
it is better to simply have a pool of resolver processes and
assign a resolver process to the resolution of one distributed
transaction at a time? That way, we need to launch as many resolver
processes as there are concurrent backends using 2PC.
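
The two layouts can be contrasted with a small scheduling sketch (purely illustrative, invented names): one queue per foreign server versus a fixed pool that picks up whole distributed transactions round-robin.

```python
# Sketch contrasting per-server resolvers with a shared resolver pool.
from collections import defaultdict

def per_server_assignment(fxacts):
    """fxacts: list of (gid, server). One queue (resolver) per foreign server."""
    queues = defaultdict(list)
    for gid, server in fxacts:
        queues[server].append(gid)
    return dict(queues)

def pooled_assignment(fxacts, pool_size):
    """Round-robin whole transactions over a fixed pool of resolvers."""
    queues = [[] for _ in range(pool_size)]
    for i, (gid, _server) in enumerate(fxacts):
        queues[i % pool_size].append(gid)
    return queues
```

With the per-server layout, a skewed workload concentrates work on one resolver; with the pool, sizing follows the number of concurrent 2PC backends rather than the number of servers.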

> In my understanding, there are three improvement idea. First is that to make
> the resolver processes run in parallel. Second is that to send "COMMIT/ABORT
> PREPARED" remote servers in bulk. Third is to stop syncing the WAL
> remove_fdwxact() after resolving is done, which I addressed in the mail sent
> at June 3rd, 13:56. Since third idea is not yet discussed, there may
> be my misunderstanding.

Yes, those optimizations are promising. On the other hand, they could
introduce complexity to the code and APIs. I'd like to keep the first
version simple. I think we need to discuss them at this stage but can
leave the implementation of both parallel execution and batch
execution as future improvements.

For the third idea, I think the implementation was wrong; it removes
the state file then flushes the WAL record. I think these should be
performed in the reverse order. Otherwise, FdwXactState entry could be
left on the standby if the server crashes between them. I might be
missing something though.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiro Ikeda
Date:

On 2021/06/30 10:05, Masahiko Sawada wrote:
> On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>>
>> Hi Jamison-san, sawada-san,
>>
>> Thanks for testing!
>>
>> FWIW, I tested using pgbench with "--rate=" option to know the server
>> can execute transactions with stable throughput. As sawada-san said,
>> the latest patch resolved second phase of 2PC asynchronously. So,
>> it's difficult to control the stable throughput without "--rate=" option.
>>
>> I also worried what I should do when the error happened because to increase
>> "max_prepared_foreign_transaction" doesn't work. Since too overloading may
>> show the error, is it better to add the case to the HINT message?
>>
>> BTW, if sawada-san already develop to run the resolver processes in parallel,
>> why don't you measure performance improvement? Although Robert-san,
>> Tunakawa-san and so on are discussing what architecture is best, one
>> discussion point is that there is a performance risk if adopting asynchronous
>> approach. If we have promising solutions, I think we can make the discussion
>> forward.
> 
> Yeah, if we can asynchronously resolve the distributed transactions
> without worrying about max_prepared_foreign_transaction error, it
> would be good. But we will need synchronous resolution at some point.
> I think we at least need to discuss it at this point.
> 
> I've attached the new version patch that incorporates the comments
> from Fujii-san and Ikeda-san I got so far. We launch a resolver
> process per foreign server, committing prepared foreign transactions
> on foreign servers in parallel. To get a better performance based on
> the current architecture, we can have multiple resolver processes per
> foreign server but it seems not easy to tune it in practice. Perhaps
> is it better if we simply have a pool of resolver processes and we
> assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes
> as many as the concurrent backends using 2PC.

Thanks for updating the patches.

I have tested on my local laptop and the summary is the following.

(1) The latest patch (v37) improves throughput by 1.5 times compared to v36.

Although I expected it to improve by 2.0 times because the workload is one
transaction accessing two remote servers... I think the reason is that the disk
is the bottleneck and I couldn't prepare a separate disk for each PostgreSQL
server. If I could, I think the performance could be improved by 2.0 times.


(2) With the latest patch (v37), the throughput with foreign_twophase_commit =
required is about 36% of the case where foreign_twophase_commit = disabled.

Although the throughput is improved, the absolute performance is not good. It
may be the fate of 2PC. I think the reason is that the number of WAL writes
increases greatly and the disk writes on my laptop are the bottleneck. I would
like to know the results of testing in richer environments if someone can do so.


(3) The latest patch (v37) has no overhead if foreign_twophase_commit =
disabled. On the contrary, the performance improved by 3%, though that may be
within the margin of error.



The test detail is following.

# condition

* 1 coordinator and 3 foreign servers

* All 4 instances share one SSD disk.

* Each transaction queries two different foreign servers.

``` fxact_update.pgbench
\set id random(1, 1000000)

\set partnum 3
\set p1 random(1, :partnum)
\set p2 ((:p1 + 1) % :partnum) + 1

BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

* pgbench generates the load. I increased ${RATE} little by little until the
"maximum number of foreign transactions reached" error happened.

```
pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
```

* parameters
max_prepared_transactions = 100
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 4


# test source code patterns

1. 2pc patches(v36) based on 6d0eb385 (foreign_twophase_commit = required).
2. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = required).
3. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = disabled).
4. 2595e039 without 2pc patches(v37).


# results

1. tps = 241.8000TPS
   latency average = 10.413ms

2. tps = 359.017519 (1.5x compared to 1.; 36% of 3.)
   latency average = 15.427ms

3. tps = 987.372220 (1.03x compared to 4.)
   latency average = 8.102ms

4. tps = 955.984574
   latency average = 8.368ms

The disk is the bottleneck in my environment because disk utilization is almost
100% in every pattern. If a separate disk could be prepared for each instance,
I think we can expect more performance improvements.


>> In my understanding, there are three improvement idea. First is that to make
>> the resolver processes run in parallel. Second is that to send "COMMIT/ABORT
>> PREPARED" remote servers in bulk. Third is to stop syncing the WAL
>> remove_fdwxact() after resolving is done, which I addressed in the mail sent
>> at June 3rd, 13:56. Since third idea is not yet discussed, there may
>> be my misunderstanding.
> 
> Yes, those optimizations are promising. On the other hand, they could
> introduce complexity to the code and APIs. I'd like to keep the first
> version simple. I think we need to discuss them at this stage but can
> leave the implementation of both parallel execution and batch
> execution as future improvements.

OK, I agree.


> For the third idea, I think the implementation was wrong; it removes
> the state file then flushes the WAL record. I think these should be
> performed in the reverse order. Otherwise, FdwXactState entry could be
> left on the standby if the server crashes between them. I might be
> missing something though.

Oh, I see. I think you're right, though what you meant to say is that it
flushes the WAL records and then removes the state file. If "COMMIT/ABORT
PREPARED" statements are executed in bulk, it seems enough to sync the WAL only
once, then remove all the related state files.
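
That ordering can be sketched as follows (step names invented for illustration, not the patch's functions): resolve the batch, log the removals, flush the WAL once, and only then remove the on-disk state files.

```python
# Sketch of bulk resolution with the corrected ordering: WAL flush before
# state-file removal, and only one flush for the whole batch.

def resolve_batch(gids, log):
    for gid in gids:
        log.append(("commit_prepared", gid))    # second phase on the foreign server
        log.append(("wal_remove_record", gid))  # XLOG_FDWXACT_REMOVE
    log.append(("wal_flush",))                  # one sync for the whole batch
    for gid in gids:
        log.append(("remove_state_file", gid))  # safe only after the flush
    return log
```

If the server crashes between the flush and the file removals, recovery sees the removal records and no stale FdwXactState entry is left behind on a standby.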


BTW, when I built the binary with -O2, I got the following warning.
It needs to be fixed.

```
fdwxact.c: In function 'PrepareAllFdwXacts':
fdwxact.c:897:13: warning: 'flush_lsn' may be used uninitialized in this
function [-Wmaybe-uninitialized]
  897 |  canceled = SyncRepWaitForLSN(flush_lsn, false);
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"r.takahashi_2@fujitsu.com"
Date:
Hi,


I'm interested in this patch, and I also ran the same test with Ikeda-san's fxact_update.pgbench.
In my environment (a poor-spec VM), the results are the following.

* foreign_twophase_commit = disabled
363tps

* foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
13tps


I analyzed the bottleneck using pstack and strace.
I noticed that the open() during "COMMIT PREPARED" command is very slow.

In my environment the latency of the "COMMIT PREPARED" is 16ms.
(On the other hand, the latency of "COMMIT" and "PREPARE TRANSACTION" is 1ms)
In the "COMMIT PREPARED" command, the open() of the WAL segment file takes 14ms.
Therefore, open() is the bottleneck of "COMMIT PREPARED".
Furthermore, I noticed that the backend process almost always opens the same WAL segment file.

In the current patch, the backend process on the foreign server which is associated with the connection from the
resolver process always runs the "COMMIT PREPARED" command.
Therefore, the WAL segment file used by the current "COMMIT PREPARED" command is probably the same as that of the
previous "COMMIT PREPARED" command.

In order to improve the performance of the resolver process, I think it would be useful to skip closing the WAL
segment file during "COMMIT PREPARED" and reuse the file descriptor.
Is that possible?
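The proposed optimization can be sketched with a toy model that counts open() calls (the class and method names below are hypothetical, not part of the patch):

```python
class WalSegmentReader:
    """Toy model of reading PREPARE WAL records during COMMIT PREPARED.

    Instead of opening and closing the WAL segment file on every call,
    cache the most recently opened segment and reopen only when the
    target segment changes.
    """
    def __init__(self):
        self.open_calls = 0
        self.cached_segno = None  # segment currently held "open"

    def read_prepare_record(self, segno):
        if segno != self.cached_segno:  # cache miss: pay for open()
            self.open_calls += 1
            self.cached_segno = segno
        # ... read the record via the cached descriptor ...
```

With five consecutive "COMMIT PREPARED" commands hitting segments 7, 7, 7, 8, 8, the cache performs only two open() calls, whereas an open-per-command approach would perform five.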


Regards,
Ryohei Takahashi



RE: Transactions involving multiple postgres foreign servers, take 2

From
"k.jamison@fujitsu.com"
Date:
On Wed, June 30, 2021 10:06 (GMT+9), Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments from
> Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign
> server, committing prepared foreign transactions on foreign servers in parallel.

Hi Sawada-san,
Thank you for the latest set of patches.
I noticed from cfbot that the regression test failed, and I also could not compile the patch.

============== running regression test queries        ==============
test test_fdwxact                 ... FAILED       21 ms
============== shutting down postmaster               ==============
======================
 1 of 1 tests failed. 
======================

> To get a better performance based on the current architecture, we can have
> multiple resolver processes per foreign server but it seems not easy to tune it
> in practice. Perhaps is it better if we simply have a pool of resolver processes
> and we assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes as
> many as the concurrent backends using 2PC.

Yes, finding the right values to tune for max_prepared_foreign_transactions and
max_prepared_transactions seems difficult. If we set the number of resolver
processes to the number of concurrent backends using 2PC, how do we determine
the value of max_foreign_transaction_resolvers? It might be good to collect some
statistics to judge the value, so that we can compare the performance against
the V37 version.

-
Also, this is a bit of a side topic; I know we've been discussing how to
improve/fix the resolver process bottlenecks, and Takahashi-san provided
details earlier in the thread on where V37 has problems. (I am joining the testing too.)

I am not sure if this has been brought up before, given the years this thread
spans, but I think we need to prevent the resolver process from entering an
infinite retry loop while resolving a prepared foreign transaction. Currently,
when a crashed foreign server is recovered during resolution retries, the
information is recovered from WAL and state files, and the resolver process
resumes the foreign transaction resolution. However, what if we cannot (or
intentionally do not want to) recover the crashed server for a long time?

One idea is to make the resolver process stop automatically after some
maximum number of retries.
We could call the parameter foreign_transaction_resolution_max_retry_count.
There may be a better name, but I followed the naming pattern from your patch.

The time until the resolver gives up can be estimated from the proposed
foreign_transaction_resolution_retry_interval parameter (default 10s) in the
patch set.
In addition, according to the docs, "a foreign server using the postgres_fdw
foreign data wrapper can have the same options that libpq accepts in
connection strings", so the connect_timeout set during CREATE SERVER also
affects it.

Example:
    CREATE SERVER's connect_timeout setting = 5s
    foreign_transaction_resolution_retry_interval = 10s
    foreign_transaction_resolution_max_retry_count = 3

    Estimated total time before resolver stops: 
    = (5s) * (3 + 1) + (10s) * (3) = 50 s

00s: 1st connect start
05s: 1st connect timeout
(retry interval)
15s: 2nd connect start (1st retry)
20s: 2nd connect timeout
(retry interval)
30s: 3rd connect start (2nd retry)
35s: 3rd connect timeout
(retry interval)
45s: 4th connect start (3rd retry)
50s: 4th connect timeout
(resolver process stops)
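The estimate above can be expressed as a small formula (the parameter names come from the proposed, not yet committed, patch set):

```python
def resolver_give_up_seconds(connect_timeout, retry_interval, max_retries):
    """Estimated time until the resolver stops retrying: one initial
    attempt plus max_retries retries, each attempt lasting up to
    connect_timeout seconds, with retry_interval between attempts."""
    attempts = max_retries + 1
    return connect_timeout * attempts + retry_interval * max_retries
```

With connect_timeout = 5s, retry_interval = 10s, and max_retry_count = 3, this gives the 50 seconds shown in the timeline above.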

Then the resolver process will not wait indefinitely; it will stop after some
time that depends on the settings of the above parameters.
This would be an automatic counterpart of pg_stop_foreign_xact_resolver.
Once the resolver has stopped and it is decided to restore the crashed server,
the user can then execute pg_resolve_foreign_xact().
Do you think this idea is feasible, and can we add it as part of the patch set?

Regards,
Kirk Jamison

Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/06/30 10:05, Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments
> from Fujii-san and Ikeda-san I got so far.

Thanks for updating the patches!

I'm now reading the 0001 and 0002 patches and wondering if we can commit them
first, because they just provide an independent basic mechanism for
foreign transaction management.

One question regarding them: why did we add the new API only for the "top" foreign
transaction? Even with these patches, the old API (CallSubXactCallbacks) is still
used for foreign subtransactions, and xact_depth is still managed
in the postgres_fdw layer (not in PostgreSQL core). Is this intentional?
Sorry if this was already discussed before.

As far as I read the code, keeping the old API for foreign subtransactions doesn't
cause any actual bug. But it's strange and half-baked to manage top-level and
sub-transactions in different layers and to use the old and new APIs for them.

OTOH, I'm afraid that adding a new (non-essential) API for foreign subtransactions
might increase the code complexity unnecessarily.

Thoughts?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
Sorry for the late reply.

On Mon, Jul 5, 2021 at 3:29 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
>
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> >>
> >> Hi Jamison-san, sawada-san,
> >>
> >> Thanks for testing!
> >>
> >> FWIF, I tested using pgbench with "--rate=" option to know the server
> >> can execute transactions with stable throughput. As sawada-san said,
> >> the latest patch resolved second phase of 2PC asynchronously. So,
> >> it's difficult to control the stable throughput without "--rate=" option.
> >>
> >> I also worried what I should do when the error happened because to increase
> >> "max_prepared_foreign_transaction" doesn't work. Since too overloading may
> >> show the error, is it better to add the case to the HINT message?
> >>
> >> BTW, if sawada-san already develop to run the resolver processes in parallel,
> >> why don't you measure performance improvement? Although Robert-san,
> >> Tunakawa-san and so on are discussing what architecture is best, one
> >> discussion point is that there is a performance risk if adopting asynchronous
> >> approach. If we have promising solutions, I think we can make the discussion
> >> forward.
> >
> > Yeah, if we can asynchronously resolve the distributed transactions
> > without worrying about max_prepared_foreign_transaction error, it
> > would be good. But we will need synchronous resolution at some point.
> > I think we at least need to discuss it at this point.
> >
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far. We launch a resolver
> > process per foreign server, committing prepared foreign transactions
> > on foreign servers in parallel. To get a better performance based on
> > the current architecture, we can have multiple resolver processes per
> > foreign server but it seems not easy to tune it in practice. Perhaps
> > is it better if we simply have a pool of resolver processes and we
> > assign a resolver process to the resolution of one distributed
> > transaction one by one? That way, we need to launch resolver processes
> > as many as the concurrent backends using 2PC.
>
> Thanks for updating the patches.
>
> I have tested in my local laptop and summary is the following.

Thank you for testing!

>
> (1) The latest patch(v37) can improve throughput by 1.5 times compared to v36.
>
> Although I expected it improves by 2.0 times because the workload is that one
> transaction access two remote servers... I think the reason is that the disk
> is bottleneck and I couldn't prepare disks for each postgresql servers. If I
> could, I think the performance can be improved by 2.0 times.
>
>
> (2) The latest patch(v37) throughput of foreign_twophase_commit = required is
> about 36% compared to the case if foreign_twophase_commit = disabled.
>
> Although the throughput is improved, the absolute performance is not good. It
> may be the fate of 2PC. I think the reason is that the number of WAL writes is
> much increase and, the disk writes in my laptop is the bottleneck. I want to
> know the result testing in richer environments if someone can do so.
>
>
> (3) The latest patch(v37) has no overhead if foreign_twophase_commit =
> disabled. On the contrary, the performance improved by 3%. It may be within
> the margin of error.
>
>
>
> The test detail is following.
>
> # condition
>
> * 1 coordinator and 3 foreign servers
>
> * 4 instance shared one ssd disk.
>
> * one transaction queries different two foreign servers.
>
> ``` fxact_update.pgbench
> \set id random(1, 1000000)
>
> \set partnum 3
> \set p1 random(1, :partnum)
> \set p2 ((:p1 + 1) % :partnum) + 1
>
> BEGIN;
> UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> COMMIT;
> ```
>
> * pgbench generates load. I increased ${RATE} little by little until "maximum
> number of foreign transactions reached" error happens.
>
> ```
> pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
> ```
>
> * parameters
> max_prepared_transactions = 100
> max_prepared_foreign_transactions = 200
> max_foreign_transaction_resolvers = 4
>
>
> # test source code patterns
>
> 1. 2pc patches(v36) based on 6d0eb385 (foreign_twophase_commit = required).
> 2. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = required).
> 3. 2pc patches(v37) based on 2595e039 (foreign_twophase_commit = disabled).
> 4. 2595e039 without 2pc patches(v37).
>
>
> # results
>
> 1. tps = 241.8000TPS
>    latency average = 10.413ms
>
> 2. tps = 359.017519 ( by 1.5 times compared to 1. by 0.36% compared to 3.)
>    latency average = 15.427ms
>
> 3. tps = 987.372220 ( by 1.03% compared to 4. )
>    latency average = 8.102ms
>
> 4. tps = 955.984574
>    latency average = 8.368ms
>
> The disk is the bottleneck in my environment because disk util is almost 100%
> in every pattern. If disks for each instance can be prepared, I think we can
> expect more performance improvements.

The performance still doesn't seem good. I'll also test using your script.

>
>
> >> In my understanding, there are three improvement idea. First is that to make
> >> the resolver processes run in parallel. Second is that to send "COMMIT/ABORT
> >> PREPARED" remote servers in bulk. Third is to stop syncing the WAL
> >> remove_fdwxact() after resolving is done, which I addressed in the mail sent
> >> at June 3rd, 13:56. Since third idea is not yet discussed, there may
> >> be my misunderstanding.
> >
> > Yes, those optimizations are promising. On the other hand, they could
> > introduce complexity to the code and APIs. I'd like to keep the first
> > version simple. I think we need to discuss them at this stage but can
> > leave the implementation of both parallel execution and batch
> > execution as future improvements.
>
> OK, I agree.
>
>
> > For the third idea, I think the implementation was wrong; it removes
> > the state file then flushes the WAL record. I think these should be
> > performed in the reverse order. Otherwise, FdwXactState entry could be
> > left on the standby if the server crashes between them. I might be
> > missing something though.
>
> Oh, I see. I think you're right though what you wanted to say is that it
> flushes the WAL records then removes the state file. If "COMMIT/ABORT
> PREPARED" statements execute in bulk, it seems enough to sync the wal only
> once, then remove all related state files.
>
>
> BTW, I tested the binary building with -O2, and I got the following warnings.
> It's needed to be fixed.
>
> ```
> fdwxact.c: In function 'PrepareAllFdwXacts':
> fdwxact.c:897:13: warning: 'flush_lsn' may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>   897 |  canceled = SyncRepWaitForLSN(flush_lsn, false);
>       |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ```

Thank you for the report. I'll fix it in the next version patch.


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
Sorry for the late reply.

On Tue, Jul 6, 2021 at 3:15 PM r.takahashi_2@fujitsu.com
<r.takahashi_2@fujitsu.com> wrote:
>
> Hi,
>
>
> I'm interested in this patch and I also run the same test with Ikeda-san's fxact_update.pgbench.

Thank you for testing!

> In my environment (poor spec VM), the result is following.
>
> * foreign_twophase_commit = disabled
> 363tps
>
> * foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
> 13tps
>
>
> I analyzed the bottleneck using pstack and strace.
> I noticed that the open() during "COMMIT PREPARED" command is very slow.
>
> In my environment the latency of the "COMMIT PREPARED" is 16ms.
> (On the other hand, the latency of "COMMIT" and "PREPARE TRANSACTION" is 1ms)
> In the "COMMIT PREPARED" command, open() for wal segment file takes 14ms.
> Therefore, open() is the bottleneck of "COMMIT PREPARED".
> Furthermore, I noticed that the backend process almost always open the same wal segment file.
>
> In the current patch, the backend process on foreign server which is associated with the connection from the resolver process always run "COMMIT PREPARED" command.
> Therefore, the wal segment file of the current "COMMIT PREPARED" command probably be the same with the previous "COMMIT PREPARED" command.
>
> In order to improve the performance of the resolver process, I think it is useful to skip closing wal segment file during the "COMMIT PREPARED" and reuse file descriptor.
> Is it possible?

I'm not sure, but it might be possible to keep holding an xlogreader for
reading PREPARE WAL records even after the transaction commits. But I
wonder how much the open() of the WAL segment file accounts for in the
total execution time of 2PC. 2PC requires two network round trips for each
participant. For example, if the total took 500ms, we would not get much
benefit from the point of view of 2PC performance even if we improved
open() from 14ms to 1ms.
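Putting that reasoning into numbers (the 500ms total is a hypothetical figure, as in the text above):

```python
def fraction_saved(total_ms, open_ms_before, open_ms_after):
    """Fraction of total 2PC latency saved by speeding up one open() call."""
    return (open_ms_before - open_ms_after) / total_ms
```

Improving a single open() from 14ms to 1ms against a 500ms total saves only 2.6% of the latency; the optimization pays off mainly when network round trips are cheap, e.g. when the foreign servers are local.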

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Fri, Jul 9, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far.
>
> Thanks for updating the patches!
>
> I'm now reading 0001 and 0002 patches and wondering if we can commit them
> at first because they just provide independent basic mechanism for
> foreign transaction management.
>
> One question regarding them is; Why did we add new API only for "top" foreign
> transaction? Even with those patches, old API (CallSubXactCallbacks) is still
> being used for foreign subtransaction and xact_depth is still being managed
> in postgres_fdw layer (not PostgreSQL core). Is this intentional?

Yes, it's not needed for 2PC support, and I was also concerned about adding
complexity to the core by adding a new API for subtransactions that is
not strictly necessary for 2PC.

> As far as I read the code, keep using old API for foreign subtransaction doesn't
> cause any actual bug. But it's just strange and half-baked to manage top and
> sub transaction in the differenet layer and to use old and new API for them.

That's a valid concern. I'm really not sure what we should do here, but
I guess that even if we want to support subtransactions, we would have a
separate API dedicated to subtransaction commit and rollback.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"r.takahashi_2@fujitsu.com"
Date:
Hi Sawada-san,


Thank you for your reply.

> Not sure but it might be possible to keep holding an xlogreader for
> reading PREPARE WAL records even after the transaction commit. But I
> wonder how much open() for wal segment file accounts for the total
> execution time of 2PC. 2PC requires 2 network round trips for each
> participant. For example, if it took 500ms in total, we would not get
> benefits much from the point of view of 2PC performance even if we
> improved it from 14ms to 1ms.

I made the patch based on your advice and re-ran the test on a new machine.
(The attached patch is just for test purposes.)


* foreign_twophase_commit = disabled
2686tps

* foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
311tps

* foreign_twophase_commit = required with attached patch (It is not necessary to set -R ${RATE})
2057tps


This indicates that if we can reduce the number of open() calls on the WAL segment file during "COMMIT PREPARED",
the performance can be improved.

This patch can skip closing the WAL segment file, but I don't know when we should close it.
One idea is to close it when the WAL segment file is recycled, but that seems difficult for the backend process to do.

BTW, in a previous discussion, "sending COMMIT PREPARED to remote servers in bulk" was proposed.
I imagine a new SQL interface like "COMMIT PREPARED 'prep_1', 'prep_2', ... 'prep_n'".
If we can keep the WAL segment file open during a bulk COMMIT PREPARED, we can reduce not only the number of
communications but also the number of open() calls on the WAL segment file.


Regards,
Ryohei Takahashi

Attachment

Re: Transactions involving multiple postgres foreign servers, take 2

From
Ranier Vilela
Date:
On Tue, Jul 13, 2021 at 01:14, r.takahashi_2@fujitsu.com <r.takahashi_2@fujitsu.com> wrote:
Hi Sawada-san,


Thank you for your reply.

> Not sure but it might be possible to keep holding an xlogreader for
> reading PREPARE WAL records even after the transaction commit. But I
> wonder how much open() for wal segment file accounts for the total
> execution time of 2PC. 2PC requires 2 network round trips for each
> participant. For example, if it took 500ms in total, we would not get
> benefits much from the point of view of 2PC performance even if we
> improved it from 14ms to 1ms.

I made the patch based on your advice and re-run the test on the new machine.
(The attached patch is just for test purpose.)
Wouldn't it be better to explicitly initialize the pointer with NULL?
I think it's common in Postgres.

static XLogReaderState *xlogreader = NULL;



* foreign_twophase_commit = disabled
2686tps

* foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
311tps

* foreign_twophase_commit = required with attached patch (It is not necessary to set -R ${RATE})
2057tps
Nice results.

regards,
Ranier Vilela

RE: Transactions involving multiple postgres foreign servers, take 2

From
"r.takahashi_2@fujitsu.com"
Date:
Hi,


> Wouldn't it be better to explicitly initialize the pointer with NULL?

Thank you for your advice.
You are correct.

Anyway, I fixed it and re-ran the performance test; of course, it does not affect tps.

Regards,
Ryohei Takahashi

Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
On Tue, Jul 13, 2021 at 1:14 PM r.takahashi_2@fujitsu.com
<r.takahashi_2@fujitsu.com> wrote:
>
> Hi Sawada-san,
>
>
> Thank you for your reply.
>
> > Not sure but it might be possible to keep holding an xlogreader for
> > reading PREPARE WAL records even after the transaction commit. But I
> > wonder how much open() for wal segment file accounts for the total
> > execution time of 2PC. 2PC requires 2 network round trips for each
> > participant. For example, if it took 500ms in total, we would not get
> > benefits much from the point of view of 2PC performance even if we
> > improved it from 14ms to 1ms.
>
> I made the patch based on your advice and re-run the test on the new machine.
> (The attached patch is just for test purpose.)

Thank you for testing!

>
>
> * foreign_twophase_commit = disabled
> 2686tps
>
> * foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
> 311tps
>
> * foreign_twophase_commit = required with attached patch (It is not necessary to set -R ${RATE})
> 2057tps

Nice improvement!

BTW did you test on the local? That is, the foreign servers are
located on the same machine?

>
>
> This indicate that if we can reduce the number of times to open() wal segment file during "COMMIT PREPARED", the performance can be improved.
>
> This patch can skip closing wal segment file, but I don't know when we should close.
> One idea is to close when the wal segment file is recycled, but it seems difficult for backend process to do so.

I guess it would be better to start a new thread for this improvement.
This idea not only helps the 2PC case but also improves
COMMIT/ROLLBACK PREPARED performance itself. Rather than tying it
to this patch set, I think it's good to discuss it separately so that
it can be committed on its own.

> BTW, in previous discussion, "Send COMMIT PREPARED remote servers in bulk" is proposed.
> I imagined the new SQL interface like "COMMIT PREPARED 'prep_1', 'prep_2', ... 'prep_n'".
> If we can open wal segment file during bulk COMMIT PREPARED, we can not only reduce the times of communication, but also reduce the times of open() wal segment file.

What if we successfully committed 'prep_1' but an error happened
while committing another one for some reason (e.g., corrupted 2PC
state file, OOM, etc.)? We might return an error to the client but have
already committed 'prep_1'.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



RE: Transactions involving multiple postgres foreign servers, take 2

From
"r.takahashi_2@fujitsu.com"
Date:
Hi Sawada-san,


Thank you for your reply.

> BTW did you test on the local? That is, the foreign servers are
> located on the same machine?

Yes, I tested locally, since I cannot prepare a good network environment now.


> I guess it would be better to start a new thread for this improvement.

Thank you for your advice.
I started a new thread [1].


> What if we successfully committed 'prep_1' but an error happened
> during committing another one for some reason (i.g., corrupted 2PC
> state file, OOM etc)? We might return an error to the client but have
> already committed 'prep_1'.

Sorry, I don't have a good idea now.
I imagine the command could return the list of transaction IDs that ended with an error.


[1]
https://www.postgresql.org/message-id/OS0PR01MB56828019B25CD5190AB6093282129%40OS0PR01MB5682.jpnprd01.prod.outlook.com


Regards,
Ryohei Takahashi



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/07/09 22:44, Masahiko Sawada wrote:
> On Fri, Jul 9, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> As far as I read the code, keep using old API for foreign subtransaction doesn't
>> cause any actual bug. But it's just strange and half-baked to manage top and
>> sub transaction in the differenet layer and to use old and new API for them.
> 
> That's a valid concern. I'm really not sure what we should do here but
> I guess that even if we want to support subscriptions we have another
> API dedicated for subtransaction commit and rollback.
OK, so if possible I will write a POC patch for a new API for foreign subtransactions
and consider whether it's simple enough that we can commit it into core or not.


+#define FDWXACT_FLAG_PARALLEL_WORKER    0x02    /* is parallel worker? */

This implies that parallel workers may execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED against the foreign server for atomic commit?
If so, what happens if the PREPARE TRANSACTION that one of the
parallel workers issues fails? In that case, not only that parallel worker
but also the other parallel workers and the leader should roll back the
entire transaction. That is, they should issue ROLLBACK PREPARED to the foreign servers.
Is this issue already handled and addressed in the patches?

This seems not to be an actual issue if only postgres_fdw is used, because postgres_fdw
doesn't have the IsForeignScanParallelSafe API. Right? But what about other FDWs?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"k.jamison@fujitsu.com"
Date:
Hi Sawada-san,

I noticed that this thread and its set of patches have been marked as "Returned with Feedback" by yourself.
I find the feature (atomic commit for foreign transactions) very useful,
and it will pave the road for having distributed transaction management in Postgres.
Although we have not arrived at a consensus on which approach is best,
there have been significant reviews and major patch changes in the past two years.
By any chance, do you have any plans to continue this from where you left off?

Regards,
Kirk Jamison


Re: Transactions involving multiple postgres foreign servers, take 2

From
Masahiko Sawada
Date:
Hi,

On Tue, Oct 5, 2021 at 9:56 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> Hi Sawada-san,
>
> I noticed that this thread and its set of patches have been marked with "Returned with Feedback" by yourself.
> I find the feature (atomic commit for foreign transactions) very useful
> and it will pave the road for having a distributed transaction management in Postgres.
> Although we have not arrived at consensus at which approach is best,
> there were significant reviews and major patch changes in the past 2 years.
> By any chance, do you have any plans to continue this from where you left off?

As I could not reply to the review comments from Fujii-san for almost
three months, I don't have enough time to move this project forward at
least for now. That's why I marked this patch as RWF. I’d like to
continue working on this project in my spare time but I know this is
not a project that can be completed by using only my spare time. If
someone wants to work on this project, I’d appreciate it and am happy
to help.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/10/05 10:38, Masahiko Sawada wrote:
> Hi,
> 
> On Tue, Oct 5, 2021 at 9:56 AM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
>>
>> Hi Sawada-san,
>>
>> I noticed that this thread and its set of patches have been marked with "Returned with Feedback" by yourself.
>> I find the feature (atomic commit for foreign transactions) very useful
>> and it will pave the road for having a distributed transaction management in Postgres.
>> Although we have not arrived at consensus at which approach is best,
>> there were significant reviews and major patch changes in the past 2 years.
>> By any chance, do you have any plans to continue this from where you left off?
> 
> As I could not reply to the review comments from Fujii-san for almost
> three months, I don't have enough time to move this project forward at
> least for now. That's why I marked this patch as RWF. I’d like to
> continue working on this project in my spare time but I know this is
> not a project that can be completed by using only my spare time. If
> someone wants to work on this project, I’d appreciate it and am happy
> to help.

Probably it's time to rethink the approach. The patch introduces
a foreign transaction manager into PostgreSQL core, but as far as
I've reviewed it, its changes look like overkill and too complicated.
This seems to be one of the reasons why we have not been able to commit
the feature even after several years.

Another concern about the patch's approach is that it needs
to change the backend so that it additionally waits for replication
during the commit phase before executing PREPARE TRANSACTION
on foreign servers, which would decrease commit-phase performance
even further.

So I wonder if it's worth revisiting the original approach, i.e.,
adding atomic commit into postgres_fdw. One disadvantage of
this is that it supports atomic commit only between foreign
PostgreSQL servers, not other data sources like MySQL.
But I'm not sure we really want to do atomic commit across
various FDWs. Maybe supporting only postgres_fdw is enough
for most users. Thoughts?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Transactions involving multiple postgres foreign servers, take 2

From
"k.jamison@fujitsu.com"
Date:
Hi Fujii-san and Sawada-san,

Thank you very much for your replies.

> >> I noticed that this thread and its set of patches have been marked with
> "Returned with Feedback" by yourself.
> >> I find the feature (atomic commit for foreign transactions) very
> >> useful and it will pave the road for having a distributed transaction
> management in Postgres.
> >> Although we have not arrived at consensus at which approach is best,
> >> there were significant reviews and major patch changes in the past 2 years.
> >> By any chance, do you have any plans to continue this from where you left off?
> >
> > As I could not reply to the review comments from Fujii-san for almost
> > three months, I don't have enough time to move this project forward at
> > least for now. That's why I marked this patch as RWF. I’d like to
> > continue working on this project in my spare time but I know this is
> > not a project that can be completed by using only my spare time. If
> > someone wants to work on this project, I’d appreciate it and am happy
> > to help.
> 
> Probably it's time to rethink the approach. The patch introduces foreign
> transaction manager into PostgreSQL core, but as far as I review the patch, its
> changes look overkill and too complicated.
> This seems one of reasons why we could not have yet committed the feature even
> after several years.
> 
> Another concern about the approach of the patch is that it needs to change a
> backend so that it additionally waits for replication during commit phase before
> executing PREPARE TRANSACTION to foreign servers. Which would decrease the
> performance during commit phase furthermore.
> 
> So I wonder if it's worth revisiting the original approach, i.e., add the atomic
> commit into postgres_fdw. One disadvantage of this is that it supports atomic
> commit only between foreign PostgreSQL servers, not other various data
> resources like MySQL.
> But I'm not sure if we really want to do atomic commit between various FDWs.
> Maybe supporting only postgres_fdw is enough for most users. Thought?

The intention of Sawada-san's patch is grand, and it would be very helpful
because it accommodates possible future support of atomic commit for
various types of FDWs. However, it has been difficult to reach agreement on it,
as other reviewers have also pointed out commit performance. Another point is
how it should work when we also implement atomic visibility (which is a separate
topic for distributed transactions but worth considering).
That said, if we're going to initially support this in postgres_fdw, which is simpler
than the latest patches, we need to ensure that abnormalities and errors
are properly handled, and prove that commit performance can be improved,
e.g. by committing not serially but in parallel.
And if possible, although not necessary in the first step, it might put the
other reviewers at ease if we also sketch out how to implement atomic
visibility in postgres_fdw.
Thoughts?

Regards,
Kirk Jamison

Re: Transactions involving multiple postgres foreign servers, take 2

From
Etsuro Fujita
Date:
Hi,

On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
> That said, if we're going to initially support it in postgres_fdw, which is simpler
> than the latest patches, we need to ensure that abnormalities and errors
> are properly handled, and prove that commit performance can be improved,
> e.g. by committing not serially but in parallel where possible.

If it's ok with you, I'd like to work on the performance issue.  What
I have in mind is to commit all remote transactions in parallel instead
of sequentially in the postgres_fdw transaction callback, as mentioned
above; I think that would improve performance even for the
one-phase commit that we already have.  Maybe I'm missing something,
though.
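To illustrate the idea (a minimal sketch only; commit_remote() and the server
names are stand-ins, not postgres_fdw code), ending the remote transactions
concurrently rather than one by one could look like this:

```python
# Sketch: finish remote transactions in parallel instead of sequentially.
# commit_remote() stands in for issuing COMMIT (or COMMIT PREPARED) over
# one foreign-server connection; here it is a stub that always succeeds.
from concurrent.futures import ThreadPoolExecutor

def commit_remote(server: str) -> tuple[str, bool]:
    # Real code would send the commit over the connection for `server`
    # and report whether it succeeded.
    return server, True

def commit_all_parallel(servers: list[str]) -> dict[str, bool]:
    # Dispatch every commit at once; total latency is then roughly the
    # slowest single commit rather than the sum of all of them.
    with ThreadPoolExecutor(max_workers=max(1, len(servers))) as pool:
        return dict(pool.map(commit_remote, servers))

results = commit_all_parallel(["fs1", "fs2", "fs3"])
```

Sequential commit costs the sum of the round trips; dispatching them
concurrently bounds the commit phase by the slowest server, which is why
this should help even the one-phase commit we already have.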

Best regards,
Etsuro Fujita



Re: Transactions involving multiple postgres foreign servers, take 2

From
Fujii Masao
Date:

On 2021/10/07 19:47, Etsuro Fujita wrote:
> Hi,
> 
> On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
>> That said, if we're going to initially support it in postgres_fdw, which is simpler
>> than the latest patches, we need to ensure that abnormalities and errors
>> are properly handled

Yes. One idea for this is to include the information required to resolve
outstanding prepared transactions in the transaction identifier that the
PREPARE TRANSACTION command uses. For example, we can use the XID of the
local transaction and the cluster ID of the local server (e.g., a cluster_name
that users set to a unique value can serve as that ID) as that information.
If the cluster_name of the local server is "server1" and its current XID is 9999,
postgres_fdw issues "PREPARE TRANSACTION 'server1_9999'" and
"COMMIT PREPARED 'server1_9999'" to the foreign servers, to end those
foreign transactions in a two-phase way.

If some trouble happens, the prepared transaction with "server1_9999"
may unexpectedly remain on one foreign server. In this case we can
determine whether to commit or roll back that outstanding transaction
by checking whether the past transaction with XID 9999 was committed
or rolled back on the server "server1". If it was committed, the prepared
transaction should also be committed, so we should execute
"COMMIT PREPARED 'server1_9999'". If it was rolled back, the prepared
transaction should also be rolled back. If it's still in progress, we should
do nothing for that transaction.
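The resolution rule above can be sketched as follows (an illustration only;
make_gid and parse_gid are hypothetical helpers, and the status strings mirror
what pg_xact_status() reports, but none of this is actual patch code):

```python
# Sketch of the manual-resolution rule described above. The GID encodes
# the local cluster_name and the local XID, e.g. "server1_9999".

def make_gid(cluster_name: str, xid: int) -> str:
    """Build the identifier passed to PREPARE TRANSACTION on foreign servers."""
    return f"{cluster_name}_{xid}"

def parse_gid(gid: str) -> tuple[str, int]:
    """Recover the originating cluster and XID from a leftover GID."""
    cluster, _, xid = gid.rpartition("_")
    return cluster, int(xid)

def resolve_action(local_status: str) -> str:
    """Map the local transaction's status (as pg_xact_status() reports it)
    to the action to take on the outstanding prepared transaction."""
    if local_status == "committed":
        return "COMMIT PREPARED"
    if local_status == "aborted":
        return "ROLLBACK PREPARED"
    return "do nothing"  # 'in progress', or status unknown

cluster, xid = parse_gid("server1_9999")
action = resolve_action("committed")
```

rpartition() keeps the mapping unambiguous even when the cluster_name
itself contains underscores, since the XID always follows the last one.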

pg_xact_status() can be used to check whether the transaction with
the specified XID was committed or rolled back. But pg_xact_status()
can return an invalid result if the CLOG data for the specified XID has
been truncated by VACUUM FREEZE. To handle this case, we might need a
special table that tracks transaction status.

A DBA can use the above procedure to manually resolve the outstanding
prepared transactions on foreign servers. We could probably also implement
a function that performs the procedure. If so, it might be a good idea to have
a background worker or a cron job execute the function periodically.


>> and prove that commit performance can be improved,
>> e.g. by committing not serially but in parallel where possible.
> 
> If it's ok with you, I'd like to work on the performance issue.  What
> I have in mind is to commit all remote transactions in parallel instead
> of sequentially in the postgres_fdw transaction callback, as mentioned
> above; I think that would improve performance even for the
> one-phase commit that we already have.

+100

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Transactions involving multiple postgres foreign servers, take 2

From
Etsuro Fujita
Date:
Fujii-san,

On Thu, Oct 7, 2021 at 11:37 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2021/10/07 19:47, Etsuro Fujita wrote:
> > On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com
> > <k.jamison@fujitsu.com> wrote:
> >> and prove that commit performance can be improved,
> >> e.g. by committing not serially but in parallel where possible.
> >
> > If it's ok with you, I'd like to work on the performance issue.  What
> > I have in mind is to commit all remote transactions in parallel instead
> > of sequentially in the postgres_fdw transaction callback, as mentioned
> > above; I think that would improve performance even for the
> > one-phase commit that we already have.
>
> +100

I’ve started working on this.  Once I have a (POC) patch, I’ll post it
in a new thread, as I think it can be discussed separately.

Thanks!

Best regards,
Etsuro Fujita