Thread: Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Tue, Jun 5, 2018 at 7:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, May 26, 2018 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Regarding the API design, should we use 2PC for a distributed
>>> transaction if both two or more 2PC-capable foreign servers and a
>>> 2PC-non-capable foreign server are involved in it? Or should we end
>>> up with an error? The 2PC-non-capable server might be one that has
>>> 2PC functionality but disables it, or one that doesn't have it at all.
>>
>> It seems to me that this is functionality that many people will not
>> want to use. First, doing a PREPARE and then a COMMIT for each FDW
>> write transaction is bound to be more expensive than just doing a
>> COMMIT. Second, because the default value of
>> max_prepared_transactions is 0, this can only work at all if special
>> configuration has been done on the remote side. Because of the second
>> point in particular, it seems to me that the default for this new
>> feature must be "off". It would make no sense to ship a default
>> configuration of PostgreSQL that doesn't work with the default
>> configuration of postgres_fdw, and I do not think we want to change
>> the default value of max_prepared_transactions. It was changed from 5
>> to 0 a number of years back for good reason.
>
> I'm not sure that many people will not want to use this feature,
> because it seems to me that there are many people who don't want to
> use a database that lacks transaction atomicity. But I agree that
> this feature should not be enabled by default, just as we disable 2PC
> by default.
>
>> So, I think the question could be broadened a bit: how do you enable
>> this feature if you want it, and what happens if you want it but it's
>> not available for your choice of FDW? One possible enabling method is
>> a GUC (e.g. foreign_twophase_commit).
>> It could be true/false, with true meaning use PREPARE for all FDW
>> writes and fail if that's not supported, or it could be three-valued,
>> like require/prefer/disable, with require throwing an error if
>> PREPARE support is not available and prefer using PREPARE where
>> available but without failing when it isn't available. Another
>> possibility could be to make it an FDW option, possibly capable of
>> being set at multiple levels (e.g. server or foreign table). If any
>> FDW involved in the transaction demands distributed 2PC semantics
>> then the whole transaction must have those semantics or it fails. I
>> was previously leaning toward the latter approach, but I guess now
>> the former approach is sounding better. I'm not totally certain I
>> know what's best here.
>
> I agree that the former is better. That way, we can also control that
> parameter at the transaction level. If we allow the 'prefer' behavior
> we need to manage not only 2PC-capable foreign servers but also
> 2PC-non-capable foreign servers, which requires all FDWs to call the
> registration function. So I think a two-valued parameter would be
> better.
>
> BTW, sorry for being late in submitting the updated patch. I'll post
> the updated patch this week, but I'd like to share the new API design
> beforehand.

Attached are the updated patches.

I've changed the new APIs to 5 functions and 1 registration function,
because having the rollback API callable by both the backend process and
the resolver process was not a good design. The latest version of the
patches incorporates all comments I got, except for documentation of the
overall picture for users. I'm still considering what to document there;
I'll write it while the code patch is being reviewed. The basic design
of the new patches is almost the same as in the previous mail I sent.

I introduced 5 new FDW APIs: PrepareForeignTransaction,
CommitForeignTransaction, RollbackForeignTransaction,
ResolveForeignTransaction and IsTwophaseCommitEnabled.
ResolveForeignTransaction is normally called by the resolver process,
whereas the other four functions are called by the backend process. I
also introduced a registration function, FdwXactRegisterForeignTransaction.
An FDW that wishes to support atomic commit is required to call this
function when a transaction opens on the foreign server. Registered
foreign transactions are controlled by the foreign transaction manager
in the Postgres core, which calls the APIs at the appropriate times.
This means the foreign transaction manager controls only foreign
servers that are capable of 2PC; for a 2PC-non-capable foreign server,
the FDW must use a XactCallback to control the foreign transaction.

2PC is used at commit when the distributed transaction has modified
data on two or more servers including the local server, and the user
has requested it via the foreign_twophase_commit GUC parameter. All
foreign transactions are prepared during pre-commit, and then we commit
locally. After committing locally, the backend waits for the resolver
process to resolve all prepared foreign transactions. The waiting
backend is released (that is, returns the prompt to the client) either
when all foreign transactions are resolved or when the user cancels the
wait. If 2PC is not required, a foreign transaction is committed during
the pre-commit phase of the local transaction. IsTwophaseCommitEnabled
is called whenever the transaction begins to modify data on a foreign
server. This is required to track whether the transaction modified data
on a foreign server that doesn't support or enable 2PC.

Atomic commit among multiple foreign servers is crash-safe. If the
coordinator server crashes during atomic commit, the foreign
transaction participants and their statuses are recovered during WAL
apply. Recovered foreign transactions are in-doubt, a.k.a. dangling,
transactions. If the database has such transactions, the resolver
process periodically tries to resolve them.

I'll register this patch to the next CF. Feedback is very welcome.
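The registration-then-callback flow described above can be sketched in C. This is an editor's illustration only: the function names come from the mail, but the signatures, the callback struct, and the manager logic are all invented for the sketch and are not the patch's actual code.

```c
/* Hypothetical sketch of the proposed FDW transaction API.  Names
 * (PrepareForeignTransaction etc.) follow the mail; the signatures
 * and the struct are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

typedef struct FdwXactCallbacks
{
    bool (*prepare)(const char *server);          /* PrepareForeignTransaction */
    bool (*commit)(const char *server);           /* CommitForeignTransaction */
    bool (*rollback)(const char *server);         /* RollbackForeignTransaction */
    bool (*resolve)(const char *server);          /* ResolveForeignTransaction */
    bool (*twophase_enabled)(const char *server); /* IsTwophaseCommitEnabled */
} FdwXactCallbacks;

#define MAX_PARTICIPANTS 8

typedef struct Participant
{
    const char *server;
    const FdwXactCallbacks *cbs;
} Participant;

static Participant participants[MAX_PARTICIPANTS];
static int  nparticipants = 0;

/* FdwXactRegisterForeignTransaction: an FDW calls this when a
 * transaction opens on a foreign server. */
static void
FdwXactRegisterForeignTransaction(const char *server, const FdwXactCallbacks *cbs)
{
    participants[nparticipants].server = server;
    participants[nparticipants].cbs = cbs;
    nparticipants++;
}

/* Pre-commit step of the manager: prepare every registered
 * 2PC-capable participant.  Returns the number prepared. */
static int
PreCommit_PrepareAll(void)
{
    int prepared = 0;

    for (int i = 0; i < nparticipants; i++)
    {
        const Participant *p = &participants[i];

        if (p->cbs->twophase_enabled(p->server) && p->cbs->prepare(p->server))
            prepared++;
    }
    return prepared;
}

/* A dummy FDW whose callbacks always succeed. */
static bool dummy_ok(const char *server) { (void) server; return true; }
static const FdwXactCallbacks dummy_cbs =
    { dummy_ok, dummy_ok, dummy_ok, dummy_ok, dummy_ok };

/* Demo: two foreign servers join the transaction; both get prepared. */
int
run_demo(void)
{
    FdwXactRegisterForeignTransaction("server_b", &dummy_cbs);
    FdwXactRegisterForeignTransaction("server_c", &dummy_cbs);
    return PreCommit_PrepareAll();
}
```

In the real proposal the manager would call these at transaction pre-commit and hand resolution off to the resolver process; the sketch only shows the registration and prepare-all shape.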
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Mon, Jun 11, 2018 at 1:53 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached updated patches.
>
> [description of the new FDW APIs and atomic commit design, quoted in
> full from the previous mail]
>
> I'll register this patch to next CF. Feedback is very welcome.

I attached the updated version patch, as the previous versions conflict
with the current HEAD.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Fri, Aug 03, 2018 at 05:52:24PM +0900, Masahiko Sawada wrote:
> I attached the updated version patch as the previous versions conflict
> with the current HEAD.

Please note that the latest patch set does not apply anymore, so this
patch is moved to the next CF, waiting on author.

--
Michael
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Tue, Oct 2, 2018 at 3:10 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Aug 03, 2018 at 05:52:24PM +0900, Masahiko Sawada wrote:
> > I attached the updated version patch as the previous versions conflict
> > with the current HEAD.
>
> Please note that the latest patch set does not apply anymore, so this
> patch is moved to next CF, waiting on author.

Thank you! Attached the latest version patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: not tested
Spec compliant: not tested
Documentation: tested, failed
(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0")));
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:

> (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0")));
max_prepared_foreign_transactions, max_prepared_transactions
--
Best Regards,
Chris Travers
Head of Database

Saarbrücker Straße 37a, 10405 Berlin
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: not tested
Spec compliant: not tested
Documentation: tested, failed
I am hoping I am not out of order in writing this before the commitfest starts. The patch is big and long, and so I wanted to start on this while traffic is slow.
I find this patch quite welcome and very close to a minimum viable version. The few significant limitations can be resolved later. One thing I may have missed in the documentation is a discussion of the limits of the current approach. I think this would be important to document because the caveats of the current approach are significant, but the people who need it will have the knowledge to work with issues if they come up.
The major caveat I see, from our past discussions and (if I read the patch correctly) from the code, is that the resolver goes through global transactions sequentially and does not move on to the next until the previous one is resolved. This means that if I have a global transaction on server A with foreign servers B and C, and another one on server A with foreign servers C and D, and server B goes down at the wrong moment, it does not look like the background worker will detect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until this is resolved. Also, if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad transaction, allowing processing to continue after manual resolution (this is good and necessary, because in this area there is no ability to have perfect recoverability without occasional administrative action). I would really like to see more documentation of failure cases and appropriate administrative action. Otherwise this is, I think, a minimum viable addition, and I think we want it.
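A toy model of this head-of-line-blocking caveat (an editor's illustration with invented names, not code from the patch): a resolver that processes its queue strictly in order cannot get past an entry whose participant is down, so everything queued behind it waits too.

```c
/* Toy model: a strictly sequential resolver stalls on a down
 * participant, so transactions behind it are never resolved. */
#include <stdbool.h>

typedef struct GlobalXact
{
    int  xid;
    bool participant_up;  /* is every foreign participant reachable? */
    bool resolved;
} GlobalXact;

/* Returns how many transactions were resolved before the resolver
 * got stuck on an unreachable participant. */
static int
sequential_resolver(GlobalXact *queue, int n)
{
    int done = 0;

    for (int i = 0; i < n; i++)
    {
        if (!queue[i].participant_up)
            break;              /* server down: everything behind waits */
        queue[i].resolved = true;
        done++;
    }
    return done;
}

/* Demo: xid 1 involves server B (down); xid 2 involves only servers
 * that are up, yet it is never resolved. */
int
demo_stuck(void)
{
    GlobalXact queue[2] = {
        {1, false, false},      /* server B down at the wrong moment */
        {2, true, false}        /* healthy, but stuck behind xid 1 */
    };

    return sequential_resolver(queue, 2);
}
```

The later mails in this thread discuss mitigations: multiple resolvers (one per database) and, eventually, a separate retry queue so a failing transaction is parked rather than blocking the queue.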
It is possible I missed that in the documentation. If so, my objection stands aside. If it is welcome, I am happy to take a first crack at such docs.
To my mind that's the only blocker in the code (but see below). I can say without a doubt that I would expect we would use this feature once available.
------------------
Testing however failed.
make installcheck-world fails with errors like the following:
-- Modify foreign server and raise an error
BEGIN;
INSERT INTO ft7_twophase VALUES(8);
+ ERROR: prepread foreign transactions are disabled
+ HINT: Set max_prepared_foreign_transactions to a nonzero value.
INSERT INTO ft8_twophase VALUES(NULL); -- violation
! ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK;
SELECT * FROM ft7_twophase;
! ERROR: prepread foreign transactions are disabled
! HINT: Set max_prepared_foreign_transactions to a nonzero value.
SELECT * FROM ft8_twophase;
! ERROR: prepread foreign transactions are disabled
! HINT: Set max_prepared_foreign_transactions to a nonzero value.
-- Rollback foreign transaction that involves both 2PC-capable
-- and 2PC-non-capable foreign servers.
BEGIN;
INSERT INTO ft8_twophase VALUES(7);
+ ERROR: prepread foreign transactions are disabled
+ HINT: Set max_prepared_foreign_transactions to a nonzero value.
INSERT INTO ft9_not_twophase VALUES(7);
+ ERROR: current transaction is aborted, commands ignored until end of transaction block
ROLLBACK;
SELECT * FROM ft8_twophase;
! ERROR: prepread foreign transactions are disabled
! HINT: Set max_prepared_foreign_transactions to a nonzero value.
make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new installation. I think the test cases will have to handle that sort of setup.
make check in the contrib directory passes.
For reasons of test failures, I am setting this back to waiting on author.
------------------
I had a few other thoughts that I figure are worth sharing with the community on this patch, with the idea that once it is in place, this may open up more options for collaboration in the area of federated and distributed storage generally. I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API).
The new status of this patch is: Waiting on Author
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 3, 2018 at 6:02 PM Chris Travers <chris.travers@adjust.com> wrote:
>
> On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote:
>> The major caveat I see in our past discussions and (if I read the
>> patch correctly) is that the resolver goes through global transactions
>> sequentially and does not move on to the next until the previous one
>> is resolved.

Thank you for reviewing the patch!

> After further testing I am pretty sure I misread the patch. It looks
> like one can have multiple resolvers which can, in fact, work through
> a queue together, solving this problem. So the objection above is not
> valid and I withdraw that objection. I will re-review the docs in
> light of the experience.

Actually the patch doesn't solve this problem; the foreign transaction
resolver processes distributed transactions sequentially. But since one
resolver process is responsible for one database, a backend connecting
to another database can complete its distributed transaction. I
understand your concern and agree that we should solve this problem.
I'll address it in the next patch.

>> make installcheck in the contrib directory shows the same, so that's
>> the easiest way of reproducing, at least on a new installation. I
>> think the test cases will have to handle that sort of setup.

'make installcheck' is a regression test mode that runs the tests
against an existing installation. If that installation disables the
atomic commit feature (e.g. max_prepared_foreign_transactions etc.),
the tests will fail, because the feature is disabled by default.

>> make check in the contrib directory passes.
>>
>> For reasons of test failures, I am setting this back to waiting on
>> author.

Also, I'll update the doc in the next patch that I'll post this week.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Oct 3, 2018 at 6:02 PM Chris Travers <chris.travers@adjust.com> wrote: > > > > > > > > On Wed, Oct 3, 2018 at 9:41 AM Chris Travers <chris.travers@gmail.com> wrote: > >> > >> The following review has been posted through the commitfest application: > >> make installcheck-world: tested, failed > >> Implements feature: not tested > >> Spec compliant: not tested > >> Documentation: tested, failed > >> > >> I am hoping I am not out of order in writing this before the commitfest starts. The patch is big and long and so wantedto start on this while traffic is slow. > >> > >> I find this patch quite welcome and very close to a minimum viable version. The few significant limitations can beresolved later. One thing I may have missed in the documentation is a discussion of the limits of the current approach. I think this would be important to document because the caveats of the current approach are significant, but thepeople who need it will have the knowledge to work with issues if they come up. > >> > >> The major caveat I see in our past discussions and (if I read the patch correctly) is that the resolver goes throughglobal transactions sequentially and does not move on to the next until the previous one is resolved. This meansthat if I have a global transaction on server A, with foreign servers B and C, and I have another one on server A withforeign servers C and D, if server B goes down at the wrong moment, the background worker does not look like it willdetect the failure and move on to try to resolve the second, so server D will have a badly set vacuum horizon until thisis resolved. Also if I read the patch correctly, it looks like one can invoke SQL commands to remove the bad transactionto allow processing to continue and manual resolution (this is good and necessary because in this area there isno ability to have perfect recoverability without occasional administrative action). 
I would really like to see more documentationof failure cases and appropriate administrative action at present. Otherwise this is I think a minimum viableaddition and I think we want it. > >> > >> It is possible i missed that in the documentation. If so, my objection stands aside. If it is welcome I am happy totake a first crack at such docs. > > > > Thank you for reviewing the patch! > > > > > After further testing I am pretty sure I misread the patch. It looks like one can have multiple resolvers which can,in fact, work through a queue together solving this problem. So the objection above is not valid and I withdraw thatobjection. I will re-review the docs in light of the experience. > > Actually the patch doesn't solve this problem; the foreign transaction > resolver processes distributed transactions sequentially. But since > one resolver process is responsible for one database the backend > connecting to another database can complete the distributed > transaction. I understood the your concern and agreed to solve this > problem. I'll address it in the next patch. > > > > >> > >> > >> To my mind thats the only blocker in the code (but see below). I can say without a doubt that I would expect we woulduse this feature once available. > >> > >> ------------------ > >> > >> Testing however failed. > >> > >> make installcheck-world fails with errors like the following: > >> > >> -- Modify foreign server and raise an error > >> BEGIN; > >> INSERT INTO ft7_twophase VALUES(8); > >> + ERROR: prepread foreign transactions are disabled > >> + HINT: Set max_prepared_foreign_transactions to a nonzero value. > >> INSERT INTO ft8_twophase VALUES(NULL); -- violation > >> ! ERROR: current transaction is aborted, commands ignored until end of transaction block > >> ROLLBACK; > >> SELECT * FROM ft7_twophase; > >> ! ERROR: prepread foreign transactions are disabled > >> ! HINT: Set max_prepared_foreign_transactions to a nonzero value. > >> SELECT * FROM ft8_twophase; > >> ! 
ERROR: prepread foreign transactions are disabled > >> ! HINT: Set max_prepared_foreign_transactions to a nonzero value. > >> -- Rollback foreign transaction that involves both 2PC-capable > >> -- and 2PC-non-capable foreign servers. > >> BEGIN; > >> INSERT INTO ft8_twophase VALUES(7); > >> + ERROR: prepread foreign transactions are disabled > >> + HINT: Set max_prepared_foreign_transactions to a nonzero value. > >> INSERT INTO ft9_not_twophase VALUES(7); > >> + ERROR: current transaction is aborted, commands ignored until end of transaction block > >> ROLLBACK; > >> SELECT * FROM ft8_twophase; > >> ! ERROR: prepread foreign transactions are disabled > >> ! HINT: Set max_prepared_foreign_transactions to a nonzero value. > >> > >> make installcheck in the contrib directory shows the same, so that's the easiest way of reproducing, at least on a new installation. I think the test cases will have to handle that sort of setup. > > The 'make installcheck' is a regression test mode that runs the tests against an existing installation. If the installation disables the atomic commit feature (e.g. max_prepared_foreign_transactions etc.) the test will fail because the feature is disabled by default. > > >> > >> make check in the contrib directory passes. > >> > >> For reasons of test failures, I am setting this back to waiting on author. > >> > >> ------------------ > >> I had a few other thoughts that I figure are worth sharing with the community on this patch with the idea that once it is in place, this may open up more options for collaboration in the area of federated and distributed storage generally. I could imagine other foreign data wrappers using this API, and folks might want to refactor out the atomic handling part so that extensions that do not use the foreign data wrapper structure could use it as well (while this looks like a classic SQL/MED issue, I am not sure that only foreign data wrappers would be interested in the API). 
> >> > >> The new status of this patch is: Waiting on Author > > Also, I'll update the doc in the next patch that I'll post on this week. > Attached is the updated version of the patches. What I changed from the previous version: * Enabled processing of subsequent distributed transactions even when a previous distributed transaction continues to fail due to a participant's error. To implement this, I've split the waiting queue into two queues: the active queue and the retry queue. Each backend first inserts itself into the active queue and changes its state to FDW_XACT_WAITING. If the resolver process fails to resolve a distributed transaction, it moves the backend entry from the active queue to the retry queue and changes its state to FDW_XACT_WAITING_RETRY. The backend entries in the active queue are processed at each commit, whereas entries in the retry queue are processed at intervals of foreign_transaction_resolution_retry_interval. * Updated docs, added the new section "Distributed Transaction" at Chapter 33 to explain the concept to users * Moved the atomic commit code into the src/backend/access/fdwxact directory. * Some bug fixes. Please review them. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
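[Editorial sketch] The two-queue scheme described above can be modeled in a few lines of self-contained C. The type and function names (QueueEntry, process_active_queue, demo_resolve) are illustrative assumptions, not the patch's actual code; the point is only that a failed entry is parked on the retry queue so later transactions keep getting resolved:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the patch's entry states
 * (FDW_XACT_WAITING / FDW_XACT_WAITING_RETRY). */
typedef enum
{
    FDW_XACT_WAITING,
    FDW_XACT_WAITING_RETRY,
    FDW_XACT_RESOLVED
} FdwXactStatus;

typedef struct
{
    int           backend_id;
    FdwXactStatus status;
} QueueEntry;

/* Stand-in resolver: pretend the server used by backend 2 is down. */
static bool
demo_resolve(int backend_id)
{
    return backend_id != 2;
}

/*
 * Walk the active queue once.  A failed entry is parked on the retry
 * queue (to be retried at foreign_transaction_resolution_retry_interval)
 * instead of blocking every later transaction, which is the point of
 * splitting the queue in two.  Returns the number of entries resolved.
 */
static size_t
process_active_queue(QueueEntry *active, size_t n_active,
                     QueueEntry *retry, size_t *n_retry,
                     bool (*resolve)(int))
{
    size_t resolved = 0;

    for (size_t i = 0; i < n_active; i++)
    {
        if (resolve(active[i].backend_id))
        {
            active[i].status = FDW_XACT_RESOLVED;
            resolved++;
        }
        else
        {
            /* Park the failure; keep going with the rest. */
            active[i].status = FDW_XACT_WAITING_RETRY;
            retry[(*n_retry)++] = active[i];
        }
    }
    return resolved;
}
```

With backends 1..3 queued and backend 2's server unreachable, one pass resolves 1 and 3 and moves only entry 2 to the retry queue.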
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
Hello. # It took a long time to come here.. At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: ... > * Updated docs, added the new section "Distributed Transaction" at > Chapter 33 to explain the concept to users > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > * Some bug fixes. > > Please reivew them. I have some comments, with apologies in advance for possible duplicates of or conflicts with others' comments so far. 0001: This sets XACT_FLAG_WROTENONTEMPREL when a RELPERSISTENT_PERMANENT relation is modified. Isn't it also needed when UNLOGGED tables are modified? It may be better to have a dedicated classification macro or function. The flag is handled in heapam.c. I suppose that it should be done in the upper layer considering the coming pluggable storage. (X_F_ACCESSEDTEMPREL is set in heapam, but..) 0002: The name FdwXactParticipantsForAC doesn't sound good to me. How about FdwXactAtomicCommitParticipants? Well, as the file comment of fdwxact.c says, FdwXactRegisterTransaction is called from the FDW driver and F_X_MarkForeignTransactionModified is called from the executor. I think that we should clarify who is responsible for the whole sequence. Since the state of local tables affects it, I suppose the executor is. Couldn't we do the whole thing on the executor side? I'm not sure, but I feel that F_X_RegisterForeignTransaction can be a part of F_X_MarkForeignTransactionModified. The callers of MarkForeignTransactionModified can find whether the table is involved in 2PC via the IsTwoPhaseCommitEnabled interface. 
> if (foreign_twophase_commit == true && > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) ) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot COMMIT a distributed transaction that has operated on foreign server that doesn't support atomic commit"))); The error is emitted when the GUC is turned off in the transaction where MarkTransactionModified was called. I think that the number of the variables' possible states should be reduced for simplicity. For example, in this case, once foreign_twophase_commit is checked in a transaction, subsequent changes to it should be ignored for the rest of the transaction. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
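[Editorial sketch] The state-reduction idea above — consult the GUC once and ignore later changes until transaction end — is essentially latching. A minimal model, with all names invented for illustration (the patch does not necessarily contain these functions):

```c
#include <assert.h>
#include <stdbool.h>

/* The live GUC (PGC_USERSET, so it can change mid-session via SET). */
static bool guc_foreign_twophase_commit = false;

/* Value latched at first use within the current transaction. */
static bool latched_value;
static bool latched = false;

/* The first call in a transaction reads the GUC; later calls return
 * the latched value, so a mid-transaction SET cannot flip the outcome
 * and the commit path sees only one consistent state. */
static bool
ForeignTwophaseCommitThisXact(void)
{
    if (!latched)
    {
        latched_value = guc_foreign_twophase_commit;
        latched = true;
    }
    return latched_value;
}

/* Called at transaction end so the next transaction re-reads the GUC. */
static void
AtEOXact_ResetTwophaseLatch(void)
{
    latched = false;
}
```

This keeps Sawada-san's later point intact too: the value is still (re-)read at most once per transaction, which is compatible with a PGC_USERSET parameter.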
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > Hello. > > # It took a long time to come here.. > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > ... > > * Updated docs, added the new section "Distributed Transaction" at > > Chapter 33 to explain the concept to users > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > > > * Some bug fixes. > > > > Please reivew them. > > I have some comments, with apologize in advance for possible > duplicate or conflict with others' comments so far. Thank you so much for reviewing this patch! > > 0001: > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT > relation is modified. Isn't it needed when UNLOGGED tables are > modified? It may be better that we have dedicated classification > macro or function. I think even if we do atomic commit for modifying an UNLOGGED table and a remote table, the data will get inconsistent if the local server crashes. For example, if the local server crashes after preparing the transaction on the foreign server but before the local commit, we will lose all the data of the local UNLOGGED table whereas the modification of the remote table is rolled back. In the case of persistent tables, data consistency is preserved. So I think keeping data consistency between remote data and a local unlogged table is difficult, and I want to leave it as a restriction for now. Am I missing something? > > The flag is handled in heapam.c. I suppose that it should be done > in the upper layer considering coming pluggable storage. > (X_F_ACCESSEDTEMPREL is set in heapam, but..) > Yeah, or we can set the flag after heap_insert in ExecInsert. > > > 0002: > > The name FdwXactParticipantsForAC doesn't sound good for me. 
How > about FdwXactAtomicCommitPartitcipants? +1, will fix it. > > Well, as the file comment of fdwxact.c, > FdwXactRegisterTransaction is called from FDW driver and > F_X_MarkForeignTransactionModified is called from executor. I > think that we should clarify who is responsible to the whole > sequence. Since the state of local tables affects, I suppose > executor is that. Couldn't we do the whole thing within executor > side? I'm not sure but I feel that > F_X_RegisterForeignTransaction can be a part of > F_X_MarkForeignTransactionModified. The callers of > MarkForeignTransactionModified can find whether the table is > involved in 2pc by IsTwoPhaseCommitEnabled interface. Indeed. We can register foreign servers in the executor while FDWs don't need to register anything. I will remove the registration function so that FDW developers don't need to call the register function but only need to provide the atomic commit APIs. > > > > if (foreign_twophase_commit == true && > > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) ) > > ereport(ERROR, > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("cannot COMMIT a distributed transaction that has operated on foreign server that doesn't support atomic commit"))); > > The error is emitted when a the GUC is turned off in the > trasaction where MarkTransactionModify'ed. I think that the > number of the variables' possible states should be reduced for > simplicity. For example in the case, once foreign_twopase_commit > is checked in a transaction, subsequent changes in the > transaction should be ignored during the transaction. > I might not have gotten your comment correctly, but since foreign_twophase_commit is a PGC_USERSET parameter I think we need to check it at commit time. Also, we need to keep participant servers even when foreign_twophase_commit is off if both max_prepared_foreign_xacts and max_foreign_xact_resolvers are > 0. I will post the updated patch this week. 
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > > Hello. > > > > # It took a long time to come here.. > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > ... > > > * Updated docs, added the new section "Distributed Transaction" at > > > Chapter 33 to explain the concept to users > > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > > > > > * Some bug fixes. > > > > > > Please reivew them. > > > > I have some comments, with apologize in advance for possible > > duplicate or conflict with others' comments so far. > > Thank youf so much for reviewing this patch! > > > > > 0001: > > > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT > > relation is modified. Isn't it needed when UNLOGGED tables are > > modified? It may be better that we have dedicated classification > > macro or function. > > I think even if we do atomic commit for modifying the an UNLOGGED > table and a remote table the data will get inconsistent if the local > server crashes. For example, if the local server crashes after > prepared the transaction on foreign server but before the local commit > and, we will lose the all data of the local UNLOGGED table whereas the > modification of remote table is rollbacked. In case of persistent > tables, the data consistency is left. So I think the keeping data > consistency between remote data and local unlogged table is difficult > and want to leave it as a restriction for now. Am I missing something? > > > > > The flag is handled in heapam.c. I suppose that it should be done > > in the upper layer considering coming pluggable storage. 
> > (X_F_ACCESSEDTEMPREL is set in heapam, but..) > > > > Yeah, or we can set the flag after heap_insert in ExecInsert. > > > > > 0002: > > > > The name FdwXactParticipantsForAC doesn't sound good for me. How > > about FdwXactAtomicCommitPartitcipants? > > +1, will fix it. > > > > > Well, as the file comment of fdwxact.c, > > FdwXactRegisterTransaction is called from FDW driver and > > F_X_MarkForeignTransactionModified is called from executor. I > > think that we should clarify who is responsible to the whole > > sequence. Since the state of local tables affects, I suppose > > executor is that. Couldn't we do the whole thing within executor > > side? I'm not sure but I feel that > > F_X_RegisterForeignTransaction can be a part of > > F_X_MarkForeignTransactionModified. The callers of > > MarkForeignTransactionModified can find whether the table is > > involved in 2pc by IsTwoPhaseCommitEnabled interface. > > Indeed. We can register foreign servers by executor while FDWs don't > need to register anything. I will remove the registration function so > that FDW developers don't need to call the register function but only > need to provide atomic commit APIs. > > > > > > > > if (foreign_twophase_commit == true && > > > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) ) > > > ereport(ERROR, > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > > errmsg("cannot COMMIT a distributed transaction that has operated on foreign serverthat doesn't support atomic commit"))); > > > > The error is emitted when a the GUC is turned off in the > > trasaction where MarkTransactionModify'ed. I think that the > > number of the variables' possible states should be reduced for > > simplicity. For example in the case, once foreign_twopase_commit > > is checked in a transaction, subsequent changes in the > > transaction should be ignored during the transaction. 
> > > > I might have not gotten your comment correctly but since the > foreign_twophase_commit is a PGC_USERSET parameter I think we need to > check it at commit time. Also we need to keep participant servers even > when foreign_twophase_commit is off if both max_prepared_foreign_xacts > and max_foreign_xact_resolvers are > 0. > > I will post the updated patch in this week. > Attached is the updated version of the patches. Based on the review comment from Horiguchi-san, I've changed the atomic commit API so that FDW developers who wish to support atomic commit don't need to call the register function. The atomic commit APIs are the following: * GetPrepareId * PrepareForeignTransaction * CommitForeignTransaction * RollbackForeignTransaction * ResolveForeignTransaction * IsTwophaseCommitEnabled All APIs except for GetPrepareId are required for atomic commit. Also, I've changed the foreign_twophase_commit parameter to an enum parameter based on the suggestion from Robert[1]. Valid values are 'required', 'prefer' and 'disabled' (default). When set to either 'required' or 'prefer', atomic commit will be used. The difference between 'required' and 'prefer' is that when set to 'required' we require *all* modified servers to be able to use 2PC, whereas with 'prefer' we use 2PC where available. So with 'required', if any of the written participants disables 2PC or doesn't support the atomic commit API, the transaction fails. IOW, with 'required' we can commit only when data consistency among all participants can be preserved. Please review the patches. [1] https://www.postgresql.org/message-id/CA%2BTgmob4EqxbaMp0e--jUKYT44RL4xBXkPMxF9EEAD%2ByBGAdxw%40mail.gmail.com Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
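[Editorial sketch] The 'required' / 'prefer' / 'disabled' semantics described above reduce to a small per-participant decision function. This is a sketch of the described behavior with made-up names, not the patch's implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the proposed enum GUC's values. */
typedef enum { FTC_DISABLED, FTC_PREFER, FTC_REQUIRED } ForeignTwophaseCommitLevel;

typedef enum { COMMIT_ONEPHASE, COMMIT_TWOPHASE, COMMIT_ERROR } CommitChoice;

/*
 * How one modified participant would be committed under each setting:
 *   disabled - never prepare; always a plain one-phase commit;
 *   prefer   - prepare where the server supports 2PC, else plain commit;
 *   required - every modified server must support 2PC, else error out,
 *              since only then can atomicity be preserved.
 */
static CommitChoice
choose_commit_protocol(ForeignTwophaseCommitLevel level, bool supports_2pc)
{
    switch (level)
    {
        case FTC_DISABLED:
            return COMMIT_ONEPHASE;
        case FTC_PREFER:
            return supports_2pc ? COMMIT_TWOPHASE : COMMIT_ONEPHASE;
        case FTC_REQUIRED:
            return supports_2pc ? COMMIT_TWOPHASE : COMMIT_ERROR;
    }
    return COMMIT_ERROR;        /* not reached */
}
```

So 'prefer' degrades gracefully on a 2PC-incapable participant, while 'required' turns the same situation into a commit-time error.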
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > > > > Hello. > > > > > > # It took a long time to come here.. > > > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > ... > > > > * Updated docs, added the new section "Distributed Transaction" at > > > > Chapter 33 to explain the concept to users > > > > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > > > > > > > * Some bug fixes. > > > > > > > > Please reivew them. > > > > > > I have some comments, with apologize in advance for possible > > > duplicate or conflict with others' comments so far. > > > > Thank youf so much for reviewing this patch! > > > > > > > > 0001: > > > > > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT > > > relation is modified. Isn't it needed when UNLOGGED tables are > > > modified? It may be better that we have dedicated classification > > > macro or function. > > > > I think even if we do atomic commit for modifying the an UNLOGGED > > table and a remote table the data will get inconsistent if the local > > server crashes. For example, if the local server crashes after > > prepared the transaction on foreign server but before the local commit > > and, we will lose the all data of the local UNLOGGED table whereas the > > modification of remote table is rollbacked. In case of persistent > > tables, the data consistency is left. So I think the keeping data > > consistency between remote data and local unlogged table is difficult > > and want to leave it as a restriction for now. Am I missing something? 
> > > > > > > > The flag is handled in heapam.c. I suppose that it should be done > > > in the upper layer considering coming pluggable storage. > > > (X_F_ACCESSEDTEMPREL is set in heapam, but..) > > > > > > > Yeah, or we can set the flag after heap_insert in ExecInsert. > > > > > > > > 0002: > > > > > > The name FdwXactParticipantsForAC doesn't sound good for me. How > > > about FdwXactAtomicCommitPartitcipants? > > > > +1, will fix it. > > > > > > > > Well, as the file comment of fdwxact.c, > > > FdwXactRegisterTransaction is called from FDW driver and > > > F_X_MarkForeignTransactionModified is called from executor. I > > > think that we should clarify who is responsible to the whole > > > sequence. Since the state of local tables affects, I suppose > > > executor is that. Couldn't we do the whole thing within executor > > > side? I'm not sure but I feel that > > > F_X_RegisterForeignTransaction can be a part of > > > F_X_MarkForeignTransactionModified. The callers of > > > MarkForeignTransactionModified can find whether the table is > > > involved in 2pc by IsTwoPhaseCommitEnabled interface. > > > > Indeed. We can register foreign servers by executor while FDWs don't > > need to register anything. I will remove the registration function so > > that FDW developers don't need to call the register function but only > > need to provide atomic commit APIs. > > > > > > > > > > > > if (foreign_twophase_commit == true && > > > > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) ) > > > > ereport(ERROR, > > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > > > errmsg("cannot COMMIT a distributed transaction that has operated on foreign serverthat doesn't support atomic commit"))); > > > > > > The error is emitted when a the GUC is turned off in the > > > trasaction where MarkTransactionModify'ed. I think that the > > > number of the variables' possible states should be reduced for > > > simplicity. 
For example in the case, once foreign_twopase_commit > > > is checked in a transaction, subsequent changes in the > > > transaction should be ignored during the transaction. > > > > > > > I might have not gotten your comment correctly but since the > > foreign_twophase_commit is a PGC_USERSET parameter I think we need to > > check it at commit time. Also we need to keep participant servers even > > when foreign_twophase_commit is off if both max_prepared_foreign_xacts > > and max_foreign_xact_resolvers are > 0. > > > > I will post the updated patch in this week. > > > > Attached the updated version patches. > > Based on the review comment from Horiguchi-san, I've changed the > atomic commit API so that the FDW developer who wish to support atomic > commit don't need to call the register function. The atomic commit > APIs are following: > > * GetPrepareId > * PrepareForeignTransaction > * CommitForeignTransaction > * RollbackForeignTransaction > * ResolveForeignTransaction > * IsTwophaseCommitEnabled > > The all APIs except for GetPreapreId is required for atomic commit. > > Also, I've changed the foreign_twophase_commit parameter to an enum > parameter based on the suggestion from Robert[1]. Valid values are > 'required', 'prefer' and 'disabled' (default). When set to either > 'required' or 'prefer' the atomic commit will be used. The difference > between 'required' and 'prefer' is that when set to 'requried' we > require for *all* modified server to be able to use 2pc whereas when > 'prefer' we require 2pc where available. So if any of written > participants disables 2pc or doesn't support atomic comit API the > transaction fails. IOW, when 'required' we can commit only when data > consistency among all participant can be left. > > Please review the patches. > Since the previous patch conflicts with current HEAD attached updated set of patches. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Mon, Oct 29, 2018 at 6:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI > > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > > > > > > Hello. > > > > > > > > # It took a long time to come here.. > > > > > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > > > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > ... > > > > > * Updated docs, added the new section "Distributed Transaction" at > > > > > Chapter 33 to explain the concept to users > > > > > > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > > > > > > > > > * Some bug fixes. > > > > > > > > > > Please reivew them. > > > > > > > > I have some comments, with apologize in advance for possible > > > > duplicate or conflict with others' comments so far. > > > > > > Thank youf so much for reviewing this patch! > > > > > > > > > > > 0001: > > > > > > > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT > > > > relation is modified. Isn't it needed when UNLOGGED tables are > > > > modified? It may be better that we have dedicated classification > > > > macro or function. > > > > > > I think even if we do atomic commit for modifying the an UNLOGGED > > > table and a remote table the data will get inconsistent if the local > > > server crashes. For example, if the local server crashes after > > > prepared the transaction on foreign server but before the local commit > > > and, we will lose the all data of the local UNLOGGED table whereas the > > > modification of remote table is rollbacked. In case of persistent > > > tables, the data consistency is left. 
So I think the keeping data > > > consistency between remote data and local unlogged table is difficult > > > and want to leave it as a restriction for now. Am I missing something? > > > > > > > > > > > The flag is handled in heapam.c. I suppose that it should be done > > > > in the upper layer considering coming pluggable storage. > > > > (X_F_ACCESSEDTEMPREL is set in heapam, but..) > > > > > > > > > > Yeah, or we can set the flag after heap_insert in ExecInsert. > > > > > > > > > > > 0002: > > > > > > > > The name FdwXactParticipantsForAC doesn't sound good for me. How > > > > about FdwXactAtomicCommitPartitcipants? > > > > > > +1, will fix it. > > > > > > > > > > > Well, as the file comment of fdwxact.c, > > > > FdwXactRegisterTransaction is called from FDW driver and > > > > F_X_MarkForeignTransactionModified is called from executor. I > > > > think that we should clarify who is responsible to the whole > > > > sequence. Since the state of local tables affects, I suppose > > > > executor is that. Couldn't we do the whole thing within executor > > > > side? I'm not sure but I feel that > > > > F_X_RegisterForeignTransaction can be a part of > > > > F_X_MarkForeignTransactionModified. The callers of > > > > MarkForeignTransactionModified can find whether the table is > > > > involved in 2pc by IsTwoPhaseCommitEnabled interface. > > > > > > Indeed. We can register foreign servers by executor while FDWs don't > > > need to register anything. I will remove the registration function so > > > that FDW developers don't need to call the register function but only > > > need to provide atomic commit APIs. 
> > > > > > > > > > > > > > > > if (foreign_twophase_commit == true && > > > > > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) ) > > > > > ereport(ERROR, > > > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > > > > errmsg("cannot COMMIT a distributed transaction that has operated on foreign serverthat doesn't support atomic commit"))); > > > > > > > > The error is emitted when a the GUC is turned off in the > > > > trasaction where MarkTransactionModify'ed. I think that the > > > > number of the variables' possible states should be reduced for > > > > simplicity. For example in the case, once foreign_twopase_commit > > > > is checked in a transaction, subsequent changes in the > > > > transaction should be ignored during the transaction. > > > > > > > > > > I might have not gotten your comment correctly but since the > > > foreign_twophase_commit is a PGC_USERSET parameter I think we need to > > > check it at commit time. Also we need to keep participant servers even > > > when foreign_twophase_commit is off if both max_prepared_foreign_xacts > > > and max_foreign_xact_resolvers are > 0. > > > > > > I will post the updated patch in this week. > > > > > > > Attached the updated version patches. > > > > Based on the review comment from Horiguchi-san, I've changed the > > atomic commit API so that the FDW developer who wish to support atomic > > commit don't need to call the register function. The atomic commit > > APIs are following: > > > > * GetPrepareId > > * PrepareForeignTransaction > > * CommitForeignTransaction > > * RollbackForeignTransaction > > * ResolveForeignTransaction > > * IsTwophaseCommitEnabled > > > > The all APIs except for GetPreapreId is required for atomic commit. > > > > Also, I've changed the foreign_twophase_commit parameter to an enum > > parameter based on the suggestion from Robert[1]. Valid values are > > 'required', 'prefer' and 'disabled' (default). When set to either > > 'required' or 'prefer' the atomic commit will be used. 
The difference > > between 'required' and 'prefer' is that when set to 'requried' we > > require for *all* modified server to be able to use 2pc whereas when > > 'prefer' we require 2pc where available. So if any of written > > participants disables 2pc or doesn't support atomic comit API the > > transaction fails. IOW, when 'required' we can commit only when data > > consistency among all participant can be left. > > > > Please review the patches. > > > > Since the previous patch conflicts with current HEAD attached updated > set of patches. > Rebased and fixed a few bugs. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreign servers, take 2
On Thu, Nov 15, 2018 at 7:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Oct 29, 2018 at 6:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI > > > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > > > > > > > > > Hello. > > > > > > > > > > # It took a long time to come here.. > > > > > > > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com> > > > > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > ... > > > > > > * Updated docs, added the new section "Distributed Transaction" at > > > > > > Chapter 33 to explain the concept to users > > > > > > > > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory. > > > > > > > > > > > > * Some bug fixes. > > > > > > > > > > > > Please reivew them. > > > > > > > > > > I have some comments, with apologize in advance for possible > > > > > duplicate or conflict with others' comments so far. > > > > > > > > Thank youf so much for reviewing this patch! > > > > > > > > > > > > > > 0001: > > > > > > > > > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT > > > > > relation is modified. Isn't it needed when UNLOGGED tables are > > > > > modified? It may be better that we have dedicated classification > > > > > macro or function. > > > > > > > > I think even if we do atomic commit for modifying the an UNLOGGED > > > > table and a remote table the data will get inconsistent if the local > > > > server crashes. 
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Thu, Nov 15, 2018 at 7:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 29, 2018 at 6:03 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 29, 2018 at 10:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Oct 24, 2018 at 9:06 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 23, 2018 at 12:54 PM Kyotaro HORIGUCHI
> > > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > >
> > > > > Hello.
> > > > >
> > > > > # It took a long time to come here..
> > > > >
> > > > > At Fri, 19 Oct 2018 21:38:35 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in <CAD21AoCBf-AJup-_ARfpqR42gJQ_XjNsvv-XE0rCOCLEkT=HCg@mail.gmail.com>
> > > > > > On Wed, Oct 10, 2018 at 1:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > ...
> > > > > > * Updated docs, added the new section "Distributed Transaction" at
> > > > > > Chapter 33 to explain the concept to users
> > > > > >
> > > > > > * Moved atomic commit codes into src/backend/access/fdwxact directory.
> > > > > >
> > > > > > * Some bug fixes.
> > > > > >
> > > > > > Please review them.
> > > > >
> > > > > I have some comments, with apologize in advance for possible
> > > > > duplicate or conflict with others' comments so far.
> > > >
> > > > Thank you so much for reviewing this patch!
> > > >
> > > > >
> > > > > 0001:
> > > > >
> > > > > This sets XACT_FLAG_WROTENONTEMPREL when RELPERSISTENT_PERMANENT
> > > > > relation is modified. Isn't it needed when UNLOGGED tables are
> > > > > modified? It may be better that we have dedicated classification
> > > > > macro or function.
> > > >
> > > > I think even if we do atomic commit for modifying an UNLOGGED
> > > > table and a remote table, the data will get inconsistent if the local
> > > > server crashes. For example, if the local server crashes after
> > > > preparing the transaction on the foreign server but before the local
> > > > commit, we will lose all data of the local UNLOGGED table whereas the
> > > > modification of the remote table is rolled back. In case of persistent
> > > > tables, data consistency is preserved. So I think keeping data
> > > > consistency between remote data and a local unlogged table is difficult
> > > > and want to leave it as a restriction for now. Am I missing something?
> > > >
> > > > >
> > > > > The flag is handled in heapam.c. I suppose that it should be done
> > > > > in the upper layer considering coming pluggable storage.
> > > > > (X_F_ACCESSEDTEMPREL is set in heapam, but..)
> > > > >
> > > >
> > > > Yeah, or we can set the flag after heap_insert in ExecInsert.
> > > >
> > > > >
> > > > > 0002:
> > > > >
> > > > > The name FdwXactParticipantsForAC doesn't sound good to me. How
> > > > > about FdwXactAtomicCommitParticipants?
> > > >
> > > > +1, will fix it.
> > > >
> > > > >
> > > > > Well, as the file comment of fdwxact.c says,
> > > > > FdwXactRegisterTransaction is called from the FDW driver and
> > > > > F_X_MarkForeignTransactionModified is called from the executor. I
> > > > > think that we should clarify who is responsible for the whole
> > > > > sequence. Since the state of local tables matters, I suppose the
> > > > > executor is. Couldn't we do the whole thing within the executor
> > > > > side? I'm not sure, but I feel that
> > > > > F_X_RegisterForeignTransaction can be a part of
> > > > > F_X_MarkForeignTransactionModified. The callers of
> > > > > MarkForeignTransactionModified can find whether the table is
> > > > > involved in 2PC via the IsTwoPhaseCommitEnabled interface.
> > > >
> > > > Indeed. We can register foreign servers by executor while FDWs don't
> > > > need to register anything. I will remove the registration function so
> > > > that FDW developers don't need to call the register function but only
> > > > need to provide atomic commit APIs.
> > > >
> > > > >
> > > > >
> > > > > > if (foreign_twophase_commit == true &&
> > > > > > ((MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) )
> > > > > > ereport(ERROR,
> > > > > > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > > > > > errmsg("cannot COMMIT a distributed transaction that has operated on foreign server that doesn't support atomic commit")));
> > > > >
> > > > > The error is emitted when the GUC is turned off in a
> > > > > transaction where MarkTransactionModified has been called. I think
> > > > > that the number of the variables' possible states should be reduced
> > > > > for simplicity. For example, in this case, once foreign_twophase_commit
> > > > > is checked in a transaction, subsequent changes to it
> > > > > should be ignored for the rest of the transaction.
> > > > >
> > > >
> > > > I might not have gotten your comment correctly, but since
> > > > foreign_twophase_commit is a PGC_USERSET parameter, I think we need to
> > > > check it at commit time. Also, we need to keep participant servers even
> > > > when foreign_twophase_commit is off, if both max_prepared_foreign_xacts
> > > > and max_foreign_xact_resolvers are > 0.
> > > >
> > > > I will post the updated patch this week.
> > > >
> > >
> > > Attached the updated version patches.
> > >
> > > Based on the review comment from Horiguchi-san, I've changed the
> > > atomic commit API so that FDW developers who wish to support atomic
> > > commit don't need to call the register function. The atomic commit
> > > APIs are as follows:
> > >
> > > * GetPrepareId
> > > * PrepareForeignTransaction
> > > * CommitForeignTransaction
> > > * RollbackForeignTransaction
> > > * ResolveForeignTransaction
> > > * IsTwophaseCommitEnabled
> > >
> > > All of the APIs except for GetPrepareId are required for atomic commit.
> > >
> > > Also, I've changed the foreign_twophase_commit parameter to an enum
> > > parameter based on the suggestion from Robert[1]. Valid values are
> > > 'required', 'prefer' and 'disabled' (default). When set to either
> > > 'required' or 'prefer', atomic commit will be used. The difference
> > > between 'required' and 'prefer' is that when set to 'required' we
> > > require *all* modified servers to be able to use 2PC, whereas with
> > > 'prefer' we use 2PC where available. So with 'required', if any of the
> > > written participants disables 2PC or doesn't support the atomic commit
> > > API, the transaction fails. IOW, when 'required', we can commit only
> > > when data consistency among all participants can be preserved.
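To make the three-way behavior concrete, here is a small standalone C sketch of the decision logic described above. The type and function names are illustrative, not the patch's actual code, and the local server is ignored for simplicity:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { FTC_DISABLED, FTC_PREFER, FTC_REQUIRED } ForeignTwophaseCommit;

/* One foreign server participant, with its 2PC capability (illustrative) */
typedef struct Participant
{
    bool modified;         /* did the transaction write to this server? */
    bool twophase_capable; /* does it support and enable 2PC? */
} Participant;

/*
 * Decide whether a distributed commit may proceed and whether it must use
 * two-phase commit.  Returns false when the commit must fail: 'required'
 * was set but some written participant cannot do 2PC.
 */
static bool
decide_commit(ForeignTwophaseCommit guc, const Participant *p, int n,
              bool *use_twophase)
{
    int nmodified = 0, ncapable = 0;

    for (int i = 0; i < n; i++)
    {
        if (p[i].modified)
        {
            nmodified++;
            if (p[i].twophase_capable)
                ncapable++;
        }
    }

    if (guc == FTC_DISABLED || nmodified <= 1)
    {
        *use_twophase = false;  /* plain one-phase commit is enough */
        return true;
    }
    if (guc == FTC_REQUIRED && ncapable < nmodified)
        return false;           /* a written participant cannot do 2PC */

    /* 'prefer': 2PC where available; 'required': everyone is capable */
    *use_twophase = (ncapable > 0);
    return true;
}
```

With two written participants of which only one is 2PC-capable, 'required' fails the commit, 'prefer' still uses 2PC for the capable one, and 'disabled' falls back to one-phase commit everywhere.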
> > >
> > > Please review the patches.
> > >
> >
> > Since the previous patch conflicts with current HEAD, attached is an updated
> > set of patches.
> >
>
> Rebased and fixed a few bugs.
>
I got feedback regarding the transaction management FDW APIs at the Japan
PostgreSQL Developer Meetup[1] and am considering changing these APIs
to make them consistent with the XA interface[2] (xa_prepare(),
xa_commit() and xa_rollback()) as follows[3].
* FdwXactResult PrepareForeignTransaction(FdwXactState *state, int flags)
* FdwXactResult CommitForeignTransaction(FdwXactState *state, int flags)
* FdwXactResult RollbackForeignTransaction(FdwXactState *state, int flags)
* char *GetPrepareId(TransactionId xid, Oid serverid, Oid userid, int
*prep_id_len)
The flags argument carries various settings; currently it would contain only
FDW_XACT_FLAG_ONEPHASE, which requires the FDW to commit in one phase (i.e.
without preparation). The *state argument contains the information necessary
to identify the transaction: serverid, userid, usermappingid and prepared id.
The GetPrepareId API is optional. Also, I've removed the two_phase_commit
parameter from the postgres_fdw options because we can disable the use of the
two-phase commit protocol for distributed transactions via the
distributed_atomic_commit GUC parameter.
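For illustration, the proposed callback set could be declared in C roughly as below. The struct layout, typedef names and the flag value are my assumptions based on the description above, not the actual patch:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;           /* stand-in for PostgreSQL's Oid */
typedef unsigned int TransactionId; /* stand-in for PostgreSQL's TransactionId */

typedef enum { FDW_XACT_OK, FDW_XACT_ERROR } FdwXactResult;

/* Commit in one phase, i.e. without a preceding PREPARE (assumed value) */
#define FDW_XACT_FLAG_ONEPHASE 0x01

/* Information identifying one foreign transaction, per the description above */
typedef struct FdwXactState
{
    Oid   serverid;
    Oid   userid;
    Oid   usermappingid;
    char *prepared_id;
} FdwXactState;

/* The XA-style callback set: prepare/commit/rollback, plus optional id maker */
typedef FdwXactResult (*PrepareForeignTransaction_function) (FdwXactState *state, int flags);
typedef FdwXactResult (*CommitForeignTransaction_function) (FdwXactState *state, int flags);
typedef FdwXactResult (*RollbackForeignTransaction_function) (FdwXactState *state, int flags);
typedef char *(*GetPrepareId_function) (TransactionId xid, Oid serverid,
                                        Oid userid, int *prep_id_len);
```

An FDW that only supports one-phase commit would see FDW_XACT_FLAG_ONEPHASE set in flags on every CommitForeignTransaction/RollbackForeignTransaction call.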
Foreign transactions whose FDW provides both the CommitForeignTransaction
API and the RollbackForeignTransaction API will be managed by the global
transaction manager automatically. In addition, if the FDW also
provides the PrepareForeignTransaction API, it will participate in the
two-phase commit protocol as a participant. So the existing FDWs that don't
provide the transaction management FDW APIs can continue to work as before
even after this patch gets committed.
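The capability-based behavior described here could be sketched as follows; FdwRoutine here is a hypothetical, reduced stand-in for the real FDW routine table:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced subset of an FDW routine table */
typedef struct FdwRoutine
{
    void *CommitForeignTransaction;
    void *RollbackForeignTransaction;
    void *PrepareForeignTransaction;
} FdwRoutine;

typedef enum
{
    FDW_UNMANAGED,        /* no transaction callbacks: works as before */
    FDW_MANAGED_ONEPHASE, /* commit/rollback only: managed, no 2PC */
    FDW_MANAGED_TWOPHASE  /* also provides prepare: full 2PC participant */
} FdwXactCapability;

/* Classify an FDW by which transaction callbacks it provides */
static FdwXactCapability
classify_fdw(const FdwRoutine *r)
{
    if (r->CommitForeignTransaction == NULL ||
        r->RollbackForeignTransaction == NULL)
        return FDW_UNMANAGED;
    if (r->PrepareForeignTransaction == NULL)
        return FDW_MANAGED_ONEPHASE;
    return FDW_MANAGED_TWOPHASE;
}
```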
The one point I'm concerned about with this API design is that since
both the CommitForeignTransaction API and the RollbackForeignTransaction
API will be used by two different kinds of processes (backend and
transaction resolver processes), it might be hard for FDW developers
to understand them correctly.
I'd like to define new APIs so that FDW developers don't get confused.
Feedback is very welcome.
[1] https://wiki.postgresql.org/wiki/Japan_PostgreSQL_Developer_Meetup
[2] https://en.wikipedia.org/wiki/X/Open_XA
[3] The current API design I'm proposing has 6 APIs: Prepare, Commit,
Rollback, Resolve, IsTwoPhaseEnabled and GetPrepareId. And these APIs
are divided based on who executes them.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Tue, Jan 29, 2019 at 5:47 PM Ildar Musin <ildar@adjust.com> wrote: > > Hello, > > The patch needs rebase as it doesn't apply to the current master. I applied it > to the older commit to test it. It worked fine so far. Thank you for testing the patch! > > I found one bug though which would cause resolver to finish by timeout even > though there are unresolved foreign transactions in the list. The > `fdw_xact_exists()` function expects database id as the first argument and xid > as the second. But everywhere it is called arguments specified in the different > order (xid first, then dbid). Also function declaration in header doesn't > match its definition. Will fix. > > There are some other things I found. > * In `FdwXactResolveAllDanglingTransactions()` variable `n_resolved` is > declared as bool but used as integer. > * In fdwxact.c's module comment there are `FdwXactRegisterForeignTransaction()` > and `FdwXactMarkForeignTransactionModified()` functions mentioned that are > not there anymore. > * In documentation (storage.sgml) there is no mention of `pg_fdw_xact` > directory. > > Couple of stylistic notes. > * In `FdwXactCtlData struct` there are both camel case and snake case naming > used. > * In `get_fdw_xacts()` `xid != InvalidTransactionId` can be replaced with > `TransactionIdIsValid(xid)`. > * In `generate_fdw_xact_identifier()` the `fx` prefix could be a part of format > string instead of being processed by `sprintf` as an extra argument. > I'll incorporate them at the next patch set. > I'll continue looking into the patch. Thanks! Thanks. Actually I'm updating the patch set, changing API interface as I proposed before and improving the document and README. I'll submit the latest patch next week. -- Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
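The last stylistic note in the review above (folding the "fx" prefix into the format string instead of passing it as an extra sprintf argument) could look like this hypothetical reconstruction of generate_fdw_xact_identifier(); the real signature and format in the patch may differ:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an identifier for a prepared foreign transaction.  The "fx"
 * prefix is part of the format string, as the review suggests, rather
 * than a separate argument.  (Hypothetical sketch, not the patch code.)
 */
static void
generate_fdw_xact_identifier(char *buf, size_t buflen,
                             unsigned int xid, unsigned int serverid,
                             unsigned int userid)
{
    snprintf(buf, buflen, "fx_%u_%u_%u", xid, serverid, userid);
}
```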
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Thu, Jan 31, 2019 at 11:09:09AM +0100, Masahiko Sawada wrote: > Thanks. Actually I'm updating the patch set, changing API interface as > I proposed before and improving the document and README. I'll submit > the latest patch next week. Cool, I have moved the patch to next CF. -- Michael
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Thu, Jan 31, 2019 at 7:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 5:47 PM Ildar Musin <ildar@adjust.com> wrote: > > > > Hello, > > > > The patch needs rebase as it doesn't apply to the current master. I applied it > > to the older commit to test it. It worked fine so far. > > Thank you for testing the patch! > > > > > I found one bug though which would cause resolver to finish by timeout even > > though there are unresolved foreign transactions in the list. The > > `fdw_xact_exists()` function expects database id as the first argument and xid > > as the second. But everywhere it is called arguments specified in the different > > order (xid first, then dbid). Also function declaration in header doesn't > > match its definition. > > Will fix. > > > > > There are some other things I found. > > * In `FdwXactResolveAllDanglingTransactions()` variable `n_resolved` is > > declared as bool but used as integer. > > * In fdwxact.c's module comment there are `FdwXactRegisterForeignTransaction()` > > and `FdwXactMarkForeignTransactionModified()` functions mentioned that are > > not there anymore. > > * In documentation (storage.sgml) there is no mention of `pg_fdw_xact` > > directory. > > > > Couple of stylistic notes. > > * In `FdwXactCtlData struct` there are both camel case and snake case naming > > used. > > * In `get_fdw_xacts()` `xid != InvalidTransactionId` can be replaced with > > `TransactionIdIsValid(xid)`. > > * In `generate_fdw_xact_identifier()` the `fx` prefix could be a part of format > > string instead of being processed by `sprintf` as an extra argument. > > > > I'll incorporate them at the next patch set. > > > I'll continue looking into the patch. Thanks! > > Thanks. Actually I'm updating the patch set, changing API interface as > I proposed before and improving the document and README. I'll submit > the latest patch next week. > Sorry for the very late. Attached updated version patches. 
The basic mechanism has not been changed since the previous version, but the
updated version of the patch uses a single wait queue instead of the two queues
(active and retry) that were used in the previous version. Every backend
process has a timestamp in PGPROC (fdwXactNextResolutionTs), which is the time
at which it expects to be processed by a foreign transaction resolver process.
Entries in the wait queue are ordered by their timestamps. The wait queue and
the timestamp are used after a backend process has prepared all transactions on
foreign servers and is waiting for all of them to be resolved.

Backend processes that are committing/aborting a distributed transaction insert
themselves into the wait queue (FdwXactRslvCtl->fdwxact_queue) with the current
timestamp, and then request to launch a new resolver process if one is not
launched yet. If there is already a resolver connected to the same database,
they just set its latch. The wait queue is protected by the LWLock
FdwXactResolutionLock. The backend then sleeps until either the user requests
a cancel (presses ctrl-c) or it is woken up by the resolver process.

The foreign transaction resolver process keeps polling the wait queue, checking
whether there is any waiter on the database that the resolver process connects
to. If there is a waiter, it fetches it and checks its timestamp. If the
current time has passed that timestamp, the resolver process starts to resolve
all of its foreign transactions. Usually backend processes insert themselves
into the wait queue first and then wake up the resolver, and they use the same
wall clock, so the resolver can fetch the waiter just inserted. Once all
foreign transactions are resolved, the resolver process deletes the backend
entry from the wait queue and then wakes up the waiting backend. On failure
during foreign transaction resolution, while the backend is still sleeping, the
resolver process removes the backend and re-inserts it with a new timestamp
(its timestamp plus foreign_transaction_resolution_interval) at the appropriate
position in the wait queue.
This mechanism ensures that a distributed transaction is resolved as soon as
the waiter is inserted, while ensuring that the resolver can retry resolving
failed foreign transactions at an interval of
foreign_transaction_resolution_interval.

For handling in-doubt transactions, I've removed the automatic foreign
transaction resolution code from the first version of the patch since it's not
an essential feature and we can add it later. Therefore the user needs to
resolve unresolved foreign transactions manually using the
pg_resolve_fdwxacts() function in three cases: where the foreign server crashed
or we lost connectivity to it while preparing the foreign transaction, where
the coordinator node crashed while preparing/resolving the foreign transaction,
and where the user canceled resolving the foreign transaction.

Foreign transaction resolver processes exit if they have had no foreign
transaction to resolve for longer than foreign_transaction_resolver_timeout.
Since we cannot drop a database while a resolver process is connected to it, we
can stop a resolver by calling the pg_stop_fdwxact_resolver() function.

The comment at the top of the fdwxact.c file describes the locking mechanism
and recovery, and src/backend/fdwxact/README describes the status transitions
of FdwXact. Also, the wiki page[1] describes how to use this feature with some
examples.

[1] https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
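As a rough illustration of the timestamp-ordered wait queue and the retry behavior described above (all names and structures here are simplified stand-ins, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef long TimestampTz;   /* stand-in for PostgreSQL's TimestampTz */

typedef struct Waiter
{
    int            backend_id;
    TimestampTz    next_resolution_ts; /* like fdwXactNextResolutionTs */
    struct Waiter *next;
} Waiter;

/* Insert keeping the queue ordered by timestamp, as the text describes */
static void
queue_insert(Waiter **head, Waiter *w)
{
    while (*head && (*head)->next_resolution_ts <= w->next_resolution_ts)
        head = &(*head)->next;
    w->next = *head;
    *head = w;
}

/*
 * One polling step of the resolver: if the head waiter's timestamp has
 * passed, try to resolve it.  On failure it is re-inserted with the
 * retry interval added; on success the sleeping backend would be woken.
 */
static void
resolver_poll(Waiter **head, TimestampTz now, TimestampTz retry_interval,
              bool (*resolve)(Waiter *))
{
    Waiter *w = *head;

    if (w == NULL || w->next_resolution_ts > now)
        return;                 /* nothing due yet */

    *head = w->next;            /* dequeue */
    if (!resolve(w))
    {
        w->next_resolution_ts = now + retry_interval;
        queue_insert(head, w);  /* retry later, at the interval above */
    }
    /* on success the waiting backend would be woken up here */
}

/* trivial resolver callbacks for illustration */
static bool resolve_ok(Waiter *w)   { (void) w; return true; }
static bool resolve_fail(Waiter *w) { (void) w; return false; }
```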
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Wed, Apr 17, 2019 at 10:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Sorry for the very late. Attached updated version patches. Hello Sawada-san, Can we please have a fresh rebase? Thanks, -- Thomas Munro https://enterprisedb.com
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Apr 17, 2019 at 10:23 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Sorry for the very late. Attached updated version patches. > > Hello Sawada-san, > > Can we please have a fresh rebase? > Thank you for the notice. Attached rebased patches. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
Hello Sawada-san, On 2019-Jul-02, Masahiko Sawada wrote: > On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > Can we please have a fresh rebase? > > Thank you for the notice. Attached rebased patches. ... and again? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Wed, Sep 4, 2019 at 7:36 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Hello Sawada-san, > > On 2019-Jul-02, Masahiko Sawada wrote: > > > On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > Can we please have a fresh rebase? > > > > Thank you for the notice. Attached rebased patches. > > ... and again? > Thank you for the notice. I've attached rebased patch set. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Wed, Sep 4, 2019 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Sep 4, 2019 at 7:36 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > Hello Sawada-san, > > > > On 2019-Jul-02, Masahiko Sawada wrote: > > > > > On Mon, Jul 1, 2019 at 8:32 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > > > Can we please have a fresh rebase? > > > > > > Thank you for the notice. Attached rebased patches. > > > > ... and again? > > > > Thank you for the notice. I've attached rebased patch set. I forgot to include some new header files. Attached the updated patches. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Wed, Sep 04, 2019 at 12:44:20PM +0900, Masahiko Sawada wrote: > I forgot to include some new header files. Attached the updated patches. No reviews since and the patch does not apply anymore. I am moving it to next CF, waiting on author. -- Michael
Attachment
Hello. This is the reased (and a bit fixed) version of the patch. This applies on the master HEAD and passes all provided tests. I took over this work from Sawada-san. I'll begin with reviewing the current patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 733f1e413ef2b2fe1d3ecba41eb4cd8e355ab826 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 5 Dec 2019 16:59:47 +0900 Subject: [PATCH v26 1/5] Keep track of writing on non-temporary relation Original Author: Masahiko Sawada <sawada.mshk@gmail.com> --- src/backend/executor/nodeModifyTable.c | 12 ++++++++++++ src/include/access/xact.h | 6 ++++++ 2 files changed, 18 insertions(+) diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index e3eb9d7b90..cd91f9c8a8 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -587,6 +587,10 @@ ExecInsert(ModifyTableState *mtstate, estate->es_output_cid, 0, NULL); + /* Make note that we've wrote on non-temprary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + /* insert index entries for tuple */ if (resultRelInfo->ri_NumIndices > 0) recheckIndexes = ExecInsertIndexTuples(slot, estate, false, NULL, @@ -938,6 +942,10 @@ ldelete:; if (tupleDeleted) *tupleDeleted = true; + /* Make note that we've wrote on non-temprary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + /* * If this delete is the result of a partition key update that moved the * tuple to a new partition, put this row into the transition OLD TABLE, @@ -1447,6 +1455,10 @@ lreplace:; recheckIndexes = ExecInsertIndexTuples(slot, estate, false, NULL, NIL); } + /* Make note that we've wrote on non-temprary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + if (canSetTag) (estate->es_processed)++; diff --git 
a/src/include/access/xact.h b/src/include/access/xact.h index 9d2899dea1..cb5c4935d2 100644 --- a/src/include/access/xact.h +++ b/src/include/access/xact.h @@ -102,6 +102,12 @@ extern int MyXactFlags; */ #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK (1U << 1) +/* + * XACT_FLAGS_WROTENONTEMPREL - set when we wrote data on non-temporary + * relation. + */ +#define XACT_FLAGS_WROTENONTEMPREL (1U << 2) + /* * start- and end-of-transaction callbacks for dynamically loaded modules */ -- 2.23.0 From d21c72a7db85c2211504f60fca8d39c0bd0ee5a6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 5 Dec 2019 17:00:50 +0900 Subject: [PATCH v26 2/5] Support atomic commit among multiple foreign servers. Original Author: Masahiko Sawada <sawada.mshk@gmail.com> --- src/backend/access/Makefile | 2 +- src/backend/access/fdwxact/Makefile | 17 + src/backend/access/fdwxact/README | 130 + src/backend/access/fdwxact/fdwxact.c | 2816 +++++++++++++++++ src/backend/access/fdwxact/launcher.c | 644 ++++ src/backend/access/fdwxact/resolver.c | 344 ++ src/backend/access/rmgrdesc/Makefile | 1 + src/backend/access/rmgrdesc/fdwxactdesc.c | 58 + src/backend/access/rmgrdesc/xlogdesc.c | 6 +- src/backend/access/transam/rmgr.c | 1 + src/backend/access/transam/twophase.c | 42 + src/backend/access/transam/xact.c | 27 +- src/backend/access/transam/xlog.c | 34 +- src/backend/catalog/system_views.sql | 11 + src/backend/commands/copy.c | 6 + src/backend/commands/foreigncmds.c | 30 + src/backend/executor/execPartition.c | 8 + src/backend/executor/nodeForeignscan.c | 24 + src/backend/executor/nodeModifyTable.c | 18 + src/backend/foreign/foreign.c | 57 + src/backend/postmaster/bgworker.c | 8 + src/backend/postmaster/pgstat.c | 20 + src/backend/postmaster/postmaster.c | 15 +- src/backend/replication/logical/decode.c | 1 + src/backend/storage/ipc/ipci.c | 6 + src/backend/storage/ipc/procarray.c | 46 + src/backend/storage/lmgr/lwlocknames.txt | 3 + src/backend/storage/lmgr/proc.c | 
8 + src/backend/tcop/postgres.c | 14 + src/backend/utils/misc/guc.c | 82 + src/backend/utils/misc/postgresql.conf.sample | 16 + src/backend/utils/probes.d | 2 + src/bin/initdb/initdb.c | 1 + src/bin/pg_controldata/pg_controldata.c | 2 + src/bin/pg_resetwal/pg_resetwal.c | 2 + src/bin/pg_waldump/fdwxactdesc.c | 1 + src/bin/pg_waldump/rmgrdesc.c | 1 + src/include/access/fdwxact.h | 165 + src/include/access/fdwxact_launcher.h | 29 + src/include/access/fdwxact_resolver.h | 23 + src/include/access/fdwxact_xlog.h | 54 + src/include/access/resolver_internal.h | 66 + src/include/access/rmgrlist.h | 1 + src/include/access/twophase.h | 1 + src/include/access/xact.h | 7 + src/include/access/xlog_internal.h | 1 + src/include/catalog/pg_control.h | 1 + src/include/catalog/pg_proc.dat | 29 + src/include/foreign/fdwapi.h | 12 + src/include/foreign/foreign.h | 1 + src/include/pgstat.h | 9 +- src/include/storage/proc.h | 11 + src/include/storage/procarray.h | 5 + src/include/utils/guc_tables.h | 3 + src/test/regress/expected/rules.out | 13 + 55 files changed, 4917 insertions(+), 18 deletions(-) create mode 100644 src/backend/access/fdwxact/Makefile create mode 100644 src/backend/access/fdwxact/README create mode 100644 src/backend/access/fdwxact/fdwxact.c create mode 100644 src/backend/access/fdwxact/launcher.c create mode 100644 src/backend/access/fdwxact/resolver.c create mode 100644 src/backend/access/rmgrdesc/fdwxactdesc.c create mode 120000 src/bin/pg_waldump/fdwxactdesc.c create mode 100644 src/include/access/fdwxact.h create mode 100644 src/include/access/fdwxact_launcher.h create mode 100644 src/include/access/fdwxact_resolver.h create mode 100644 src/include/access/fdwxact_xlog.h create mode 100644 src/include/access/resolver_internal.h diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile index 0880e0a8bb..49480dd039 100644 --- a/src/backend/access/Makefile +++ b/src/backend/access/Makefile @@ -9,6 +9,6 @@ top_builddir = ../../.. 
include $(top_builddir)/src/Makefile.global SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist \ - table tablesample transam + table tablesample transam fdwxact include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/fdwxact/Makefile b/src/backend/access/fdwxact/Makefile new file mode 100644 index 0000000000..0207a66fb4 --- /dev/null +++ b/src/backend/access/fdwxact/Makefile @@ -0,0 +1,17 @@ +#------------------------------------------------------------------------- +# +# Makefile-- +# Makefile for access/fdwxact +# +# IDENTIFICATION +# src/backend/access/fdwxact/Makefile +# +#------------------------------------------------------------------------- + +subdir = src/backend/access/fdwxact +top_builddir = ../../../.. +include $(top_builddir)/src/Makefile.global + +OBJS = fdwxact.o resolver.o launcher.o + +include $(top_srcdir)/src/backend/common.mk diff --git a/src/backend/access/fdwxact/README b/src/backend/access/fdwxact/README new file mode 100644 index 0000000000..46ccb7eeae --- /dev/null +++ b/src/backend/access/fdwxact/README @@ -0,0 +1,130 @@ +src/backend/access/fdwxact/README + +Atomic Commit for Distributed Transactions +=========================================== + +The atomic commit feature enables us to commit and rollback either all of +foreign servers or nothing. This ensures that the database data is always left +in a conssitent state in term of federated database. + + +Commit Sequence of Global Transactions +-------------------------------- + +We employee two-phase commit protocol to achieve commit among all foreign +servers atomically. The sequence of distributed transaction commit consisnts +of the following four steps: + +1. 
Foriegn Server Registration +During executor node initialization, accessed foreign servers are registered +to the list FdwXactAtomicCommitParticipants, which is maintained by +PostgreSQL's the global transaction manager (GTM), as a distributed transaction +participant The registered foreign transactions are tracked until the end of +transaction. + +2. Pre-Commit phase (1st phase of two-phase commit) +we record the corresponding WAL indicating that the foreign server is involved +with the current transaction before doing PREPARE all foreign transactions. +Thus in case we loose connectivity to the foreign server or crash ourselves, +we will remember that we might have prepared tranascation on the foreign +server, and try to resolve it when connectivity is restored or after crash +recovery. + +The two-phase commit is required only if the transaction modified two or more +servers including the local node. In other case, we can commit them at this +step by calling CommitForeignTransaction() API and no need further operation. + +After that we prepare all foreign transactions by calling +PrepareForeignTransaction() API. If we failed on any of them we change to +rollback, therefore at this time some participants might be prepared whereas +some are not prepared. The former foreign transactions need to be resolved +using pg_resolve_foreign_xact() manually and the latter ends transaction +in one-phase by calling RollbackForeignTransaction() API. + +3. Commit locally +Once we've prepared all of them, commit the transaction locally. + +4. Post-Commit Phase (2nd phase of two-phase commit) +The steps so far are done by the backend process committing the transaction but +this resolution step(commit or rollback) is done by the foreign transaction +resolver process. The backend process inserts itselft to the wait queue, and +then wake up the resolver process (or request to launch new one if necessary). 
+The resolver process dequeues the waiter and fetches the distributed transaction +information that the backend is waiting for. Once all foreign transactions are +committed or rolled back, the resolver process wakes up the waiter. + + +API Contract With Transaction Management Callback Functions +----------------------------------------------------------- + +The core GTM manages the status of individual foreign transactions and calls +transaction management callback functions according to their status. The +callback functions PrepareForeignTransaction, CommitForeignTransaction and +RollbackForeignTransaction are responsible for PREPARE, COMMIT and +ROLLBACK of the transaction on the foreign server, respectively. +FdwXactRslvState->flags can contain FDWXACT_FLAG_ONEPHASE, meaning the FDW can +commit or rollback the foreign transaction in one phase. On failure while +processing a foreign transaction, the FDW needs to raise an error. However, the FDW +must accept an ERRCODE_UNDEFINED_OBJECT error while committing or rolling back a +foreign transaction, because there is a race condition in which the coordinator +could crash between completing the resolution and writing the WAL record +removing the FdwXact entry. + + +Foreign Transactions Status +---------------------------- + +Every foreign transaction has an FdwXact entry. When preparing a foreign +transaction, an FdwXact entry whose status starts at FDWXACT_STATUS_INITIAL +is created with WAL logging. The status changes to FDWXACT_STATUS_PREPARED +after the foreign transaction is prepared, and it changes to +FDWXACT_STATUS_PREPARING, FDWXACT_STATUS_COMMITTING and FDWXACT_STATUS_ABORTING +before the foreign transaction is prepared, committed and aborted by the FDW +callback functions, respectively (*1). The status then changes to +FDWXACT_STATUS_RESOLVED once the foreign transaction is resolved, and then +the corresponding FdwXact entry is removed with WAL logging. If we fail while +processing a foreign transaction (e.g.
preparing, committing or aborting), the +status changes back to the previous one. Therefore the +FDWXACT_STATUS_xxxING statuses appear only while the foreign transaction is being +processed by an FDW callback function. + +FdwXact entries recovered during recovery are marked as in-doubt if the +corresponding local transaction is not a prepared transaction. The initial +status is FDWXACT_STATUS_PREPARED (*2). Because the foreign transaction was +being processed, we cannot know the exact status, so we regard it as PREPARED +for safety. + +The foreign transaction status transition is illustrated by the following graph +describing the FdwXact->status: + + +----------------------------------------------------+ + | INVALID | + +----------------------------------------------------+ + | | | + | v | + | +---------------------+ | + | | INITIAL | | + | +---------------------+ | + (*2) | (*2) + | v | + | +---------------------+ | + | | PREPARING(*1) | | + | +---------------------+ | + | | | + v v v + +----------------------------------------------------+ + | PREPARED | + +----------------------------------------------------+ + | | + v v + +--------------------+ +--------------------+ + | COMMITTING(*1) | | ABORTING(*1) | + +--------------------+ +--------------------+ + | | + v v + +----------------------------------------------------+ + | RESOLVED | + +----------------------------------------------------+ + +(*1) Statuses that appear only while being processed by the FDW +(*2) Paths for recovered FdwXact entries diff --git a/src/backend/access/fdwxact/fdwxact.c b/src/backend/access/fdwxact/fdwxact.c new file mode 100644 index 0000000000..058a416f81 --- /dev/null +++ b/src/backend/access/fdwxact/fdwxact.c @@ -0,0 +1,2816 @@ +/*------------------------------------------------------------------------- + * + * fdwxact.c + * PostgreSQL global transaction manager for foreign servers.
+ * + * To achieve commit among all foreign servers atomically, we employ the + * two-phase commit protocol, which is a type of atomic commitment + * protocol (ACP). The basic strategy is that we prepare all of the remote + * transactions before committing locally and commit them after committing + * locally. + * + * During executor node initialization, executor nodes can register a foreign server + * by calling either RegisterFdwXactByRelId() or RegisterFdwXactByServerId() + * to make it participate in a group for global commit. A foreign server is + * registered if its FDW has both the CommitForeignTransaction API and the + * RollbackForeignTransaction API. Registered participant servers are identified + * by the OIDs of the foreign server and user. + * + * During pre-commit of the local transaction, we prepare the transaction on + * every foreign server. And after committing or rolling back locally, + * we notify the resolver process and tell it to commit or rollback those + * transactions. If we ask it to commit, we also tell it to notify us when + * it's done, so that we can wait interruptibly for it to finish, and so + * that we're not trying to locally do work that might fail after the foreign + * transactions are committed. + * + * The best performing way to manage the waiting backends is to have a + * queue of waiting backends, so that we can avoid searching through all + * foreign transactions each time we receive a request. We have one queue + * whose elements are ordered by the timestamp at which they expect to be + * processed. Before waiting for foreign transactions to be resolved, the + * backend enqueues itself with the timestamp at which it expects to be processed. + * Similarly, if it fails to resolve them, it enqueues again with a new timestamp + * (its timestamp + foreign_xact_resolution_interval). + * + * If a network failure or server crash occurs, or the user stops waiting, + * prepared foreign transactions are left in an in-doubt state (aka in-doubt + * transactions).
Foreign transactions in the in-doubt state are not resolved + * automatically, so they must be processed manually using the pg_resolve_fdwxact() + * function. + * + * The two-phase commit protocol is required if the transaction modified two or + * more servers including itself. Otherwise, all foreign transactions are + * committed or rolled back during pre-commit. + * + * LOCKING + * + * Whenever a foreign transaction is processed by the FDW, the corresponding + * FdwXact entry is updated. In order to protect the entry from concurrent + * removal we need to hold a lock on the entry or a lock on the entire global + * array. However, we don't want to hold the lock while the FDW is processing the + * foreign transaction, which may take an unpredictable amount of time. To avoid this, the + * in-memory data of foreign transactions follows a locking model based on + * four linked concepts: + * + * * A foreign transaction's status variable is switched using the LWLock + * FdwXactLock, which needs to be held in exclusive mode when updating the + * status, while readers need to hold it in shared mode when looking at the + * status. + * * A process that is going to update an FdwXact entry cannot process a foreign + * transaction that is being resolved. + * * So setting the status to FDWXACT_STATUS_PREPARING, + * FDWXACT_STATUS_COMMITTING or FDWXACT_STATUS_ABORTING, which are the foreign + * transaction in-progress states, means owning the FdwXact entry, which + * protects it from being updated or removed by concurrent writers. + * * Individual fields are protected by a mutex, and only the backend owning + * the foreign transaction is authorized to update the fields. + + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED, a + * process that is going to call the transaction callback functions needs to change + * the status to the corresponding status above while holding FdwXactLock in + * exclusive mode, and call the callback function after releasing the lock.
+ * + * RECOVERY + * + * During WAL replay and replication, FdwXactCtl also holds information about + * active prepared foreign transactions that haven't been moved to disk yet. + * + * Replay of fdwxact records happens by the following rules: + * + * * At the beginning of recovery, pg_fdwxacts is scanned once, filling FdwXact + * with entries marked with fdwxact->inredo and fdwxact->ondisk. FdwXact file + * data older than the XID horizon of the redo position are discarded. + * * On PREPARE redo, the foreign transaction is added to FdwXactCtl->fdwxacts. + * We set fdwxact->inredo to true for such entries. + * * On Checkpoint we iterate through FdwXactCtl->fdwxacts entries that + * have fdwxact->inredo set and are behind the redo_horizon. We save + * them to disk and then set fdwxact->ondisk to true. + * * On resolution we delete the entry from FdwXactCtl->fdwxacts. If + * fdwxact->ondisk is true, the corresponding entry from the disk is + * additionally deleted. + * * RecoverFdwXacts() and PrescanFdwXacts() have been modified to go through + * fdwxact->inredo entries that have not made it to disk.
+ * + * These replay rules are borrowed from twophase.c + * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/access/fdwxact/fdwxact.c + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include <sys/types.h> +#include <sys/stat.h> +#include <unistd.h> + +#include "access/fdwxact.h" +#include "access/fdwxact_resolver.h" +#include "access/fdwxact_launcher.h" +#include "access/fdwxact_xlog.h" +#include "access/resolver_internal.h" +#include "access/heapam.h" +#include "access/htup_details.h" +#include "access/twophase.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/xloginsert.h" +#include "access/xlogutils.h" +#include "catalog/pg_type.h" +#include "foreign/fdwapi.h" +#include "foreign/foreign.h" +#include "funcapi.h" +#include "libpq/pqsignal.h" +#include "miscadmin.h" +#include "parser/parsetree.h" +#include "pg_trace.h" +#include "pgstat.h" +#include "storage/fd.h" +#include "storage/ipc.h" +#include "storage/latch.h" +#include "storage/lock.h" +#include "storage/proc.h" +#include "storage/procarray.h" +#include "storage/pmsignal.h" +#include "storage/shmem.h" +#include "tcop/tcopprot.h" +#include "utils/builtins.h" +#include "utils/guc.h" +#include "utils/memutils.h" +#include "utils/ps_status.h" +#include "utils/rel.h" +#include "utils/snapmgr.h" + +/* Atomic commit is enabled by configuration */ +#define IsForeignTwophaseCommitEnabled() \ + (max_prepared_foreign_xacts > 0 && \ + max_foreign_xact_resolvers > 0) + +/* Foreign twophase commit is enabled and requested by user */ +#define IsForeignTwophaseCommitRequested() \ + (IsForeignTwophaseCommitEnabled() && \ + (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)) + +/* Check the FdwXactParticipant is capable of two-phase commit */ +#define IsSeverCapableOfTwophaseCommit(fdw_part) \ + (((FdwXactParticipant *)(fdw_part))->prepare_foreign_xact_fn != NULL) + +/* Check 
the FdwXact is being resolved */ +#define FdwXactIsBeingResolved(fx) \ + (((((FdwXact)(fx))->status) == FDWXACT_STATUS_PREPARING) || \ + ((((FdwXact)(fx))->status) == FDWXACT_STATUS_COMMITTING) || \ + ((((FdwXact)(fx))->status) == FDWXACT_STATUS_ABORTING)) + +/* + * Structure to bundle the foreign transaction participant. This struct + * is created at the beginning of execution for each foreign server and + * is used until the end of the transaction, when we cannot look at syscaches. + * Therefore, this is allocated in the TopTransactionContext. + */ +typedef struct FdwXactParticipant +{ + /* + * Pointer to a FdwXact entry in the global array. NULL if the entry + * is not inserted yet but this is registered as a participant. + */ + FdwXact fdwxact; + + /* Foreign server and user mapping info, passed to callback routines */ + ForeignServer *server; + UserMapping *usermapping; + + /* Transaction identifier used for PREPARE */ + char *fdwxact_id; + + /* true if we modified the data on the server */ + bool modified; + + /* Callbacks for foreign transaction */ + PrepareForeignTransaction_function prepare_foreign_xact_fn; + CommitForeignTransaction_function commit_foreign_xact_fn; + RollbackForeignTransaction_function rollback_foreign_xact_fn; + GetPrepareId_function get_prepareid_fn; +} FdwXactParticipant; + +/* + * List of foreign transaction participants for atomic commit. This list + * has only foreign servers that provide transaction management callbacks, + * that is CommitForeignTransaction and RollbackForeignTransaction. + */ +static List *FdwXactParticipants = NIL; +static bool ForeignTwophaseCommitIsRequired = false; + +/* Directory where the foreign prepared transaction files will reside */ +#define FDWXACTS_DIR "pg_fdwxact" + +/* + * The name of a foreign prepared transaction file is the 8-hex-digit + * database oid, xid, foreign server oid and user oid, separated by '_'.
+ * + * Since FdwXact stat file is created per foreign transaction in a + * distributed transaction and the xid of unresolved distributed + * transaction never reused, the name is fairly enough to ensure + * uniqueness. + */ +#define FDWXACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8 + 1 + 8) +#define FdwXactFilePath(path, dbid, xid, serverid, userid) \ + snprintf(path, MAXPGPATH, FDWXACTS_DIR "/%08X_%08X_%08X_%08X", \ + dbid, xid, serverid, userid) + +/* Guc parameters */ +int max_prepared_foreign_xacts = 0; +int max_foreign_xact_resolvers = 0; +int foreign_twophase_commit = FOREIGN_TWOPHASE_COMMIT_DISABLED; + +/* Keep track of registering process exit call back. */ +static bool fdwXactExitRegistered = false; + +static FdwXact FdwXactInsertFdwXactEntry(TransactionId xid, + FdwXactParticipant *fdw_part); +static void FdwXactPrepareForeignTransactions(void); +static void FdwXactOnePhaseEndForeignTransaction(FdwXactParticipant *fdw_part, + bool for_commit); +static void FdwXactResolveForeignTransaction(FdwXact fdwxact, + FdwXactRslvState *state, + FdwXactStatus fallback_status); +static void FdwXactComputeRequiredXmin(void); +static void FdwXactCancelWait(void); +static void FdwXactRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn); +static void FdwXactRedoRemove(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, bool give_warnings); +static void FdwXactQueueInsert(PGPROC *waiter); +static void AtProcExit_FdwXact(int code, Datum arg); +static void ForgetAllFdwXactParticipants(void); +static char *ReadFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, + Oid userid); +static void RemoveFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, bool giveWarning); +static void RecreateFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, void *content, int len); +static void XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len); +static char *ProcessFdwXactBuffer(Oid dbid, TransactionId local_xid, + Oid serverid, Oid userid, + 
XLogRecPtr insert_start_lsn, + bool from_disk); +static void FdwXactDetermineTransactionFate(FdwXact fdwxact, bool need_lock); +static bool is_foreign_twophase_commit_required(void); +static void register_fdwxact(Oid serverid, Oid userid, bool modified); +static List *get_fdwxacts(Oid dbid, TransactionId xid, Oid serverid, Oid userid, + bool including_indoubts, bool include_in_progress, + bool need_lock); +static FdwXact get_all_fdwxacts(int *num_p); +static FdwXact insert_fdwxact(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, Oid umid, char *fdwxact_id); +static char *get_fdwxact_identifier(FdwXactParticipant *fdw_part, + TransactionId xid); +static void remove_fdwxact(FdwXact fdwxact); +static FdwXact get_fdwxact_to_resolve(Oid dbid, TransactionId xid); +static FdwXactRslvState *create_fdwxact_state(void); + +#ifdef USE_ASSERT_CHECKING +static bool FdwXactQueueIsOrderedByTimestamp(void); +#endif + +/* + * Remember accessed foreign transaction. Both RegisterFdwXactByRelId and + * RegisterFdwXactByServerId are called by executor during initialization. + */ +void +RegisterFdwXactByRelId(Oid relid, bool modified) +{ + Relation rel; + Oid serverid; + Oid userid; + + rel = relation_open(relid, NoLock); + serverid = GetForeignServerIdByRelId(relid); + userid = rel->rd_rel->relowner ? rel->rd_rel->relowner : GetUserId(); + relation_close(rel, NoLock); + + register_fdwxact(serverid, userid, modified); +} + +void +RegisterFdwXactByServerId(Oid serverid, bool modified) +{ + register_fdwxact(serverid, GetUserId(), modified); +} + +/* + * Register given foreign transaction identified by given arguments as + * a participant of the transaction. + * + * The foreign transaction identified by given server id and user id. + * Registered foreign transactions are managed by the global transaction + * manager until the end of the transaction. 
+ */ +static void +register_fdwxact(Oid serverid, Oid userid, bool modified) +{ + FdwXactParticipant *fdw_part; + ForeignServer *foreign_server; + UserMapping *user_mapping; + MemoryContext old_ctx; + FdwRoutine *routine; + ListCell *lc; + + foreach(lc, FdwXactParticipants) + { + FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc); + + if (fdw_part->server->serverid == serverid && + fdw_part->usermapping->userid == userid) + { + /* The foreign server is already registered, return */ + fdw_part->modified |= modified; + return; + } + } + + /* + * Participant's information is also needed at the end of a transaction, + * where system cache are not available. Save it in TopTransactionContext + * so that these can live until the end of transaction. + */ + old_ctx = MemoryContextSwitchTo(TopTransactionContext); + routine = GetFdwRoutineByServerId(serverid); + + /* + * Don't register foreign server if it doesn't provide both commit and + * rollback transaction management callbacks. + */ + if (!routine->CommitForeignTransaction || + !routine->RollbackForeignTransaction) + { + MyXactFlags |= XACT_FLAGS_FDWNOPREPARE; + pfree(routine); + return; + } + + /* + * Remember we touched the foreign server that is not capable of two-phase + * commit. 
+ */ + if (!routine->PrepareForeignTransaction) + MyXactFlags |= XACT_FLAGS_FDWNOPREPARE; + + foreign_server = GetForeignServer(serverid); + user_mapping = GetUserMapping(userid, serverid); + + + fdw_part = (FdwXactParticipant *) palloc(sizeof(FdwXactParticipant)); + + fdw_part->fdwxact_id = NULL; + fdw_part->server = foreign_server; + fdw_part->usermapping = user_mapping; + fdw_part->fdwxact = NULL; + fdw_part->modified = modified; + fdw_part->prepare_foreign_xact_fn = routine->PrepareForeignTransaction; + fdw_part->commit_foreign_xact_fn = routine->CommitForeignTransaction; + fdw_part->rollback_foreign_xact_fn = routine->RollbackForeignTransaction; + fdw_part->get_prepareid_fn = routine->GetPrepareId; + + /* Add to the participants list */ + FdwXactParticipants = lappend(FdwXactParticipants, fdw_part); + + /* Revert back the context */ + MemoryContextSwitchTo(old_ctx); +} + +/* + * Calculates the size of shared memory allocated for maintaining foreign + * prepared transaction entries. + */ +Size +FdwXactShmemSize(void) +{ + Size size; + + /* Size for foreign transaction information array */ + size = offsetof(FdwXactCtlData, fdwxacts); + size = add_size(size, mul_size(max_prepared_foreign_xacts, + sizeof(FdwXact))); + size = MAXALIGN(size); + size = add_size(size, mul_size(max_prepared_foreign_xacts, + sizeof(FdwXactData))); + + return size; +} + +/* + * Initialization of shared memory for maintaining foreign prepared transaction + * entries. The shared memory layout is defined in definition of FdwXactCtlData + * structure. 
+ */ +void +FdwXactShmemInit(void) +{ + bool found; + + if (!fdwXactExitRegistered) + { + before_shmem_exit(AtProcExit_FdwXact, 0); + fdwXactExitRegistered = true; + } + + FdwXactCtl = ShmemInitStruct("Foreign transactions table", + FdwXactShmemSize(), + &found); + if (!IsUnderPostmaster) + { + FdwXact fdwxacts; + int cnt; + + Assert(!found); + FdwXactCtl->free_fdwxacts = NULL; + FdwXactCtl->num_fdwxacts = 0; + + /* Initialize the linked list of free FDW transactions */ + fdwxacts = (FdwXact) + ((char *) FdwXactCtl + + MAXALIGN(offsetof(FdwXactCtlData, fdwxacts) + + sizeof(FdwXact) * max_prepared_foreign_xacts)); + for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++) + { + fdwxacts[cnt].status = FDWXACT_STATUS_INVALID; + fdwxacts[cnt].fdwxact_free_next = FdwXactCtl->free_fdwxacts; + FdwXactCtl->free_fdwxacts = &fdwxacts[cnt]; + SpinLockInit(&(fdwxacts[cnt].mutex)); + } + } + else + { + Assert(FdwXactCtl); + Assert(found); + } +} + +/* + * Prepare all foreign transactions if foreign twophase commit is required. + * If foreign twophase commit is required, the behavior depends on the value + * of foreign_twophase_commit; when 'required' we strictly require for all + * foreign server's FDWs to support two-phase commit protocol and ask them to + * prepare foreign transactions, when 'prefer' we ask only foreign servers + * that are capable of two-phase commit to prepare foreign transactions and ask + * for other servers to commit, and for 'disabled' we ask all foreign servers + * to commit foreign transaction in one-phase. If we failed to commit any of + * them we change to aborting. + * + * Note that non-modified foreign servers always can be committed without + * preparation. 
+ */ +void +PreCommit_FdwXacts(void) +{ + bool need_twophase_commit; + ListCell *lc = NULL; + + /* If there are no foreign servers involved, we have no business here */ + if (FdwXactParticipants == NIL) + return; + + /* + * We require that all modified servers be capable of the two-phase + * commit protocol. + */ + if (foreign_twophase_commit == FOREIGN_TWOPHASE_COMMIT_REQUIRED && + (MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot COMMIT a distributed transaction that has operated on a foreign server that doesn't support atomic commit"))); + + /* + * Check if we need to use foreign twophase commit. It's always false + * if foreign twophase commit is disabled. + */ + need_twophase_commit = is_foreign_twophase_commit_required(); + + /* + * First, we consider committing the foreign transactions in one phase. + */ + foreach(lc, FdwXactParticipants) + { + FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc); + bool commit = false; + + /* Can commit in one-phase if two-phase commit is not required */ + if (!need_twophase_commit) + commit = true; + + /* A non-modified foreign transaction can always be committed in one-phase */ + if (!fdw_part->modified) + commit = true; + + /* + * In the 'prefer' case, a server that is not capable of two-phase + * commit can be committed in one-phase.
+ */ + if (foreign_twophase_commit == FOREIGN_TWOPHASE_COMMIT_PREFER && + !IsSeverCapableOfTwophaseCommit(fdw_part)) + commit = true; + + if (commit) + { + /* Commit the foreign transaction in one-phase */ + FdwXactOnePhaseEndForeignTransaction(fdw_part, true); + + /* Delete it from the participant list */ + FdwXactParticipants = foreach_delete_current(FdwXactParticipants, + lc); + continue; + } + } + + /* All done if we committed all foreign transactions */ + if (FdwXactParticipants == NIL) + return; + + /* + * Second, if only one transaction remains in the participant list + * and we didn't modify the local data, we can commit it without + * preparation. + */ + if (list_length(FdwXactParticipants) == 1 && + (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) == 0) + { + /* Commit the foreign transaction in one-phase */ + FdwXactOnePhaseEndForeignTransaction(linitial(FdwXactParticipants), + true); + + /* All foreign transactions have been committed */ + list_free(FdwXactParticipants); + return; + } + + /* + * Finally, prepare the foreign transactions. Note that we keep + * FdwXactParticipants until the end of the transaction. + */ + FdwXactPrepareForeignTransactions(); +} + +/* + * Insert FdwXact entries and prepare foreign transactions. Before inserting + * an FdwXact entry we call the GetPrepareId callback to get a transaction + * identifier from the FDW. + * + * We can still switch to rollback here. If any error occurs, we roll back + * non-prepared foreign transactions and leave the others to the resolver.
+ */ +static void +FdwXactPrepareForeignTransactions(void) +{ + ListCell *lcell; + TransactionId xid; + + if (FdwXactParticipants == NIL) + return; + + /* Parameter check */ + if (max_prepared_foreign_xacts == 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("prepared foreign transactions are disabled"), + errhint("Set max_prepared_foreign_transactions to a nonzero value."))); + + if (max_foreign_xact_resolvers == 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("prepared foreign transactions are disabled"), + errhint("Set max_foreign_transaction_resolvers to a nonzero value."))); + + xid = GetTopTransactionId(); + + /* Loop over the foreign connections */ + foreach(lcell, FdwXactParticipants) + { + FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lcell); + FdwXactRslvState *state; + FdwXact fdwxact; + + fdw_part->fdwxact_id = get_fdwxact_identifier(fdw_part, xid); + + Assert(fdw_part->fdwxact_id); + + /* + * Insert the foreign transaction entry with the FDWXACT_STATUS_PREPARING + * status. Registration persists this information to the disk and logs + * (that way it is relayed to standbys). Thus in case we lose connectivity + * to the foreign server or crash ourselves, we will remember that we + * might have a prepared transaction on the foreign server and try to + * resolve it when connectivity is restored or after crash recovery. + * + * If we prepare the transaction on the foreign server before persisting + * the information to the disk and crash in-between these two steps, + * we will forget that we prepared the transaction on the foreign server + * and will not be able to resolve it after the crash. Hence we persist + * first, then prepare.
+ */ + fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part); + + state = create_fdwxact_state(); + state->server = fdw_part->server; + state->usermapping = fdw_part->usermapping; + state->fdwxact_id = pstrdup(fdw_part->fdwxact_id); + + /* Update the status */ + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + Assert(fdwxact->status == FDWXACT_STATUS_INITIAL); + fdwxact->status = FDWXACT_STATUS_PREPARING; + LWLockRelease(FdwXactLock); + + /* + * Prepare the foreign transaction. + * + * Between the FdwXactInsertFdwXactEntry call and the time this backend + * hears an acknowledgement from the foreign server, the backend may abort + * the local transaction (say, because of a signal). + * + * During abort processing, we might try to resolve a never-prepared + * transaction, and get an error. This is fine as long as the FDW + * provides us unique prepared transaction identifiers. + */ + PG_TRY(); + { + fdw_part->prepare_foreign_xact_fn(state); + } + PG_CATCH(); + { + /* failed, back to the initial state */ + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + fdwxact->status = FDWXACT_STATUS_INITIAL; + LWLockRelease(FdwXactLock); + + PG_RE_THROW(); + } + PG_END_TRY(); + + /* succeeded, update status */ + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + fdwxact->status = FDWXACT_STATUS_PREPARED; + LWLockRelease(FdwXactLock); + } +} + +/* + * One-phase commit or rollback the given foreign transaction participant. + */ +static void +FdwXactOnePhaseEndForeignTransaction(FdwXactParticipant *fdw_part, + bool for_commit) +{ + FdwXactRslvState *state; + + Assert(fdw_part->commit_foreign_xact_fn); + Assert(fdw_part->rollback_foreign_xact_fn); + + state = create_fdwxact_state(); + state->server = fdw_part->server; + state->usermapping = fdw_part->usermapping; + state->flags = FDWXACT_FLAG_ONEPHASE; + + /* + * Commit or rollback the foreign transaction in one-phase. Since we didn't + * insert an FdwXact entry for this transaction we don't need to care about + * failures. On failure we switch to rollback.
+ */ + if (for_commit) + fdw_part->commit_foreign_xact_fn(state); + else + fdw_part->rollback_foreign_xact_fn(state); +} + +/* + * This function is used to create new foreign transaction entry before an FDW + * prepares and commit/rollback. The function adds the entry to WAL and it will + * be persisted to the disk under pg_fdwxact directory when checkpoint. + */ +static FdwXact +FdwXactInsertFdwXactEntry(TransactionId xid, FdwXactParticipant *fdw_part) +{ + FdwXact fdwxact; + FdwXactOnDiskData *fdwxact_file_data; + MemoryContext old_context; + int data_len; + + old_context = MemoryContextSwitchTo(TopTransactionContext); + + /* + * Enter the foreign transaction in the shared memory structure. + */ + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + fdwxact = insert_fdwxact(MyDatabaseId, xid, fdw_part->server->serverid, + fdw_part->usermapping->userid, + fdw_part->usermapping->umid, fdw_part->fdwxact_id); + fdwxact->status = FDWXACT_STATUS_INITIAL; + fdwxact->held_by = MyBackendId; + LWLockRelease(FdwXactLock); + + fdw_part->fdwxact = fdwxact; + MemoryContextSwitchTo(old_context); + + /* + * Prepare to write the entry to a file. Also add xlog entry. The contents + * of the xlog record are same as what is written to the file. 
+ */ + data_len = offsetof(FdwXactOnDiskData, fdwxact_id); + data_len = data_len + strlen(fdw_part->fdwxact_id) + 1; + data_len = MAXALIGN(data_len); + fdwxact_file_data = (FdwXactOnDiskData *) palloc0(data_len); + fdwxact_file_data->dbid = MyDatabaseId; + fdwxact_file_data->local_xid = xid; + fdwxact_file_data->serverid = fdw_part->server->serverid; + fdwxact_file_data->userid = fdw_part->usermapping->userid; + fdwxact_file_data->umid = fdw_part->usermapping->umid; + memcpy(fdwxact_file_data->fdwxact_id, fdw_part->fdwxact_id, + strlen(fdw_part->fdwxact_id) + 1); + + /* See note in RecordTransactionCommit */ + MyPgXact->delayChkpt = true; + + START_CRIT_SECTION(); + + /* Add the entry in the xlog and save LSN for checkpointer */ + XLogBeginInsert(); + XLogRegisterData((char *) fdwxact_file_data, data_len); + fdwxact->insert_end_lsn = XLogInsert(RM_FDWXACT_ID, XLOG_FDWXACT_INSERT); + XLogFlush(fdwxact->insert_end_lsn); + + /* If we crash now, we have prepared: WAL replay will fix things */ + + /* Store record's start location to read that later on CheckPoint */ + fdwxact->insert_start_lsn = ProcLastRecPtr; + + /* File is written completely, checkpoint can proceed with syncing */ + fdwxact->valid = true; + + /* Checkpoint can process now */ + MyPgXact->delayChkpt = false; + + END_CRIT_SECTION(); + + pfree(fdwxact_file_data); + return fdwxact; +} + +/* + * Insert a new entry for a given foreign transaction identified by transaction + * id, foreign server and user mapping, into the shared memory array. Caller + * must hold FdwXactLock in exclusive mode. + * + * If the entry already exists, the function raises an error. 
+ */ +static FdwXact +insert_fdwxact(Oid dbid, TransactionId xid, Oid serverid, Oid userid, + Oid umid, char *fdwxact_id) +{ + int i; + FdwXact fdwxact; + + Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE)); + + /* Check for duplicated foreign transaction entry */ + for (i = 0; i < FdwXactCtl->num_fdwxacts; i++) + { + fdwxact = FdwXactCtl->fdwxacts[i]; + if (fdwxact->dbid == dbid && + fdwxact->local_xid == xid && + fdwxact->serverid == serverid && + fdwxact->userid == userid) + ereport(ERROR, (errmsg("could not insert a foreign transaction entry"), + errdetail("duplicate entry with transaction id %u, serverid %u, userid %u", + xid, serverid, userid))); + } + + /* + * Get a next free foreign transaction entry. Raise error if there are + * none left. + */ + if (!FdwXactCtl->free_fdwxacts) + { + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("maximum number of foreign transactions reached"), + errhint("Increase max_prepared_foreign_transactions: \"%d\".", + max_prepared_foreign_xacts))); + } + fdwxact = FdwXactCtl->free_fdwxacts; + FdwXactCtl->free_fdwxacts = fdwxact->fdwxact_free_next; + + /* Insert the entry to shared memory array */ + Assert(FdwXactCtl->num_fdwxacts < max_prepared_foreign_xacts); + FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts++] = fdwxact; + + fdwxact->held_by = InvalidBackendId; + fdwxact->dbid = dbid; + fdwxact->local_xid = xid; + fdwxact->serverid = serverid; + fdwxact->userid = userid; + fdwxact->umid = umid; + fdwxact->insert_start_lsn = InvalidXLogRecPtr; + fdwxact->insert_end_lsn = InvalidXLogRecPtr; + fdwxact->valid = false; + fdwxact->ondisk = false; + fdwxact->inredo = false; + fdwxact->indoubt = false; + memcpy(fdwxact->fdwxact_id, fdwxact_id, strlen(fdwxact_id) + 1); + + return fdwxact; +} + +/* + * Remove the foreign prepared transaction entry from shared memory. + * Caller must hold FdwXactLock in exclusive mode. 
+ */
+static void
+remove_fdwxact(FdwXact fdwxact)
+{
+	int			i;
+
+	Assert(fdwxact != NULL);
+	Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+	if (FdwXactIsBeingResolved(fdwxact))
+		elog(ERROR, "cannot remove fdwxact entry that is being resolved");
+
+	/* Search for the slot where this entry resides */
+	for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+	{
+		if (FdwXactCtl->fdwxacts[i] == fdwxact)
+			break;
+	}
+
+	/* We did not find the given entry in the array */
+	if (i >= FdwXactCtl->num_fdwxacts)
+		ereport(ERROR,
+				(errmsg("could not remove a foreign transaction entry"),
+				 errdetail("failed to find entry for xid %u, foreign server %u, and user %u",
+						   fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));
+
+	elog(DEBUG2, "remove fdwxact entry id %s, xid %u db %u user %u",
+		 fdwxact->fdwxact_id, fdwxact->local_xid, fdwxact->dbid,
+		 fdwxact->userid);
+
+	/* Remove the entry from the active array */
+	FdwXactCtl->num_fdwxacts--;
+	FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
+
+	/* Put it back into the free list */
+	fdwxact->fdwxact_free_next = FdwXactCtl->free_fdwxacts;
+	FdwXactCtl->free_fdwxacts = fdwxact;
+
+	/* Reset information */
+	fdwxact->status = FDWXACT_STATUS_INVALID;
+	fdwxact->held_by = InvalidBackendId;
+	fdwxact->indoubt = false;
+
+	if (!RecoveryInProgress())
+	{
+		xl_fdwxact_remove record;
+		XLogRecPtr	recptr;
+
+		/* Fill up the log record before releasing the entry */
+		record.serverid = fdwxact->serverid;
+		record.dbid = fdwxact->dbid;
+		record.xid = fdwxact->local_xid;
+		record.userid = fdwxact->userid;
+
+		/*
+		 * Now writing FdwXact state data to WAL.  We have to set delayChkpt
+		 * here, otherwise a checkpoint starting immediately after the
+		 * WAL record is inserted could complete without fsync'ing our
+		 * state file.  (This is essentially the same kind of race condition
+		 * as the COMMIT-to-clog-write case that RecordTransactionCommit
+		 * uses delayChkpt for; see notes there.)
+		 */
+		START_CRIT_SECTION();
+
+		MyPgXact->delayChkpt = true;
+
+		/*
+		 * Log that we are removing the foreign transaction entry and
+		 * remove the file from the disk as well.
+		 */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &record, sizeof(xl_fdwxact_remove));
+		recptr = XLogInsert(RM_FDWXACT_ID, XLOG_FDWXACT_REMOVE);
+		XLogFlush(recptr);
+
+		/*
+		 * Now we can mark ourselves as out of the commit critical section: a
+		 * checkpoint starting after this will certainly see the entry as a
+		 * candidate for fsyncing.
+		 */
+		MyPgXact->delayChkpt = false;
+
+		END_CRIT_SECTION();
+	}
+}
+
+/*
+ * Return true, and set ForeignTwophaseCommitIsRequired, if the current
+ * transaction modified data on two or more servers, counting the remote
+ * servers in FdwXactParticipants and the local server itself.
+ */
+static bool
+is_foreign_twophase_commit_required(void)
+{
+	ListCell   *lc;
+	int			nserverswritten = 0;
+
+	if (!IsForeignTwophaseCommitRequested())
+		return false;
+
+	foreach(lc, FdwXactParticipants)
+	{
+		FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
+
+		if (fdw_part->modified)
+			nserverswritten++;
+	}
+
+	if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
+		++nserverswritten;
+
+	/*
+	 * Atomic commit is required if we modified data on two or more
+	 * participants.
+	 */
+	if (nserverswritten <= 1)
+		return false;
+
+	ForeignTwophaseCommitIsRequired = true;
+	return true;
+}
+
+bool
+FdwXactIsForeignTwophaseCommitRequired(void)
+{
+	return ForeignTwophaseCommitIsRequired;
+}
+
+/*
+ * Compute the oldest xmin across all unresolved foreign transactions
+ * and store it in the ProcArray.
+ */
+static void
+FdwXactComputeRequiredXmin(void)
+{
+	int				i;
+	TransactionId	agg_xmin = InvalidTransactionId;
+
+	Assert(FdwXactCtl != NULL);
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+	{
+		FdwXact		fdwxact = FdwXactCtl->fdwxacts[i];
+
+		if (!fdwxact->valid)
+			continue;
+
+		Assert(TransactionIdIsValid(fdwxact->local_xid));
+
+		if (!TransactionIdIsValid(agg_xmin) ||
+			TransactionIdPrecedes(fdwxact->local_xid, agg_xmin))
+			agg_xmin = fdwxact->local_xid;
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	ProcArraySetFdwXactUnresolvedXmin(agg_xmin);
+}
+
+/*
+ * Mark my foreign transaction participants as in-doubt and clear
+ * the FdwXactParticipants list.
+ *
+ * If any foreign transactions are left behind, update the oldest xmin of
+ * unresolved transactions so that the local transaction ids of in-doubt
+ * transactions are not truncated.
+ */
+static void
+ForgetAllFdwXactParticipants(void)
+{
+	ListCell   *cell;
+	int			n_lefts = 0;
+
+	if (FdwXactParticipants == NIL)
+		return;
+
+	foreach(cell, FdwXactParticipants)
+	{
+		FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(cell);
+		FdwXact		fdwxact = fdw_part->fdwxact;
+
+		/* Nothing to do if we didn't register an FdwXact entry yet */
+		if (!fdw_part->fdwxact)
+			continue;
+
+		/*
+		 * There is a race condition: the resolver process could remove the
+		 * FdwXact entry and another backend could reuse it before we forget
+		 * it here.  So we need to check that the entry is still associated
+		 * with our transaction.
+		 */
+		SpinLockAcquire(&fdwxact->mutex);
+		if (fdwxact->held_by == MyBackendId)
+		{
+			fdwxact->held_by = InvalidBackendId;
+			fdwxact->indoubt = true;
+			n_lefts++;
+		}
+		SpinLockRelease(&fdwxact->mutex);
+	}
+
+	/*
+	 * If we left any FdwXact entries behind, update the oldest local
+	 * transaction id of unresolved distributed transactions and hand the
+	 * entries over to the foreign transaction resolver.
+	 */
+	if (n_lefts > 0)
+	{
+		elog(DEBUG1, "left %d foreign transactions in in-doubt status", n_lefts);
+		FdwXactComputeRequiredXmin();
+	}
+
+	FdwXactParticipants = NIL;
+}
+
+/*
+ * When the process exits, forget all the entries.
+ */
+static void
+AtProcExit_FdwXact(int code, Datum arg)
+{
+	ForgetAllFdwXactParticipants();
+}
+
+void
+FdwXactCleanupAtProcExit(void)
+{
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+	{
+		LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+		LWLockRelease(FdwXactResolutionLock);
+	}
+}
+
+/*
+ * Wait for the foreign transaction to be resolved.
+ *
+ * A backend starts in state FDWXACT_NOT_WAITING and changes that state to
+ * FDWXACT_WAITING before adding itself to the wait queue.  During
+ * FdwXactResolveForeignTransaction, an fdwxact resolver changes the state
+ * to FDWXACT_WAIT_COMPLETE once all foreign transactions are resolved.
+ * The backend then resets its state to FDWXACT_NOT_WAITING.
+ * If a resolver fails to resolve the waiting transaction, it moves the
+ * backend to the retry queue.
+ *
+ * This function is inspired by SyncRepWaitForLSN.
+ */
+void
+FdwXactWaitToBeResolved(TransactionId wait_xid, bool is_commit)
+{
+	char	   *new_status = NULL;
+	const char *old_status;
+
+	Assert(FdwXactCtl != NULL);
+	Assert(TransactionIdIsValid(wait_xid));
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	Assert(MyProc->fdwXactState == FDWXACT_NOT_WAITING);
+
+	/* Quick exit if atomic commit is not requested */
+	if (!IsForeignTwophaseCommitRequested())
+		return;
+
+	/*
+	 * Also exit if the transaction itself has no foreign transaction
+	 * participants.
+	 */
+	if (FdwXactParticipants == NIL && wait_xid == MyPgXact->xid)
+		return;
+
+	/* Set backend status and enqueue ourselves in the active queue */
+	LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+	MyProc->fdwXactState = FDWXACT_WAITING;
+	MyProc->fdwXactWaitXid = wait_xid;
+	MyProc->fdwXactNextResolutionTs = GetCurrentTransactionStopTimestamp();
+	FdwXactQueueInsert(MyProc);
+	Assert(FdwXactQueueIsOrderedByTimestamp());
+	LWLockRelease(FdwXactResolutionLock);
+
+	/* Launch a resolver process if none is running yet, or wake one up */
+	FdwXactLaunchOrWakeupResolver();
+
+	/*
+	 * Alter ps display to show that we are waiting for foreign transaction
+	 * resolution.
+	 */
+	if (update_process_title)
+	{
+		int			len;
+
+		old_status = get_ps_display(&len);
+		new_status = (char *) palloc(len + 35);	/* " waiting for resolution " + 10 digits + '\0' */
+		memcpy(new_status, old_status, len);
+		sprintf(new_status + len, " waiting for resolution %u", wait_xid);
+		set_ps_display(new_status, false);
+		new_status[len] = '\0'; /* truncate off "waiting ..." */
+	}
+
+	/* Wait for all foreign transactions to be resolved */
+	for (;;)
+	{
+		/* Must reset the latch before testing state */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Acquiring the lock is not needed; the latch ensures proper
+		 * barriers.  If it looks like we're done, we must really be done,
+		 * because once the resolver changes the state to
+		 * FDWXACT_WAIT_COMPLETE, it will never update it again, so we can't
+		 * be seeing a stale value in that case.
+		 */
+		if (MyProc->fdwXactState == FDWXACT_WAIT_COMPLETE)
+			break;
+
+		/*
+		 * If a wait for foreign transaction resolution is pending, we can
+		 * neither acknowledge the commit nor raise ERROR or FATAL.  The
+		 * latter would lead the client to believe that the distributed
+		 * transaction aborted, which is not true: it's already committed
+		 * locally.
+		 * The former is no good either: the client has requested commit of
+		 * a distributed transaction, and is entitled to assume that an
+		 * acknowledged commit has also been committed on all foreign
+		 * servers, which might not be true.  So in this case we issue a
+		 * WARNING (which some clients may be able to interpret) and shut
+		 * off further output.  We do NOT reset ProcDiePending, so that the
+		 * process will die after the commit is cleaned up.
+		 */
+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for resolving foreign transaction and terminating connection due to administrator command"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If a query cancel interrupt arrives we just terminate the wait
+		 * with a suitable warning.  The foreign transactions can be
+		 * orphaned, but the foreign xact resolver can pick them up and try
+		 * to resolve them later.
+		 */
+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for resolving foreign transaction due to user request"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If the postmaster dies, we'll probably never get an
+		 * acknowledgement, because all the resolver processes will exit.
+		 * So just bail out.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * Wait on latch.  Any condition that should wake us up will set the
+		 * latch, so no need for timeout.
+		 */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+				  WAIT_EVENT_FDWXACT_RESOLUTION);
+	}
+
+	pg_read_barrier();
+
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	MyProc->fdwXactState = FDWXACT_NOT_WAITING;
+
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+}
+
+/*
+ * Return true if there is at least one backend in the wait queue that is
+ * connected to the given database.  The caller must hold
+ * FdwXactResolutionLock.
+ */
+bool
+FdwXactWaiterExists(Oid dbid)
+{
+	PGPROC	   *proc;
+
+	Assert(LWLockHeldByMeInMode(FdwXactResolutionLock, LW_SHARED));
+
+	proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+								   &(FdwXactRslvCtl->fdwxact_queue),
+								   offsetof(PGPROC, fdwXactLinks));
+
+	while (proc)
+	{
+		if (proc->databaseId == dbid)
+			return true;
+
+		proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	return false;
+}
+
+/*
+ * Insert the waiter into the wait queue in fdwXactNextResolutionTs order.
+ */
+static void
+FdwXactQueueInsert(PGPROC *waiter)
+{
+	PGPROC	   *proc;
+
+	Assert(LWLockHeldByMeInMode(FdwXactResolutionLock, LW_EXCLUSIVE));
+
+	proc = (PGPROC *) SHMQueuePrev(&(FdwXactRslvCtl->fdwxact_queue),
+								   &(FdwXactRslvCtl->fdwxact_queue),
+								   offsetof(PGPROC, fdwXactLinks));
+
+	while (proc)
+	{
+		if (proc->fdwXactNextResolutionTs < waiter->fdwXactNextResolutionTs)
+			break;
+
+		proc = (PGPROC *) SHMQueuePrev(&(FdwXactRslvCtl->fdwxact_queue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	if (proc)
+		SHMQueueInsertAfter(&(proc->fdwXactLinks), &(waiter->fdwXactLinks));
+	else
+		SHMQueueInsertAfter(&(FdwXactRslvCtl->fdwxact_queue), &(waiter->fdwXactLinks));
+}
+
+#ifdef USE_ASSERT_CHECKING
+static bool
+FdwXactQueueIsOrderedByTimestamp(void)
+{
+	PGPROC	   *proc;
+	TimestampTz lastTs;
+
+	proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+								   &(FdwXactRslvCtl->fdwxact_queue),
+								   offsetof(PGPROC, fdwXactLinks));
+	lastTs = 0;
+
+	while (proc)
+	{
+		if (proc->fdwXactNextResolutionTs < lastTs)
+			return false;
+
+		lastTs = proc->fdwXactNextResolutionTs;
+
+		proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	return true;
+}
+#endif
+
+/*
+ * Acquire FdwXactResolutionLock and cancel any wait currently in progress.
+ */
+static void
+FdwXactCancelWait(void)
+{
+	LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+	MyProc->fdwXactState = FDWXACT_NOT_WAITING;
+	LWLockRelease(FdwXactResolutionLock);
+}
+
+/*
+ * AtEOXact_FdwXacts
+ */
+void
+AtEOXact_FdwXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	if (!is_commit)
+	{
+		foreach(lcell, FdwXactParticipants)
+		{
+			FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lcell);
+
+			/*
+			 * If the foreign transaction has an FdwXact entry, we might
+			 * have prepared it.
+			 * Skip already-prepared foreign transactions, because they have
+			 * already closed their transactions.  But we cannot be sure
+			 * whether a foreign transaction with status ==
+			 * FDWXACT_STATUS_PREPARING has been prepared or not, so we call
+			 * the rollback API to close its transaction for safety.  Any
+			 * prepared foreign transactions that we might have will be
+			 * resolved by the foreign transaction resolver.
+			 */
+			if (fdw_part->fdwxact)
+			{
+				bool		is_prepared;
+
+				LWLockAcquire(FdwXactLock, LW_SHARED);
+				is_prepared = fdw_part->fdwxact &&
+					fdw_part->fdwxact->status == FDWXACT_STATUS_PREPARED;
+				LWLockRelease(FdwXactLock);
+
+				if (is_prepared)
+					continue;
+			}
+
+			/* One-phase rollback of the foreign transaction */
+			FdwXactOnePhaseEndForeignTransaction(fdw_part, false);
+		}
+	}
+
+	/*
+	 * In commit cases, we have already prepared foreign transactions during
+	 * the pre-commit phase, and those prepared transactions will be
+	 * resolved by the resolver process.
+	 */
+
+	ForgetAllFdwXactParticipants();
+	ForeignTwophaseCommitIsRequired = false;
+}
+
+/*
+ * Prepare foreign transactions.
+ *
+ * Note that it's possible for the transaction to abort after we have
+ * prepared some of the participants.  In that case we change over to
+ * rollback and roll back all foreign transactions.
+ */
+void
+AtPrepare_FdwXacts(void)
+{
+	if (FdwXactParticipants == NIL)
+		return;
+
+	/* Check for an invalid condition */
+	if (!IsForeignTwophaseCommitRequested())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot PREPARE a distributed transaction when foreign_twophase_commit is 'disabled'")));
+
+	/*
+	 * We cannot prepare if any participating foreign server is not capable
+	 * of two-phase commit.
+	 */
+	if (is_foreign_twophase_commit_required() &&
+		(MyXactFlags & XACT_FLAGS_FDWNOPREPARE) != 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot prepare the transaction because some foreign servers involved in the transaction cannot prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	FdwXactPrepareForeignTransactions();
+
+	FdwXactParticipants = NIL;
+}
+
+/*
+ * Return one backend that connects to my database and is waiting for
+ * resolution.
+ */
+PGPROC *
+FdwXactGetWaiter(TimestampTz *nextResolutionTs_p, TransactionId *waitXid_p)
+{
+	PGPROC	   *proc;
+
+	LWLockAcquire(FdwXactResolutionLock, LW_SHARED);
+	Assert(FdwXactQueueIsOrderedByTimestamp());
+
+	proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+								   &(FdwXactRslvCtl->fdwxact_queue),
+								   offsetof(PGPROC, fdwXactLinks));
+
+	while (proc)
+	{
+		if (proc->databaseId == MyDatabaseId)
+			break;
+
+		proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->fdwxact_queue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	if (proc)
+	{
+		*nextResolutionTs_p = proc->fdwXactNextResolutionTs;
+		*waitXid_p = proc->fdwXactWaitXid;
+	}
+	else
+	{
+		*nextResolutionTs_p = -1;
+		*waitXid_p = InvalidTransactionId;
+	}
+
+	LWLockRelease(FdwXactResolutionLock);
+
+	return proc;
+}
+
+/*
+ * Get one FdwXact entry to resolve.  This function is intended to be used
+ * when a resolver process fetches FdwXact entries to resolve, so the
+ * search excludes in-doubt transactions and in-progress transactions.
+ */
+static FdwXact
+get_fdwxact_to_resolve(Oid dbid, TransactionId xid)
+{
+	List	   *fdwxacts = NIL;
+
+	Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+	/* Include neither in-doubt transactions nor in-progress transactions */
+	fdwxacts = get_fdwxacts(dbid, xid, InvalidOid, InvalidOid,
+							false, false, false);
+
+	return fdwxacts == NIL ?
+		NULL : (FdwXact) linitial(fdwxacts);
+}
+
+/*
+ * Resolve one distributed transaction on the given database.  The target
+ * distributed transaction is fetched from the wait queue and its
+ * transaction participants are fetched from the global array.
+ *
+ * Release the waiter once all of its foreign transaction participants have
+ * been resolved.  On failure, we re-enqueue the waiting backend after
+ * incrementing its next resolution time.
+ */
+void
+FdwXactResolveTransactionAndReleaseWaiter(Oid dbid, TransactionId xid,
+										  PGPROC *waiter)
+{
+	FdwXact		fdwxact;
+
+	Assert(TransactionIdIsValid(xid));
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	while ((fdwxact = get_fdwxact_to_resolve(MyDatabaseId, xid)) != NULL)
+	{
+		FdwXactRslvState *state;
+		ForeignServer *server;
+		UserMapping *usermapping;
+
+		CHECK_FOR_INTERRUPTS();
+
+		server = GetForeignServer(fdwxact->serverid);
+		usermapping = GetUserMapping(fdwxact->userid, fdwxact->serverid);
+
+		state = create_fdwxact_state();
+		SpinLockAcquire(&fdwxact->mutex);
+		state->server = server;
+		state->usermapping = usermapping;
+		state->fdwxact_id = pstrdup(fdwxact->fdwxact_id);
+		SpinLockRelease(&fdwxact->mutex);
+
+		FdwXactDetermineTransactionFate(fdwxact, false);
+
+		/* Do not hold the lock during foreign transaction resolution */
+		LWLockRelease(FdwXactLock);
+
+		PG_TRY();
+		{
+			/*
+			 * Resolve the foreign transaction.  When committing or aborting
+			 * prepared foreign transactions the previous status is always
+			 * FDWXACT_STATUS_PREPARED.
+			 */
+			FdwXactResolveForeignTransaction(fdwxact, state,
+											 FDWXACT_STATUS_PREPARED);
+		}
+		PG_CATCH();
+		{
+			/*
+			 * Failed to resolve.  Re-insert the waiter at the tail of the
+			 * retry queue if the waiter is still waiting.
+			 */
+			LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+			if (waiter->fdwXactState == FDWXACT_WAITING)
+			{
+				SHMQueueDelete(&(waiter->fdwXactLinks));
+				pg_write_barrier();
+				waiter->fdwXactNextResolutionTs =
+					TimestampTzPlusMilliseconds(waiter->fdwXactNextResolutionTs,
+												foreign_xact_resolution_retry_interval);
+				FdwXactQueueInsert(waiter);
+			}
+			LWLockRelease(FdwXactResolutionLock);
+
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		elog(DEBUG2, "resolved one foreign transaction xid %u, serverid %u, userid %u",
+			 fdwxact->local_xid, fdwxact->serverid, fdwxact->userid);
+
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	LWLockAcquire(FdwXactResolutionLock, LW_EXCLUSIVE);
+
+	/*
+	 * Remove the waiter from the shmem queue, if not detached yet.  The
+	 * waiter could already be detached if the user cancelled the wait
+	 * before resolution.
+	 */
+	if (!SHMQueueIsDetached(&(waiter->fdwXactLinks)))
+	{
+		TransactionId wait_xid = waiter->fdwXactWaitXid;
+
+		SHMQueueDelete(&(waiter->fdwXactLinks));
+		pg_write_barrier();
+
+		/* Set state to complete */
+		waiter->fdwXactState = FDWXACT_WAIT_COMPLETE;
+
+		/* Wake the waiter only after setting the state and removing it from the queue */
+		SetLatch(&(waiter->procLatch));
+
+		elog(DEBUG2, "released the proc with xid %u", wait_xid);
+	}
+	else
+		elog(DEBUG2, "the waiter backend has already been detached");
+
+	LWLockRelease(FdwXactResolutionLock);
+}
+
+/*
+ * Determine whether the given foreign transaction should be committed or
+ * rolled back according to the result of the local transaction.  This
+ * function changes fdwxact->status, so the caller must hold FdwXactLock in
+ * exclusive mode or pass need_lock = true.
+ */
+static void
+FdwXactDetermineTransactionFate(FdwXact fdwxact, bool need_lock)
+{
+	if (need_lock)
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/*
+	 * The transaction being resolved must either have been cancelled and
+	 * marked in-doubt, or have been prepared.
+	 */
+	Assert(fdwxact->indoubt ||
+		   fdwxact->status == FDWXACT_STATUS_PREPARED);
+
+	/*
+	 * If the local transaction is already committed, commit the prepared
+	 * foreign transaction.
+	 */
+	if (TransactionIdDidCommit(fdwxact->local_xid))
+		fdwxact->status = FDWXACT_STATUS_COMMITTING;
+
+	/*
+	 * If the local transaction is already aborted, abort the prepared
+	 * foreign transaction.
+	 */
+	else if (TransactionIdDidAbort(fdwxact->local_xid))
+		fdwxact->status = FDWXACT_STATUS_ABORTING;
+
+	/*
+	 * The local transaction is not in progress, but the foreign transaction
+	 * is not prepared on the foreign server.  This can happen when the
+	 * transaction failed after registering this entry but before actually
+	 * preparing it on the foreign server.  So let's assume it aborted.
+	 */
+	else if (!TransactionIdIsInProgress(fdwxact->local_xid))
+		fdwxact->status = FDWXACT_STATUS_ABORTING;
+
+	/*
+	 * The local transaction is in progress and the foreign transaction is
+	 * about to be committed or aborted.  This should not happen except for
+	 * the case where the local transaction is prepared and this foreign
+	 * transaction is being resolved manually by pg_resolve_foreign_xact().
+	 * Raise an error anyway, since we cannot determine the fate of this
+	 * foreign transaction from a local transaction whose fate is also not
+	 * yet determined.
+	 */
+	else
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot resolve the foreign transaction associated with in-progress transaction %u on server %u",
+						fdwxact->local_xid, fdwxact->serverid),
+				 errhint("The local transaction with xid %u might have been prepared.",
+						 fdwxact->local_xid)));
+
+	if (need_lock)
+		LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Resolve the foreign transaction using the foreign data wrapper's
+ * transaction callback function.  The 'state' is passed to the callback
+ * function.  The fate of the foreign transaction must already be
+ * determined.
+ * If the foreign transaction is resolved successfully, remove the FdwXact
+ * entry from shared memory and also remove the corresponding on-disk file.
+ * On failure, the status of the FdwXact entry is set back to
+ * 'fallback_status' before erroring out.
+ */
+static void
+FdwXactResolveForeignTransaction(FdwXact fdwxact, FdwXactRslvState *state,
+								 FdwXactStatus fallback_status)
+{
+	ForeignServer *server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+	bool		is_commit;
+
+	Assert(state != NULL);
+	Assert(state->server && state->usermapping && state->fdwxact_id);
+	Assert(fdwxact != NULL);
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	if (fdwxact->status != FDWXACT_STATUS_COMMITTING &&
+		fdwxact->status != FDWXACT_STATUS_ABORTING)
+		elog(ERROR, "cannot resolve foreign transaction whose fate is not determined");
+
+	is_commit = fdwxact->status == FDWXACT_STATUS_COMMITTING;
+	LWLockRelease(FdwXactLock);
+
+	server = GetForeignServer(fdwxact->serverid);
+	fdw = GetForeignDataWrapper(server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+
+	PG_TRY();
+	{
+		if (is_commit)
+			fdw_routine->CommitForeignTransaction(state);
+		else
+			fdw_routine->RollbackForeignTransaction(state);
+	}
+	PG_CATCH();
+	{
+		/* Fall back to the fallback status */
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+		fdwxact->status = fallback_status;
+		LWLockRelease(FdwXactLock);
+
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
+
+	/* Resolution was a success, remove the entry */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	elog(DEBUG1, "successfully %s the foreign transaction with xid %u db %u server %u user %u",
+		 is_commit ?
+		 "committed" : "rolled back",
+		 fdwxact->local_xid, fdwxact->dbid, fdwxact->serverid,
+		 fdwxact->userid);
+
+	fdwxact->status = FDWXACT_STATUS_RESOLVED;
+	if (fdwxact->ondisk)
+		RemoveFdwXactFile(fdwxact->dbid, fdwxact->local_xid,
+						  fdwxact->serverid, fdwxact->userid,
+						  true);
+	remove_fdwxact(fdwxact);
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Return a palloc'd and initialized FdwXactRslvState.
+ */
+static FdwXactRslvState *
+create_fdwxact_state(void)
+{
+	FdwXactRslvState *state;
+
+	state = palloc(sizeof(FdwXactRslvState));
+	state->server = NULL;
+	state->usermapping = NULL;
+	state->fdwxact_id = NULL;
+	state->flags = 0;
+
+	return state;
+}
+
+/*
+ * Return the one FdwXact entry that matches the given arguments, or NULL
+ * if there is none.  All arguments must be valid values, so that the
+ * search identifies exactly one (or no) entry.  Note that this function is
+ * intended to be used for modifying the returned FdwXact entry, so the
+ * caller must hold FdwXactLock in exclusive mode; in-progress FdwXact
+ * entries are not included.
+ */
+static FdwXact
+get_one_fdwxact(Oid dbid, TransactionId xid, Oid serverid, Oid userid)
+{
+	List	   *fdwxact_list;
+
+	Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE));
+
+	/* All search conditions must be valid values */
+	Assert(TransactionIdIsValid(xid));
+	Assert(OidIsValid(serverid));
+	Assert(OidIsValid(userid));
+	Assert(OidIsValid(dbid));
+
+	/* Include in-doubt transactions but not in-progress ones */
+	fdwxact_list = get_fdwxacts(dbid, xid, serverid, userid,
+								true, false, false);
+
+	/* Must be at most one entry since we search by the unique key */
+	Assert(list_length(fdwxact_list) <= 1);
+
+	/* Could not find an entry */
+	if (fdwxact_list == NIL)
+		return NULL;
+
+	return (FdwXact) linitial(fdwxact_list);
+}
+
+/*
+ * Return true if there is at least one prepared foreign transaction
+ * matching the given arguments.
+ */
+bool
+fdwxact_exists(Oid dbid, Oid serverid, Oid userid)
+{
+	List	   *fdwxact_list;
+
+	/* Search among all FdwXact entries */
+	fdwxact_list = get_fdwxacts(dbid, InvalidTransactionId, serverid,
+								userid, true, true, true);
+
+	return fdwxact_list != NIL;
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_foreign_xacts, and stores the number of entries into *num_p.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled in yet.  The caller should filter them out if they are
+ * not wanted.
+ *
+ * The returned array is palloc'd.
+ */
+static FdwXact
+get_all_fdwxacts(int *num_p)
+{
+	List	   *all_fdwxacts;
+	ListCell   *lc;
+	FdwXact		fdwxacts;
+	int			num_fdwxacts = 0;
+
+	Assert(num_p != NULL);
+
+	/* Get all entries */
+	all_fdwxacts = get_fdwxacts(InvalidOid, InvalidTransactionId,
+								InvalidOid, InvalidOid, true,
+								true, true);
+
+	if (all_fdwxacts == NIL)
+	{
+		*num_p = 0;
+		return NULL;
+	}
+
+	fdwxacts = (FdwXact)
+		palloc(sizeof(FdwXactData) * list_length(all_fdwxacts));
+	*num_p = list_length(all_fdwxacts);
+
+	/* Convert the list to an array of FdwXactData */
+	foreach(lc, all_fdwxacts)
+	{
+		FdwXact		fx = (FdwXact) lfirst(lc);
+
+		memcpy(fdwxacts + num_fdwxacts, fx,
+			   sizeof(FdwXactData));
+		num_fdwxacts++;
+	}
+
+	list_free(all_fdwxacts);
+
+	return fdwxacts;
+}
+
+/*
+ * Return a list of FdwXact entries matching the given arguments, or NIL if
+ * none match.  The search condition is defined by the arguments that carry
+ * valid values for their respective datatypes.  'include_indoubt' and
+ * 'include_in_progress' control whether the result includes in-doubt
+ * transactions and in-progress transactions, respectively.
+ */
+static List *
+get_fdwxacts(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+			 bool include_indoubt, bool include_in_progress, bool need_lock)
+{
+	int			i;
+	List	   *fdwxact_list = NIL;
+
+	if (need_lock)
+		LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
+	{
+		FdwXact		fdwxact = FdwXactCtl->fdwxacts[i];
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdwxact->dbid != dbid)
+			continue;
+
+		/* xid */
+		if (TransactionIdIsValid(xid) && xid != fdwxact->local_xid)
+			continue;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdwxact->serverid)
+			continue;
+
+		/* userid */
+		if (OidIsValid(userid) && fdwxact->userid != userid)
+			continue;
+
+		/* include in-doubt transactions? */
+		if (!include_indoubt && fdwxact->indoubt)
+			continue;
+
+		/* include in-progress transactions? */
+		if (!include_in_progress && FdwXactIsBeingResolved(fdwxact))
+			continue;
+
+		/* Append it since it matched */
+		fdwxact_list = lappend(fdwxact_list, fdwxact);
+	}
+
+	if (need_lock)
+		LWLockRelease(FdwXactLock);
+
+	return fdwxact_list;
+}
+
+/* Apply the redo log for a foreign transaction */
+void
+fdwxact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDWXACT_INSERT)
+	{
+		/*
+		 * Add an fdwxact entry and set the start/end LSNs of the WAL record
+		 * in the FdwXact entry.
+		 */
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+		FdwXactRedoAdd(XLogRecGetData(record),
+					   record->ReadRecPtr,
+					   record->EndRecPtr);
+		LWLockRelease(FdwXactLock);
+	}
+	else if (info == XLOG_FDWXACT_REMOVE)
+	{
+		xl_fdwxact_remove *rrec = (xl_fdwxact_remove *) rec;
+
+		/* Delete the FdwXact entry and its file, if it exists */
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+		FdwXactRedoRemove(rrec->dbid, rrec->xid, rrec->serverid,
+						  rrec->userid, false);
+		LWLockRelease(FdwXactLock);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+}
+
+/*
+ * Return a null-terminated foreign transaction identifier.  If the given
+ * foreign server's FDW provides the getPrepareId callback, we return the
+ * identifier it returns.  Otherwise we generate a unique identifier of the
+ * form "fx_<random number>_<xid>_<serverid>_<userid>", whose length is
+ * less than FDWXACT_ID_MAX_LEN.
+ *
+ * The returned string value is used to identify the foreign transaction.
+ * The identifier must not be the same as that of any other concurrent
+ * prepared transaction.
+ *
+ * To make the foreign transaction id unique, we should ideally use
+ * something like UUID, which gives unique ids with high probability, but
+ * that may be expensive here, and the UUID extension which provides a
+ * function to generate UUIDs is not part of the core code.
+ */
+static char *
+get_fdwxact_identifier(FdwXactParticipant *fdw_part, TransactionId xid)
+{
+	char	   *id;
+	int			id_len = 0;
+
+	if (!fdw_part->get_prepareid_fn)
+	{
+		char		buf[FDWXACT_ID_MAX_LEN] = {0};
+
+		/*
+		 * The FDW doesn't provide the callback function; generate a unique
+		 * identifier.
+		 */
+		snprintf(buf, FDWXACT_ID_MAX_LEN, "fx_%ld_%u_%u_%u",
+				 Abs(random()), xid, fdw_part->server->serverid,
+				 fdw_part->usermapping->userid);
+
+		return pstrdup(buf);
+	}
+
+	/* Get a unique identifier from the callback function */
+	id = fdw_part->get_prepareid_fn(xid, fdw_part->server->serverid,
+									fdw_part->usermapping->userid,
+									&id_len);
+
+	if (id == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 (errmsg("foreign transaction identifier is not provided"))));
+
+	/* Check the length of the foreign transaction identifier */
+	if (id_len > FDWXACT_ID_MAX_LEN)
+	{
+		id[FDWXACT_ID_MAX_LEN] = '\0';
+		ereport(ERROR,
+				(errcode(ERRCODE_NAME_TOO_LONG),
+				 errmsg("foreign transaction identifier \"%s\" is too long",
+						id),
+				 errdetail("Foreign transaction identifier must be less than %d characters.",
+						   FDWXACT_ID_MAX_LEN)));
+	}
+
+	id[id_len] = '\0';
+	return pstrdup(id);
+}
+
+/*
+ * We must fsync any foreign transaction state file that is valid or
+ * generated during redo and has an inserted LSN <= the checkpoint's redo
+ * horizon.  The foreign transaction entries, and hence the corresponding
+ * files, are expected to be very short-lived.  By executing this function
+ * late, we might have fewer files to fsync, thus reducing some I/O.  This
+ * is similar to CheckPointTwoPhase().
+ *
+ * This is deliberately run as late as possible in the checkpoint sequence,
+ * because FdwXacts ordinarily have short lifespans, and so it is quite
+ * possible that FdwXacts that were valid at checkpoint start will no
+ * longer exist if we wait a little bit.  With typical checkpoint settings
+ * this will be about 3 minutes for an online checkpoint, so as a result we
+ * expect that there will be no FdwXacts that need to be copied to disk.
+ *
+ * If an FdwXact remains valid across multiple checkpoints, it will already
+ * be on disk so we don't bother to repeat that write.
+ */ +void +CheckPointFdwXacts(XLogRecPtr redo_horizon) +{ + int cnt; + int serialized_fdwxacts = 0; + + if (max_prepared_foreign_xacts <= 0) + return; /* nothing to do */ + + TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START(); + + /* + * We are expecting there to be zero FdwXacts that need to be copied to + * disk, so we perform all I/O while holding FdwXactLock for simplicity. + * This prevents any new foreign xacts from preparing while this occurs, + * which shouldn't be a problem since the presence of long-lived prepared + * foreign xacts indicates the transaction manager isn't active. + * + * It's also possible to move the I/O out of the lock, but on every error + * we would have to check whether somebody committed our transaction in a + * different backend. Let's leave this optimization for the future, in + * case somebody spots that this place causes a bottleneck. + * + * Note that it isn't possible for there to be an FdwXact with an + * insert_end_lsn set prior to the last checkpoint that is still marked + * invalid, because of the efforts with delayChkpt. + */ + LWLockAcquire(FdwXactLock, LW_SHARED); + for (cnt = 0; cnt < FdwXactCtl->num_fdwxacts; cnt++) + { + FdwXact fdwxact = FdwXactCtl->fdwxacts[cnt]; + + if ((fdwxact->valid || fdwxact->inredo) && + !fdwxact->ondisk && + fdwxact->insert_end_lsn <= redo_horizon) + { + char *buf; + int len; + + XlogReadFdwXactData(fdwxact->insert_start_lsn, &buf, &len); + RecreateFdwXactFile(fdwxact->dbid, fdwxact->local_xid, + fdwxact->serverid, fdwxact->userid, + buf, len); + fdwxact->ondisk = true; + fdwxact->insert_start_lsn = InvalidXLogRecPtr; + fdwxact->insert_end_lsn = InvalidXLogRecPtr; + pfree(buf); + serialized_fdwxacts++; + } + } + + LWLockRelease(FdwXactLock); + + /* + * Unconditionally flush the parent directory to make any information + * durable on disk. FdwXact files could have been removed and those + * removals need to be made persistent as well as any files newly created. 
+ */ + fsync_fname(FDWXACTS_DIR, true); + + TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE(); + + if (log_checkpoints && serialized_fdwxacts > 0) + ereport(LOG, + (errmsg_plural("%u foreign transaction state file was written " + "for long-running prepared transactions", + "%u foreign transaction state files were written " + "for long-running prepared transactions", + serialized_fdwxacts, + serialized_fdwxacts))); +} + +/* + * Reads foreign transaction data from xlog. During checkpoint this data will + * be moved to fdwxact files and ReadFdwXactFile should be used instead. + * + * Note clearly that this function accesses WAL during normal operation, similarly + * to the way WALSender or Logical Decoding would do. It does not run during + * crash recovery or standby processing. + */ +static void +XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len) +{ + XLogRecord *record; + XLogReaderState *xlogreader; + char *errormsg; + + xlogreader = XLogReaderAllocate(wal_segment_size, NULL, + &read_local_xlog_page, NULL); + if (!xlogreader) + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"), + errdetail("Failed while allocating an XLog reading processor."))); + + record = XLogReadRecord(xlogreader, lsn, &errormsg); + if (record == NULL) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read foreign transaction state from xlog at %X/%X", + (uint32) (lsn >> 32), + (uint32) lsn))); + + if (XLogRecGetRmid(xlogreader) != RM_FDWXACT_ID || + (XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDWXACT_INSERT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("expected foreign transaction state data is not present in xlog at %X/%X", + (uint32) (lsn >> 32), + (uint32) lsn))); + + if (len != NULL) + *len = XLogRecGetDataLen(xlogreader); + + *buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader)); + memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader)); + + XLogReaderFree(xlogreader); +} + +/* + * 
Recreates a foreign transaction state file. This is used in WAL replay + * and during checkpoint creation. + * + * Note: content and len don't include CRC. + */ +void +RecreateFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, void *content, int len) +{ + char path[MAXPGPATH]; + pg_crc32c statefile_crc; + int fd; + + /* Recompute CRC */ + INIT_CRC32C(statefile_crc); + COMP_CRC32C(statefile_crc, content, len); + FIN_CRC32C(statefile_crc); + + FdwXactFilePath(path, dbid, xid, serverid, userid); + + fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY); + + if (fd < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not recreate foreign transaction state file \"%s\": %m", + path))); + + /* Write content and CRC */ + pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_WRITE); + if (write(fd, content, len) != len) + { + /* if write didn't set errno, assume problem is no disk space */ + if (errno == 0) + errno = ENOSPC; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write foreign transaction state file: %m"))); + } + if (write(fd, &statefile_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c)) + { + if (errno == 0) + errno = ENOSPC; + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write foreign transaction state file: %m"))); + } + pgstat_report_wait_end(); + + /* + * We must fsync the file because the end-of-replay checkpoint will not do + * so, there being no FDWXACT in shared memory yet to tell it to. 
+ */ + pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_SYNC); + if (pg_fsync(fd) != 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not fsync foreign transaction state file: %m"))); + pgstat_report_wait_end(); + + if (CloseTransientFile(fd) != 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close foreign transaction file: %m"))); +} + +/* + * Given a transaction id, userid and serverid, read the foreign transaction + * state either from disk or directly from WAL via the provided + * "insert_start_lsn". + */ +static char * +ProcessFdwXactBuffer(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, XLogRecPtr insert_start_lsn, bool fromdisk) +{ + TransactionId origNextXid = + XidFromFullTransactionId(ShmemVariableCache->nextFullXid); + char *buf; + + Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE)); + + if (!fromdisk) + Assert(!XLogRecPtrIsInvalid(insert_start_lsn)); + + /* Reject XID if too new */ + if (TransactionIdFollowsOrEquals(xid, origNextXid)) + { + if (fromdisk) + { + ereport(WARNING, + (errmsg("removing future fdwxact state file for xid %u, server %u and user %u", + xid, serverid, userid))); + RemoveFdwXactFile(dbid, xid, serverid, userid, true); + } + else + { + ereport(WARNING, + (errmsg("removing future fdwxact state from memory for xid %u, server %u and user %u", + xid, serverid, userid))); + FdwXactRedoRemove(dbid, xid, serverid, userid, true); + } + return NULL; + } + + if (fromdisk) + { + /* Read and validate file */ + buf = ReadFdwXactFile(dbid, xid, serverid, userid); + } + else + { + /* Read xlog data */ + XlogReadFdwXactData(insert_start_lsn, &buf, NULL); + } + + return buf; +} + +/* + * Read and validate the foreign transaction state file. + * + * If it looks OK (has a valid magic number and CRC), return the palloc'd + * contents of the file, issuing an error when finding corrupted data. + * This state can be reached when doing recovery. 
+ */ +static char * +ReadFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, Oid userid) +{ + char path[MAXPGPATH]; + int fd; + FdwXactOnDiskData *fdwxact_file_data; + struct stat stat; + uint32 crc_offset; + pg_crc32c calc_crc; + pg_crc32c file_crc; + char *buf; + int r; + + FdwXactFilePath(path, dbid, xid, serverid, userid); + + fd = OpenTransientFile(path, O_RDONLY | PG_BINARY); + if (fd < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not open FDW transaction state file \"%s\": %m", + path))); + + /* + * Check file length. We can determine a lower bound pretty easily. We + * set an upper bound to avoid palloc() failure on a corrupt file, though + * we can't guarantee that we won't get an out of memory error anyway, + * even on a valid file. + */ + if (fstat(fd, &stat)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not stat FDW transaction state file \"%s\": %m", + path))); + + if (stat.st_size < (offsetof(FdwXactOnDiskData, fdwxact_id) + + sizeof(pg_crc32c)) || + stat.st_size > MaxAllocSize) + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("too large FDW transaction state file \"%s\": %m", + path))); + + crc_offset = stat.st_size - sizeof(pg_crc32c); + if (crc_offset != MAXALIGN(crc_offset)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("incorrect alignment of CRC offset for file \"%s\"", + path))); + + /* + * Ok, slurp in the file. 
+ */ + buf = (char *) palloc(stat.st_size); + + /* Slurp the file */ + pgstat_report_wait_start(WAIT_EVENT_FDWXACT_FILE_READ); + r = read(fd, buf, stat.st_size); + if (r != stat.st_size) + { + if (r < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not read file \"%s\": %m", path))); + else + ereport(ERROR, + (errmsg("could not read file \"%s\": read %d of %zu", + path, r, (Size) stat.st_size))); + } + pgstat_report_wait_end(); + + if (CloseTransientFile(fd)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not close file \"%s\": %m", path))); + + /* + * Check the CRC. + */ + INIT_CRC32C(calc_crc); + COMP_CRC32C(calc_crc, buf, crc_offset); + FIN_CRC32C(calc_crc); + + file_crc = *((pg_crc32c *) (buf + crc_offset)); + + if (!EQ_CRC32C(calc_crc, file_crc)) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("calculated CRC checksum does not match value stored in file \"%s\"", + path))); + + /* Check that the contents match the expected data */ + fdwxact_file_data = (FdwXactOnDiskData *) buf; + if (fdwxact_file_data->dbid != dbid || + fdwxact_file_data->serverid != serverid || + fdwxact_file_data->userid != userid || + fdwxact_file_data->local_xid != xid) + ereport(ERROR, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid foreign transaction state file \"%s\"", + path))); + + return buf; +} + +/* + * Scan the shared memory entries of FdwXact and determine the range of valid + * XIDs present. This is run during database startup, after we have completed + * reading WAL. ShmemVariableCache->nextFullXid has been set to one more than + * the highest XID for which evidence exists in WAL. + * + * On corrupted state files, fail immediately. Keeping broken entries around + * and letting replay continue causes harm to the system, and a new backup + * should be rolled in. + * + * Our other responsibility is to update and return the oldest valid XID + * among the distributed transactions. 
This is needed to synchronize pg_subtrans + * startup properly. + */ +TransactionId +PrescanFdwXacts(TransactionId oldestActiveXid) +{ + FullTransactionId nextFullXid = ShmemVariableCache->nextFullXid; + TransactionId origNextXid = XidFromFullTransactionId(nextFullXid); + TransactionId result = origNextXid; + int i; + + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + for (i = 0; i < FdwXactCtl->num_fdwxacts; i++) + { + FdwXact fdwxact = FdwXactCtl->fdwxacts[i]; + char *buf; + + buf = ProcessFdwXactBuffer(fdwxact->dbid, fdwxact->local_xid, + fdwxact->serverid, fdwxact->userid, + fdwxact->insert_start_lsn, fdwxact->ondisk); + + if (buf == NULL) + continue; + + if (TransactionIdPrecedes(fdwxact->local_xid, result)) + result = fdwxact->local_xid; + + pfree(buf); + } + LWLockRelease(FdwXactLock); + + return result; +} + +/* + * Scan pg_fdwxact and fill FdwXact depending on the on-disk data. + * This is called once at the beginning of recovery, saving any extra + * lookups in the future. FdwXact files that are newer than the + * minimum XID horizon are discarded on the way. 
+ */ +void +restoreFdwXactData(void) +{ + DIR *cldir; + struct dirent *clde; + + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + cldir = AllocateDir(FDWXACTS_DIR); + while ((clde = ReadDir(cldir, FDWXACTS_DIR)) != NULL) + { + if (strlen(clde->d_name) == FDWXACT_FILE_NAME_LEN && + strspn(clde->d_name, "0123456789ABCDEF_") == FDWXACT_FILE_NAME_LEN) + { + TransactionId local_xid; + Oid dbid; + Oid serverid; + Oid userid; + char *buf; + + sscanf(clde->d_name, "%08x_%08x_%08x_%08x", + &dbid, &local_xid, &serverid, &userid); + + /* Read fdwxact data from disk */ + buf = ProcessFdwXactBuffer(dbid, local_xid, serverid, userid, + InvalidXLogRecPtr, true); + + if (buf == NULL) + continue; + + /* Add this entry into the table of foreign transactions */ + FdwXactRedoAdd(buf, InvalidXLogRecPtr, InvalidXLogRecPtr); + } + } + + LWLockRelease(FdwXactLock); + FreeDir(cldir); +} + +/* + * Remove the foreign transaction file for given entry. + * + * If giveWarning is false, do not complain about file-not-present; + * this is an expected case during WAL replay. + */ +static void +RemoveFdwXactFile(Oid dbid, TransactionId xid, Oid serverid, Oid userid, + bool giveWarning) +{ + char path[MAXPGPATH]; + + FdwXactFilePath(path, dbid, xid, serverid, userid); + if (unlink(path) < 0 && (errno != ENOENT || giveWarning)) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove foreign transaction state file \"%s\": %m", + path))); +} + +/* + * Store pointer to the start/end of the WAL record along with the xid in + * a fdwxact entry in shared memory FdwXactData structure. + */ +static void +FdwXactRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn) +{ + FdwXactOnDiskData *fdwxact_data = (FdwXactOnDiskData *) buf; + FdwXact fdwxact; + + Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE)); + Assert(RecoveryInProgress()); + + /* + * Add this entry into the table of foreign transactions. 
+ */ + fdwxact = insert_fdwxact(fdwxact_data->dbid, fdwxact_data->local_xid, + fdwxact_data->serverid, fdwxact_data->userid, + fdwxact_data->umid, fdwxact_data->fdwxact_id); + + elog(DEBUG2, "added fdwxact entry in shared memory for foreign transaction, db %u xid %u server %u user %u id %s", + fdwxact_data->dbid, fdwxact_data->local_xid, + fdwxact_data->serverid, fdwxact_data->userid, + fdwxact_data->fdwxact_id); + + /* + * Set status as PREPARED and as in-doubt, since we do not know + * the xact status right now. Resolver will set it later based on + * the status of the local transaction that prepared this fdwxact + * entry. + */ + fdwxact->status = FDWXACT_STATUS_PREPARED; + fdwxact->insert_start_lsn = start_lsn; + fdwxact->insert_end_lsn = end_lsn; + fdwxact->inredo = true; /* added in redo */ + fdwxact->indoubt = true; + fdwxact->valid = false; + fdwxact->ondisk = XLogRecPtrIsInvalid(start_lsn); +} + +/* + * Remove the corresponding fdwxact entry from FdwXactCtl. Also remove + * the FdwXact file if the foreign transaction was saved via an earlier + * checkpoint. The FdwXact entry may not be found in the case where crash + * recovery starts from a point after the entry was added but before it + * was removed. 
+ */ +void +FdwXactRedoRemove(Oid dbid, TransactionId xid, Oid serverid, + Oid userid, bool givewarning) +{ + FdwXact fdwxact; + + Assert(LWLockHeldByMeInMode(FdwXactLock, LW_EXCLUSIVE)); + Assert(RecoveryInProgress()); + + fdwxact = get_one_fdwxact(dbid, xid, serverid, userid); + + if (fdwxact == NULL) + return; + + elog(DEBUG2, "removed fdwxact entry from shared memory for foreign transaction, db %u xid %u server %u user %u id %s", + fdwxact->dbid, fdwxact->local_xid, fdwxact->serverid, + fdwxact->userid, fdwxact->fdwxact_id); + + /* Clean up entry and any files we may have left */ + if (fdwxact->ondisk) + RemoveFdwXactFile(fdwxact->dbid, fdwxact->local_xid, + fdwxact->serverid, fdwxact->userid, + givewarning); + remove_fdwxact(fdwxact); +} + +/* + * Scan the shared memory entries of FdwXact and validate them. + * + * This is run at the end of recovery, but before we allow backends to write + * WAL. + */ +void +RecoverFdwXacts(void) +{ + int i; + + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + for (i = 0; i < FdwXactCtl->num_fdwxacts; i++) + { + FdwXact fdwxact = FdwXactCtl->fdwxacts[i]; + char *buf; + + buf = ProcessFdwXactBuffer(fdwxact->dbid, fdwxact->local_xid, + fdwxact->serverid, fdwxact->userid, + fdwxact->insert_start_lsn, fdwxact->ondisk); + + if (buf == NULL) + continue; + + ereport(LOG, + (errmsg("recovering foreign transaction %u for server %u and user %u from shared memory", + fdwxact->local_xid, fdwxact->serverid, fdwxact->userid))); + + /* recovered, so reset the flag for entries generated by redo */ + fdwxact->inredo = false; + fdwxact->valid = true; + + /* + * If the foreign transaction is part of a prepared local + * transaction, it's not in-doubt. A future COMMIT/ROLLBACK + * PREPARED will determine the fate of this foreign transaction. 
+ */ + if (TwoPhaseExists(fdwxact->local_xid)) + { + ereport(DEBUG2, + (errmsg("clear in-doubt flag from foreign transaction %u, server %u, user %u as found the corresponding local prepared transaction", + fdwxact->local_xid, fdwxact->serverid, + fdwxact->userid))); + fdwxact->indoubt = false; + } + + pfree(buf); + } + LWLockRelease(FdwXactLock); +} + +bool +check_foreign_twophase_commit(int *newval, void **extra, GucSource source) +{ + ForeignTwophaseCommitLevel newForeignTwophaseCommitLevel = *newval; + + /* Parameter check */ + if (newForeignTwophaseCommitLevel > FOREIGN_TWOPHASE_COMMIT_DISABLED && + (max_prepared_foreign_xacts == 0 || max_foreign_xact_resolvers == 0)) + { + GUC_check_errdetail("Cannot enable \"foreign_twophase_commit\" when " + "\"max_prepared_foreign_transactions\" or " + "\"max_foreign_transaction_resolvers\" is zero."); + return false; + } + + return true; +} + +/* Built in functions */ + +/* + * Structure to hold and iterate over the foreign transactions to be displayed + * by the built-in functions. 
+ */ +typedef struct +{ + FdwXact fdwxacts; + int num_xacts; + int cur_xact; +} WorkingStatus; + +Datum +pg_foreign_xacts(PG_FUNCTION_ARGS) +{ +#define PG_PREPARED_FDWXACTS_COLS 7 + FuncCallContext *funcctx; + WorkingStatus *status; + char *xact_status; + + if (SRF_IS_FIRSTCALL()) + { + TupleDesc tupdesc; + MemoryContext oldcontext; + int num_fdwxacts = 0; + + /* create a function context for cross-call persistence */ + funcctx = SRF_FIRSTCALL_INIT(); + + /* + * Switch to memory context appropriate for multiple function calls + */ + oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); + + /* build tupdesc for result tuples */ + /* this had better match pg_fdwxacts view in system_views.sql */ + tupdesc = CreateTemplateTupleDesc(PG_PREPARED_FDWXACTS_COLS); + TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid", + OIDOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction", + XIDOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid", + OIDOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid", + OIDOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status", + TEXTOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 6, "indoubt", + BOOLOID, -1, 0); + TupleDescInitEntry(tupdesc, (AttrNumber) 7, "identifier", + TEXTOID, -1, 0); + + funcctx->tuple_desc = BlessTupleDesc(tupdesc); + + /* + * Collect status information that we will format and send out as a + * result set. 
+ */ + status = (WorkingStatus *) palloc(sizeof(WorkingStatus)); + funcctx->user_fctx = (void *) status; + + status->fdwxacts = get_all_fdwxacts(&num_fdwxacts); + status->num_xacts = num_fdwxacts; + status->cur_xact = 0; + + MemoryContextSwitchTo(oldcontext); + } + + funcctx = SRF_PERCALL_SETUP(); + status = funcctx->user_fctx; + + while (status->cur_xact < status->num_xacts) + { + FdwXact fdwxact = &status->fdwxacts[status->cur_xact++]; + Datum values[PG_PREPARED_FDWXACTS_COLS]; + bool nulls[PG_PREPARED_FDWXACTS_COLS]; + HeapTuple tuple; + Datum result; + + if (!fdwxact->valid) + continue; + + /* + * Form tuple with appropriate data. + */ + MemSet(values, 0, sizeof(values)); + MemSet(nulls, 0, sizeof(nulls)); + + values[0] = ObjectIdGetDatum(fdwxact->dbid); + values[1] = TransactionIdGetDatum(fdwxact->local_xid); + values[2] = ObjectIdGetDatum(fdwxact->serverid); + values[3] = ObjectIdGetDatum(fdwxact->userid); + + switch (fdwxact->status) + { + case FDWXACT_STATUS_INITIAL: + xact_status = "initial"; + break; + case FDWXACT_STATUS_PREPARING: + xact_status = "preparing"; + break; + case FDWXACT_STATUS_PREPARED: + xact_status = "prepared"; + break; + case FDWXACT_STATUS_COMMITTING: + xact_status = "committing"; + break; + case FDWXACT_STATUS_ABORTING: + xact_status = "aborting"; + break; + case FDWXACT_STATUS_RESOLVED: + xact_status = "resolved"; + break; + default: + xact_status = "unknown"; + break; + } + values[4] = CStringGetTextDatum(xact_status); + values[5] = BoolGetDatum(fdwxact->indoubt); + values[6] = PointerGetDatum(cstring_to_text_with_len(fdwxact->fdwxact_id, + strlen(fdwxact->fdwxact_id))); + + tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls); + result = HeapTupleGetDatum(tuple); + SRF_RETURN_NEXT(funcctx, result); + } + + SRF_RETURN_DONE(funcctx); +} + +/* + * Built-in function to resolve a prepared foreign transaction manually. 
+ */ +Datum +pg_resolve_foreign_xact(PG_FUNCTION_ARGS) +{ + TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0)); + Oid serverid = PG_GETARG_OID(1); + Oid userid = PG_GETARG_OID(2); + ForeignServer *server; + UserMapping *usermapping; + FdwXact fdwxact; + FdwXactRslvState *state; + FdwXactStatus prev_status; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to resolve foreign transactions")))); + + server = GetForeignServer(serverid); + usermapping = GetUserMapping(userid, serverid); + state = create_fdwxact_state(); + + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + + fdwxact = get_one_fdwxact(MyDatabaseId, xid, serverid, userid); + + if (fdwxact == NULL) + { + LWLockRelease(FdwXactLock); + PG_RETURN_BOOL(false); + } + + state->server = server; + state->usermapping = usermapping; + state->fdwxact_id = pstrdup(fdwxact->fdwxact_id); + + SpinLockAcquire(&fdwxact->mutex); + prev_status = fdwxact->status; + SpinLockRelease(&fdwxact->mutex); + + FdwXactDetermineTransactionFate(fdwxact, false); + + LWLockRelease(FdwXactLock); + + FdwXactResolveForeignTransaction(fdwxact, state, prev_status); + + PG_RETURN_BOOL(true); +} + +/* + * Built-in function to remove a prepared foreign transaction entry without + * resolution. The function gives a way to forget about such a prepared + * transaction in cases where the foreign server on which it was prepared is + * no longer available, or the user that prepared the transaction needs to be + * dropped. 
+ */ +Datum +pg_remove_foreign_xact(PG_FUNCTION_ARGS) +{ + TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0)); + Oid serverid = PG_GETARG_OID(1); + Oid userid = PG_GETARG_OID(2); + FdwXact fdwxact; + + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + (errmsg("must be superuser to remove foreign transactions")))); + + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); + + fdwxact = get_one_fdwxact(MyDatabaseId, xid, serverid, userid); + + if (fdwxact == NULL) + { + LWLockRelease(FdwXactLock); + PG_RETURN_BOOL(false); + } + + remove_fdwxact(fdwxact); + + LWLockRelease(FdwXactLock); + + PG_RETURN_BOOL(true); +} diff --git a/src/backend/access/fdwxact/launcher.c b/src/backend/access/fdwxact/launcher.c new file mode 100644 index 0000000000..45fb530916 --- /dev/null +++ b/src/backend/access/fdwxact/launcher.c @@ -0,0 +1,644 @@ +/*------------------------------------------------------------------------- + * + * launcher.c + * + * The foreign transaction resolver launcher process starts foreign + * transaction resolver processes. The launcher starts a resolver + * process when a request arrives from a backend process. 
+ * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/access/fdwxact/launcher.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "funcapi.h" +#include "pgstat.h" + +#include "access/fdwxact.h" +#include "access/fdwxact_launcher.h" +#include "access/fdwxact_resolver.h" +#include "access/resolver_internal.h" +#include "commands/dbcommands.h" +#include "nodes/pg_list.h" +#include "postmaster/bgworker.h" +#include "storage/ipc.h" +#include "storage/proc.h" +#include "tcop/tcopprot.h" +#include "utils/builtins.h" + +/* max sleep time between cycles (3min) */ +#define DEFAULT_NAPTIME_PER_CYCLE 180000L + +static void fdwxact_launcher_onexit(int code, Datum arg); +static void fdwxact_launcher_sighup(SIGNAL_ARGS); +static void fdwxact_launch_resolver(Oid dbid); +static bool fdwxact_relaunch_resolvers(void); + +static volatile sig_atomic_t got_SIGHUP = false; +static volatile sig_atomic_t got_SIGUSR2 = false; +FdwXactResolver *MyFdwXactResolver = NULL; + +/* + * Wake up the launcher process to retry resolution. + */ +void +FdwXactLauncherRequestToLaunchForRetry(void) +{ + if (FdwXactRslvCtl->launcher_pid != InvalidPid) + SetLatch(FdwXactRslvCtl->launcher_latch); +} + +/* + * Wake up the launcher process to request launching new resolvers + * immediately. + */ +void +FdwXactLauncherRequestToLaunch(void) +{ + if (FdwXactRslvCtl->launcher_pid != InvalidPid) + kill(FdwXactRslvCtl->launcher_pid, SIGUSR2); +} + +/* Report shared memory space needed by FdwXactRslvShmemInit */ +Size +FdwXactRslvShmemSize(void) +{ + Size size = 0; + + size = add_size(size, SizeOfFdwXactRslvCtlData); + size = add_size(size, mul_size(max_foreign_xact_resolvers, + sizeof(FdwXactResolver))); + + return size; +} + +/* + * Allocate and initialize foreign transaction resolver shared + * memory. 
+ */ +void +FdwXactRslvShmemInit(void) +{ + bool found; + + FdwXactRslvCtl = ShmemInitStruct("Foreign transactions resolvers", + FdwXactRslvShmemSize(), + &found); + + if (!IsUnderPostmaster) + { + int slot; + + /* First time through, so initialize */ + MemSet(FdwXactRslvCtl, 0, FdwXactRslvShmemSize()); + + SHMQueueInit(&(FdwXactRslvCtl->fdwxact_queue)); + + for (slot = 0; slot < max_foreign_xact_resolvers; slot++) + { + FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[slot]; + + resolver->pid = InvalidPid; + resolver->dbid = InvalidOid; + resolver->in_use = false; + resolver->last_resolved_time = 0; + resolver->latch = NULL; + SpinLockInit(&(resolver->mutex)); + } + } +} + +/* + * Cleanup function for fdwxact launcher + * + * Called on fdwxact launcher exit. + */ +static void +fdwxact_launcher_onexit(int code, Datum arg) +{ + FdwXactRslvCtl->launcher_pid = InvalidPid; +} + +/* SIGHUP: set flag to reload configuration at next convenient time */ +static void +fdwxact_launcher_sighup(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_SIGHUP = true; + + SetLatch(MyLatch); + + errno = save_errno; +} + +/* SIGUSR2: set flag to launch new resolver process immediately */ +static void +fdwxact_launcher_sigusr2(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_SIGUSR2 = true; + SetLatch(MyLatch); + + errno = save_errno; +} + +/* + * Main loop for the fdwxact launcher process. 
+ */ +void +FdwXactLauncherMain(Datum main_arg) +{ + TimestampTz last_start_time = 0; + + ereport(DEBUG1, + (errmsg("fdwxact resolver launcher started"))); + + before_shmem_exit(fdwxact_launcher_onexit, (Datum) 0); + + Assert(FdwXactRslvCtl->launcher_pid == 0); + FdwXactRslvCtl->launcher_pid = MyProcPid; + FdwXactRslvCtl->launcher_latch = &MyProc->procLatch; + + pqsignal(SIGHUP, fdwxact_launcher_sighup); + pqsignal(SIGUSR2, fdwxact_launcher_sigusr2); + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + BackgroundWorkerInitializeConnection(NULL, NULL, 0); + + /* Enter main loop */ + for (;;) + { + TimestampTz now; + long wait_time = DEFAULT_NAPTIME_PER_CYCLE; + int rc; + + CHECK_FOR_INTERRUPTS(); + ResetLatch(MyLatch); + + now = GetCurrentTimestamp(); + + /* + * Limit start retries to once per + * foreign_xact_resolution_retry_interval, but always start + * immediately when a backend requests it. + */ + if (got_SIGUSR2 || + TimestampDifferenceExceeds(last_start_time, now, + foreign_xact_resolution_retry_interval)) + { + MemoryContext oldctx; + MemoryContext subctx; + bool launched; + + if (got_SIGUSR2) + got_SIGUSR2 = false; + + subctx = AllocSetContextCreate(TopMemoryContext, + "Foreign Transaction Launcher", + ALLOCSET_DEFAULT_SIZES); + oldctx = MemoryContextSwitchTo(subctx); + + /* + * Launch foreign transaction resolvers that are requested + * but not running. + */ + launched = fdwxact_relaunch_resolvers(); + if (launched) + { + last_start_time = now; + wait_time = foreign_xact_resolution_retry_interval; + } + + /* Switch back to original memory context. */ + MemoryContextSwitchTo(oldctx); + /* Clean the temporary memory. */ + MemoryContextDelete(subctx); + } + else + { + /* + * The wait in the previous cycle was interrupted less than + * foreign_xact_resolution_retry_interval after the last + * resolver started; this usually means the resolver crashed, + * so wait the full retry interval before trying again. 
+ */ + wait_time = foreign_xact_resolution_retry_interval; + } + + /* Wait for more work */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + wait_time, + WAIT_EVENT_FDWXACT_LAUNCHER_MAIN); + + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + + if (got_SIGHUP) + { + got_SIGHUP = false; + ProcessConfigFile(PGC_SIGHUP); + } + } + + /* Not reachable */ +} + +/* + * Request launcher to launch a new foreign transaction resolver process + * or wake up the resolver if it's already running. + */ +void +FdwXactLaunchOrWakeupResolver(void) +{ + volatile FdwXactResolver *resolver; + bool found = false; + int i; + + /* + * Look for a resolver process that is running and working on the + * same database. + */ + LWLockAcquire(FdwXactResolverLock, LW_SHARED); + for (i = 0; i < max_foreign_xact_resolvers; i++) + { + resolver = &FdwXactRslvCtl->resolvers[i]; + + if (resolver->in_use && + resolver->dbid == MyDatabaseId) + { + found = true; + break; + } + } + LWLockRelease(FdwXactResolverLock); + + if (found) + { + /* Found the running resolver */ + elog(DEBUG1, + "found a running foreign transaction resolver process for database %u", + MyDatabaseId); + + /* + * Wake up the resolver. It's possible that the resolver is starting + * up and hasn't attached to its slot yet. Since the resolver will + * soon find the FdwXact entry we inserted, we don't need to do + * anything here. + */ + if (resolver->latch) + SetLatch(resolver->latch); + + return; + } + + /* Otherwise wake up the launcher to launch new resolver */ + FdwXactLauncherRequestToLaunch(); +} + +/* + * Launch a foreign transaction resolver process that will connect to given + * 'dbid'. 
+ */ +static void +fdwxact_launch_resolver(Oid dbid) +{ + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + FdwXactResolver *resolver; + int unused_slot = -1; + int i; + + LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE); + + /* Find unused resolver slot */ + for (i = 0; i < max_foreign_xact_resolvers; i++) + { + FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[i]; + + if (!resolver->in_use) + { + unused_slot = i; + break; + } + } + + /* No unused slot found */ + if (unused_slot < 0) + ereport(ERROR, + (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), + errmsg("out of foreign transaction resolver slots"), + errhint("You might need to increase max_foreign_transaction_resolvers."))); + + resolver = &FdwXactRslvCtl->resolvers[unused_slot]; + resolver->in_use = true; + resolver->dbid = dbid; + LWLockRelease(FdwXactResolverLock); + + /* Register the new dynamic worker */ + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactResolverMain"); + snprintf(bgw.bgw_name, BGW_MAXLEN, + "foreign transaction resolver for database %u", resolver->dbid); + snprintf(bgw.bgw_type, BGW_MAXLEN, "foreign transaction resolver"); + bgw.bgw_restart_time = BGW_NEVER_RESTART; + bgw.bgw_notify_pid = MyProcPid; + bgw.bgw_main_arg = Int32GetDatum(unused_slot); + + if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle)) + { + /* Failed to launch, cleanup the worker slot */ + SpinLockAcquire(&(resolver->mutex)); + resolver->in_use = false; + SpinLockRelease(&(resolver->mutex)); + + ereport(WARNING, + (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), + errmsg("out of background worker slots"), + errhint("You might need to increase max_worker_processes."))); + } + + /* + * We don't need to wait until it attaches here because 
we're going to wait + * until all foreign transactions are resolved. + */ +} + +/* + * Launch or relaunch foreign transaction resolvers on databases that have + * at least one FdwXact entry but no resolver running on them. + */ +static bool +fdwxact_relaunch_resolvers(void) +{ + HTAB *resolver_dbs; /* DBs resolvers are running on */ + HTAB *fdwxact_dbs; /* DBs having at least one FdwXact entry */ + HASHCTL ctl; + HASH_SEQ_STATUS status; + Oid *entry; + bool launched = false; + int i; + + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(Oid); + resolver_dbs = hash_create("resolver dblist", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + fdwxact_dbs = hash_create("fdwxact dblist", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect database oids that have at least one non-in-doubt FdwXact entry */ + LWLockAcquire(FdwXactLock, LW_SHARED); + for (i = 0; i < FdwXactCtl->num_fdwxacts; i++) + { + FdwXact fdwxact = FdwXactCtl->fdwxacts[i]; + + if (fdwxact->indoubt) + continue; + + hash_search(fdwxact_dbs, &(fdwxact->dbid), HASH_ENTER, NULL); + } + LWLockRelease(FdwXactLock); + + /* There is no FdwXact entry, so no need to launch a new resolver */ + if (hash_get_num_entries(fdwxact_dbs) == 0) + return false; + + /* Collect database oids on which resolvers are running */ + LWLockAcquire(FdwXactResolverLock, LW_SHARED); + for (i = 0; i < max_foreign_xact_resolvers; i++) + { + FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[i]; + + if (!resolver->in_use) + continue; + + hash_search(resolver_dbs, &(resolver->dbid), HASH_ENTER, NULL); + } + LWLockRelease(FdwXactResolverLock); + + /* Find DBs on which no resolvers are running and launch a new one on them */ + hash_seq_init(&status, fdwxact_dbs); + while ((entry = (Oid *) hash_seq_search(&status)) != NULL) + { + bool found; + + hash_search(resolver_dbs, entry, HASH_FIND, &found); + + if (!found) + { + /* No resolver is running on this database, launch a new one */ + fdwxact_launch_resolver(*entry); + launched = true; + } + 
} + + return launched; +} + +/* + * FdwXactLauncherRegister + * Register a background worker running the foreign transaction + * launcher. + */ +void +FdwXactLauncherRegister(void) +{ + BackgroundWorker bgw; + + if (max_foreign_xact_resolvers == 0) + return; + + memset(&bgw, 0, sizeof(bgw)); + bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | + BGWORKER_BACKEND_DATABASE_CONNECTION; + bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres"); + snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactLauncherMain"); + snprintf(bgw.bgw_name, BGW_MAXLEN, + "foreign transaction launcher"); + snprintf(bgw.bgw_type, BGW_MAXLEN, + "foreign transaction launcher"); + bgw.bgw_restart_time = 5; + bgw.bgw_notify_pid = 0; + bgw.bgw_main_arg = (Datum) 0; + + RegisterBackgroundWorker(&bgw); +} + +bool +IsFdwXactLauncher(void) +{ + return FdwXactRslvCtl->launcher_pid == MyProcPid; +} + +/* + * Stop the fdwxact resolver running on the given database. + */ +Datum +pg_stop_foreign_xact_resolver(PG_FUNCTION_ARGS) +{ + Oid dbid = PG_GETARG_OID(0); + FdwXactResolver *resolver = NULL; + int i; + + /* Must be superuser */ + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied to stop foreign transaction resolver"))); + + if (!OidIsValid(dbid)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid database id"))); + + LWLockAcquire(FdwXactResolverLock, LW_SHARED); + + /* Find the running resolver process on the given database */ + for (i = 0; i < max_foreign_xact_resolvers; i++) + { + resolver = &FdwXactRslvCtl->resolvers[i]; + + /* found! */ + if (resolver->in_use && resolver->dbid == dbid) + break; + } + + if (i >= max_foreign_xact_resolvers) + ereport(ERROR, + (errmsg("there is no running foreign transaction resolver process on database %u", + dbid))); + + /* Found the resolver, terminate it ... */ + kill(resolver->pid, SIGTERM); + + /* ... 
and wait for it to die */ + for (;;) + { + int rc; + + /* is it gone? */ + if (!resolver->in_use) + break; + + LWLockRelease(FdwXactResolverLock); + + /* Wait a bit --- we don't expect to have to wait long. */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + 10L, WAIT_EVENT_BGWORKER_SHUTDOWN); + + if (rc & WL_LATCH_SET) + { + ResetLatch(MyLatch); + CHECK_FOR_INTERRUPTS(); + } + + LWLockAcquire(FdwXactResolverLock, LW_SHARED); + } + + LWLockRelease(FdwXactResolverLock); + + PG_RETURN_BOOL(true); +} + +/* + * Returns activity of all foreign transaction resolvers. + */ +Datum +pg_stat_get_foreign_xact(PG_FUNCTION_ARGS) +{ +#define PG_STAT_GET_FDWXACT_RESOLVERS_COLS 3 + ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; + TupleDesc tupdesc; + Tuplestorestate *tupstore; + MemoryContext per_query_ctx; + MemoryContext oldcontext; + int i; + + /* check to see if caller supports us returning a tuplestore */ + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + per_query_ctx = rsinfo->econtext->ecxt_per_query_memory; + oldcontext = MemoryContextSwitchTo(per_query_ctx); + + tupstore = tuplestore_begin_heap(true, false, work_mem); + rsinfo->returnMode = SFRM_Materialize; + rsinfo->setResult = tupstore; + rsinfo->setDesc = tupdesc; + + MemoryContextSwitchTo(oldcontext); + + for (i = 0; i < max_foreign_xact_resolvers; i++) + { + FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[i]; + pid_t pid; + Oid dbid; + TimestampTz 
last_resolved_time; + Datum values[PG_STAT_GET_FDWXACT_RESOLVERS_COLS]; + bool nulls[PG_STAT_GET_FDWXACT_RESOLVERS_COLS]; + + + SpinLockAcquire(&(resolver->mutex)); + if (resolver->pid == InvalidPid) + { + SpinLockRelease(&(resolver->mutex)); + continue; + } + + pid = resolver->pid; + dbid = resolver->dbid; + last_resolved_time = resolver->last_resolved_time; + SpinLockRelease(&(resolver->mutex)); + + memset(nulls, 0, sizeof(nulls)); + /* pid */ + values[0] = Int32GetDatum(pid); + + /* dbid */ + values[1] = ObjectIdGetDatum(dbid); + + /* last_resolved_time */ + if (last_resolved_time == 0) + nulls[2] = true; + else + values[2] = TimestampTzGetDatum(last_resolved_time); + + tuplestore_putvalues(tupstore, tupdesc, values, nulls); + } + + /* clean up and return the tuplestore */ + tuplestore_donestoring(tupstore); + + return (Datum) 0; +} diff --git a/src/backend/access/fdwxact/resolver.c b/src/backend/access/fdwxact/resolver.c new file mode 100644 index 0000000000..9298877f10 --- /dev/null +++ b/src/backend/access/fdwxact/resolver.c @@ -0,0 +1,344 @@ +/*------------------------------------------------------------------------- + * + * resolver.c + * + * The foreign transaction resolver background worker resolves foreign + * transactions that participate in a distributed transaction. A resolver + * process is started by the foreign transaction launcher for each database. + * + * A resolver process continues to resolve foreign transactions on the + * database for which backend processes are waiting for resolution. + * + * Normal termination is by SIGTERM, which instructs the resolver process + * to exit(0) at the next convenient moment. Emergency termination is by + * SIGQUIT, as with any backend. The resolver process also terminates on + * timeout, but only if there are no pending foreign transactions on the + * database waiting to be resolved. 
+ * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/access/fdwxact/resolver.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include <signal.h> +#include <unistd.h> + +#include "access/fdwxact.h" +#include "access/fdwxact_resolver.h" +#include "access/fdwxact_launcher.h" +#include "access/resolver_internal.h" +#include "access/transam.h" +#include "access/xact.h" +#include "commands/dbcommands.h" +#include "funcapi.h" +#include "libpq/libpq.h" +#include "miscadmin.h" +#include "pgstat.h" +#include "postmaster/bgworker.h" +#include "storage/ipc.h" +#include "tcop/tcopprot.h" +#include "utils/builtins.h" +#include "utils/timeout.h" +#include "utils/timestamp.h" + +/* max sleep time between cycles (3min) */ +#define DEFAULT_NAPTIME_PER_CYCLE 180000L + +/* GUC parameters */ +int foreign_xact_resolution_retry_interval; +int foreign_xact_resolver_timeout = 60 * 1000; +bool foreign_xact_resolve_indoubt_xacts; + +FdwXactRslvCtlData *FdwXactRslvCtl; + +static void FXRslvLoop(void); +static long FXRslvComputeSleepTime(TimestampTz now, TimestampTz targetTime); +static void FXRslvCheckTimeout(TimestampTz now); + +static void fdwxact_resolver_sighup(SIGNAL_ARGS); +static void fdwxact_resolver_onexit(int code, Datum arg); +static void fdwxact_resolver_detach(void); +static void fdwxact_resolver_attach(int slot); + +/* Flags set by signal handlers */ +static volatile sig_atomic_t got_SIGHUP = false; + +/* Set flag to reload configuration at next convenient time */ +static void +fdwxact_resolver_sighup(SIGNAL_ARGS) +{ + int save_errno = errno; + + got_SIGHUP = true; + + SetLatch(MyLatch); + + errno = save_errno; +} + +/* + * Detach the resolver and cleanup the resolver info. 
+ */ +static void +fdwxact_resolver_detach(void) +{ + /* Block concurrent access */ + LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE); + + MyFdwXactResolver->pid = InvalidPid; + MyFdwXactResolver->in_use = false; + MyFdwXactResolver->dbid = InvalidOid; + + LWLockRelease(FdwXactResolverLock); +} + +/* + * Clean up the foreign transaction resolver info. + */ +static void +fdwxact_resolver_onexit(int code, Datum arg) +{ + fdwxact_resolver_detach(); + + FdwXactLauncherRequestToLaunch(); +} + +/* + * Attach to a slot. + */ +static void +fdwxact_resolver_attach(int slot) +{ + /* Block concurrent access */ + LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE); + + Assert(slot >= 0 && slot < max_foreign_xact_resolvers); + MyFdwXactResolver = &FdwXactRslvCtl->resolvers[slot]; + + if (!MyFdwXactResolver->in_use) + { + LWLockRelease(FdwXactResolverLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("foreign transaction resolver slot %d is empty, cannot attach", + slot))); + } + + Assert(OidIsValid(MyFdwXactResolver->dbid)); + + MyFdwXactResolver->pid = MyProcPid; + MyFdwXactResolver->latch = &MyProc->procLatch; + MyFdwXactResolver->last_resolved_time = 0; + + before_shmem_exit(fdwxact_resolver_onexit, (Datum) 0); + + LWLockRelease(FdwXactResolverLock); +} + +/* Foreign transaction resolver entry point */ +void +FdwXactResolverMain(Datum main_arg) +{ + int slot = DatumGetInt32(main_arg); + + /* Attach to a slot */ + fdwxact_resolver_attach(slot); + + /* Establish signal handlers */ + pqsignal(SIGHUP, fdwxact_resolver_sighup); + pqsignal(SIGTERM, die); + BackgroundWorkerUnblockSignals(); + + /* Connect to our database */ + BackgroundWorkerInitializeConnectionByOid(MyFdwXactResolver->dbid, InvalidOid, 0); + + StartTransactionCommand(); + + ereport(LOG, + (errmsg("foreign transaction resolver for database \"%s\" has started", + get_database_name(MyFdwXactResolver->dbid)))); + + CommitTransactionCommand(); + + /* Initialize stats to a sane value */ 
+ MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp(); + + /* Run the main loop */ + FXRslvLoop(); + + proc_exit(0); +} + +/* + * Fdwxact resolver main loop + */ +static void +FXRslvLoop(void) +{ + MemoryContext resolver_ctx; + + resolver_ctx = AllocSetContextCreate(TopMemoryContext, + "Foreign Transaction Resolver", + ALLOCSET_DEFAULT_SIZES); + + /* Enter main loop */ + for (;;) + { + PGPROC *waiter = NULL; + TransactionId waitXid = InvalidTransactionId; + TimestampTz resolutionTs = -1; + int rc; + TimestampTz now; + long sleep_time = DEFAULT_NAPTIME_PER_CYCLE; + + ResetLatch(MyLatch); + + CHECK_FOR_INTERRUPTS(); + + MemoryContextSwitchTo(resolver_ctx); + + if (got_SIGHUP) + { + got_SIGHUP = false; + ProcessConfigFile(PGC_SIGHUP); + } + + now = GetCurrentTimestamp(); + + /* + * Process waiters until either the queue becomes empty or we find a + * waiter whose resolution time is in the future. + */ + while ((waiter = FdwXactGetWaiter(&resolutionTs, &waitXid)) != NULL) + { + CHECK_FOR_INTERRUPTS(); + Assert(TransactionIdIsValid(waitXid)); + + if (resolutionTs > now) + break; + + elog(DEBUG2, "resolver got one waiter with xid %u", waitXid); + + /* Resolve the waiting distributed transaction */ + StartTransactionCommand(); + FdwXactResolveTransactionAndReleaseWaiter(MyDatabaseId, waitXid, + waiter); + CommitTransactionCommand(); + + /* Update my stats */ + SpinLockAcquire(&(MyFdwXactResolver->mutex)); + MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp(); + SpinLockRelease(&(MyFdwXactResolver->mutex)); + } + + FXRslvCheckTimeout(now); + + sleep_time = FXRslvComputeSleepTime(now, resolutionTs); + + MemoryContextResetAndDeleteChildren(resolver_ctx); + MemoryContextSwitchTo(TopMemoryContext); + + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + sleep_time, + WAIT_EVENT_FDWXACT_RESOLVER_MAIN); + + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); + } +} + +/* + * Check whether any foreign transaction has been resolved within + * foreign_xact_resolver_timeout and shut down if not. + */ +static void +FXRslvCheckTimeout(TimestampTz now) +{ + TimestampTz last_resolved_time; + TimestampTz timeout; + + if (foreign_xact_resolver_timeout == 0) + return; + + last_resolved_time = MyFdwXactResolver->last_resolved_time; + timeout = TimestampTzPlusMilliseconds(last_resolved_time, + foreign_xact_resolver_timeout); + + if (now < timeout) + return; + + LWLockAcquire(FdwXactResolutionLock, LW_SHARED); + if (!FdwXactWaiterExists(MyDatabaseId)) + { + StartTransactionCommand(); + ereport(LOG, + (errmsg("foreign transaction resolver for database \"%s\" will stop because of the timeout", + get_database_name(MyDatabaseId)))); + CommitTransactionCommand(); + + /* + * Keep holding FdwXactResolutionLock until the slot is detached. This + * is necessary to prevent a race condition where a waiter enqueues + * itself after we have checked FdwXactWaiterExists. + */ + fdwxact_resolver_detach(); + LWLockRelease(FdwXactResolutionLock); + proc_exit(0); + } + else + elog(DEBUG2, "resolver reached the timeout but keeps running as the queue is not empty"); + + LWLockRelease(FdwXactResolutionLock); +} + +/* + * Compute how long we should sleep until the next cycle. We can sleep until + * the timeout or until the next resolution time given by nextResolutionTs. + */ +static long +FXRslvComputeSleepTime(TimestampTz now, TimestampTz nextResolutionTs) +{ + long sleeptime = DEFAULT_NAPTIME_PER_CYCLE; + + if (foreign_xact_resolver_timeout > 0) + { + TimestampTz timeout; + long sec_to_timeout; + int microsec_to_timeout; + + /* Compute relative time until wakeup. 
*/ + timeout = TimestampTzPlusMilliseconds(MyFdwXactResolver->last_resolved_time, + foreign_xact_resolver_timeout); + TimestampDifference(now, timeout, + &sec_to_timeout, &microsec_to_timeout); + + sleeptime = Min(sleeptime, + sec_to_timeout * 1000 + microsec_to_timeout / 1000); + } + + if (nextResolutionTs > 0) + { + long sec_to_timeout; + int microsec_to_timeout; + + TimestampDifference(now, nextResolutionTs, + &sec_to_timeout, &microsec_to_timeout); + + sleeptime = Min(sleeptime, + sec_to_timeout * 1000 + microsec_to_timeout / 1000); + } + + return sleeptime; +} + +bool +IsFdwXactResolver(void) +{ + return MyFdwXactResolver != NULL; +} diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile index f88d72fd86..982c1a36cc 100644 --- a/src/backend/access/rmgrdesc/Makefile +++ b/src/backend/access/rmgrdesc/Makefile @@ -13,6 +13,7 @@ OBJS = \ clogdesc.o \ committsdesc.o \ dbasedesc.o \ + fdwxactdesc.o \ genericdesc.o \ gindesc.o \ gistdesc.o \ diff --git a/src/backend/access/rmgrdesc/fdwxactdesc.c b/src/backend/access/rmgrdesc/fdwxactdesc.c new file mode 100644 index 0000000000..fe0cef9472 --- /dev/null +++ b/src/backend/access/rmgrdesc/fdwxactdesc.c @@ -0,0 +1,58 @@ +/*------------------------------------------------------------------------- + * + * fdwxactdesc.c + * PostgreSQL global transaction manager for foreign servers. + * + * This module describes the WAL records for the foreign transaction manager. 
+ * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * src/backend/access/rmgrdesc/fdwxactdesc.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/fdwxact_xlog.h" + +void +fdwxact_desc(StringInfo buf, XLogReaderState *record) +{ + char *rec = XLogRecGetData(record); + uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; + + if (info == XLOG_FDWXACT_INSERT) + { + FdwXactOnDiskData *fdwxact_insert = (FdwXactOnDiskData *) rec; + + appendStringInfo(buf, "server: %u,", fdwxact_insert->serverid); + appendStringInfo(buf, " user: %u,", fdwxact_insert->userid); + appendStringInfo(buf, " database: %u,", fdwxact_insert->dbid); + appendStringInfo(buf, " local xid: %u,", fdwxact_insert->local_xid); + appendStringInfo(buf, " id: %s", fdwxact_insert->fdwxact_id); + } + else + { + xl_fdwxact_remove *fdwxact_remove = (xl_fdwxact_remove *) rec; + + appendStringInfo(buf, "server: %u,", fdwxact_remove->serverid); + appendStringInfo(buf, " user: %u,", fdwxact_remove->userid); + appendStringInfo(buf, " database: %u,", fdwxact_remove->dbid); + appendStringInfo(buf, " local xid: %u", fdwxact_remove->xid); + } + +} + +const char * +fdwxact_identify(uint8 info) +{ + switch (info & ~XLR_INFO_MASK) + { + case XLOG_FDWXACT_INSERT: + return "NEW FOREIGN TRANSACTION"; + case XLOG_FDWXACT_REMOVE: + return "REMOVE FOREIGN TRANSACTION"; + } + /* Keep compiler happy */ + return NULL; +} diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c index 33060f3042..1d4e1c82e1 100644 --- a/src/backend/access/rmgrdesc/xlogdesc.c +++ b/src/backend/access/rmgrdesc/xlogdesc.c @@ -114,7 +114,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) appendStringInfo(buf, "max_connections=%d max_worker_processes=%d " "max_wal_senders=%d max_prepared_xacts=%d " "max_locks_per_xact=%d wal_level=%s " - "wal_log_hints=%s track_commit_timestamp=%s", + "wal_log_hints=%s 
track_commit_timestamp=%s " + "max_prepared_foreign_transactions=%d", xlrec.MaxConnections, xlrec.max_worker_processes, xlrec.max_wal_senders, @@ -122,7 +123,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) xlrec.max_locks_per_xact, wal_level_str, xlrec.wal_log_hints ? "on" : "off", - xlrec.track_commit_timestamp ? "on" : "off"); + xlrec.track_commit_timestamp ? "on" : "off", + xlrec.max_prepared_foreign_xacts); } else if (info == XLOG_FPW_CHANGE) { diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c index 58091f6b52..200cf9d067 100644 --- a/src/backend/access/transam/rmgr.c +++ b/src/backend/access/transam/rmgr.c @@ -10,6 +10,7 @@ #include "access/brin_xlog.h" #include "access/clog.h" #include "access/commit_ts.h" +#include "access/fdwxact.h" #include "access/generic_xlog.h" #include "access/ginxlog.h" #include "access/gistxlog.h" diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 529976885f..2c9af36bbb 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -77,6 +77,7 @@ #include <unistd.h> #include "access/commit_ts.h" +#include "access/fdwxact.h" #include "access/htup_details.h" #include "access/subtrans.h" #include "access/transam.h" @@ -850,6 +851,35 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held) return result; } +/* + * TwoPhaseExists + * Return true if there is a prepared transaction specified by XID + */ +bool +TwoPhaseExists(TransactionId xid) +{ + int i; + bool found = false; + + LWLockAcquire(TwoPhaseStateLock, LW_SHARED); + + for (i = 0; i < TwoPhaseState->numPrepXacts; i++) + { + GlobalTransaction gxact = TwoPhaseState->prepXacts[i]; + PGXACT *pgxact = &ProcGlobal->allPgXact[gxact->pgprocno]; + + if (pgxact->xid == xid) + { + found = true; + break; + } + } + + LWLockRelease(TwoPhaseStateLock); + + return found; +} + /* * TwoPhaseGetDummyBackendId * Get the dummy backend ID for prepared transaction specified by XID @@ 
-2262,6 +2292,12 @@ RecordTransactionCommitPrepared(TransactionId xid, * in the procarray and continue to hold locks. */ SyncRepWaitForLSN(recptr, true); + + /* + * Wait for foreign transactions prepared as part of this prepared + * transaction to be committed. + */ + FdwXactWaitToBeResolved(xid, true); } /* @@ -2321,6 +2357,12 @@ RecordTransactionAbortPrepared(TransactionId xid, * in the procarray and continue to hold locks. */ SyncRepWaitForLSN(recptr, false); + + /* + * Wait for foreign transactions prepared as part of this prepared + * transaction to be aborted. + */ + FdwXactWaitToBeResolved(xid, false); } /* diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 5353b6ab0b..5b67056c65 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -21,6 +21,7 @@ #include <unistd.h> #include "access/commit_ts.h" +#include "access/fdwxact.h" #include "access/multixact.h" #include "access/parallel.h" #include "access/subtrans.h" @@ -1218,6 +1219,7 @@ RecordTransactionCommit(void) SharedInvalidationMessage *invalMessages = NULL; bool RelcacheInitFileInval = false; bool wrote_xlog; + bool need_commit_globally; /* Get data needed for commit record */ nrels = smgrGetPendingDeletes(true, &rels); @@ -1226,6 +1228,7 @@ RecordTransactionCommit(void) nmsgs = xactGetCommittedInvalidationMessages(&invalMessages, &RelcacheInitFileInval); wrote_xlog = (XactLastRecEnd != 0); + need_commit_globally = FdwXactIsForeignTwophaseCommitRequired(); /* * If we haven't been assigned an XID yet, we neither can, nor do we want @@ -1264,12 +1267,13 @@ RecordTransactionCommit(void) } /* - * If we didn't create XLOG entries, we're done here; otherwise we - * should trigger flushing those entries the same as a commit record + * If we didn't create XLOG entries and the transaction does not need + * to be committed using two-phase commit, 
we're done here; otherwise + * we should trigger flushing those entries the same as a commit record * would. This will primarily happen for HOT pruning and the like; we * want these to be flushed to disk in due time. */ - if (!wrote_xlog) + if (!wrote_xlog && !need_commit_globally) goto cleanup; } else @@ -1427,6 +1431,14 @@ RecordTransactionCommit(void) if (wrote_xlog && markXidCommitted) SyncRepWaitForLSN(XactLastRecEnd, true); + /* + * Wait for prepared foreign transaction to be resolved, if required. + * We only want to wait if we prepared foreign transaction in this + * transaction. + */ + if (need_commit_globally && markXidCommitted) + FdwXactWaitToBeResolved(xid, true); + /* remember end of last commit record */ XactLastCommitEnd = XactLastRecEnd; @@ -2086,6 +2098,10 @@ CommitTransaction(void) break; } + + /* Pre-commit step for foreign transactions */ + PreCommit_FdwXacts(); + CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT : XACT_EVENT_PRE_COMMIT); @@ -2246,6 +2262,7 @@ CommitTransaction(void) AtEOXact_PgStat(true, is_parallel_worker); AtEOXact_Snapshot(true, false); AtEOXact_ApplyLauncher(true); + AtEOXact_FdwXacts(true); pgstat_report_xact_timestamp(0); CurrentResourceOwner = NULL; @@ -2333,6 +2350,8 @@ PrepareTransaction(void) * the transaction-abort path. 
*/ + AtPrepare_FdwXacts(); + /* Shut down the deferred-trigger manager */ AfterTriggerEndXact(true); @@ -2527,6 +2546,7 @@ PrepareTransaction(void) AtEOXact_Files(true); AtEOXact_ComboCid(); AtEOXact_HashTables(true); + AtEOXact_FdwXacts(true); /* don't call AtEOXact_PgStat here; we fixed pgstat state above */ AtEOXact_Snapshot(true, true); pgstat_report_xact_timestamp(0); @@ -2732,6 +2752,7 @@ AbortTransaction(void) AtEOXact_HashTables(false); AtEOXact_PgStat(false, is_parallel_worker); AtEOXact_ApplyLauncher(false); + AtEOXact_FdwXacts(false); pgstat_report_xact_timestamp(0); } diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6bc1a6b46d..428a974c51 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -24,6 +24,7 @@ #include "access/clog.h" #include "access/commit_ts.h" +#include "access/fdwxact.h" #include "access/heaptoast.h" #include "access/multixact.h" #include "access/rewriteheap.h" @@ -5246,6 +5247,7 @@ BootStrapXLOG(void) ControlFile->max_worker_processes = max_worker_processes; ControlFile->max_wal_senders = max_wal_senders; ControlFile->max_prepared_xacts = max_prepared_xacts; + ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts; ControlFile->max_locks_per_xact = max_locks_per_xact; ControlFile->wal_level = wal_level; ControlFile->wal_log_hints = wal_log_hints; @@ -6189,6 +6191,9 @@ CheckRequiredParameterValues(void) RecoveryRequiresIntParameter("max_wal_senders", max_wal_senders, ControlFile->max_wal_senders); + RecoveryRequiresIntParameter("max_prepared_foreign_transactions", + max_prepared_foreign_xacts, + ControlFile->max_prepared_foreign_xacts); RecoveryRequiresIntParameter("max_prepared_transactions", max_prepared_xacts, ControlFile->max_prepared_xacts); @@ -6729,14 +6734,15 @@ StartupXLOG(void) restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI); /* - * Before running in recovery, scan pg_twophase and fill in its status to - * be able to work 
on entries generated by redo. Doing a scan before - * taking any recovery action has the merit to discard any 2PC files that - * are newer than the first record to replay, saving from any conflicts at - * replay. This avoids as well any subsequent scans when doing recovery - * of the on-disk two-phase data. + * Before running in recovery, scan pg_twophase and pg_fdwxacts, and then + * fill in its status to be able to work on entries generated by redo. + * Doing a scan before taking any recovery action has the merit to discard + * any state files that are newer than the first record to replay, saving + * from any conflicts at replay. This avoids as well any subsequent scans + * when doing recovery of the on-disk two-phase or fdwxact data. */ restoreTwoPhaseData(); + restoreFdwXactData(); lastFullPageWrites = checkPoint.fullPageWrites; @@ -6928,7 +6934,10 @@ StartupXLOG(void) InitRecoveryTransactionEnvironment(); if (wasShutdown) + { oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids); + oldestActiveXID = PrescanFdwXacts(oldestActiveXID); + } else oldestActiveXID = checkPoint.oldestActiveXid; Assert(TransactionIdIsValid(oldestActiveXID)); @@ -7424,6 +7433,7 @@ StartupXLOG(void) * as potential problems are detected before any on-disk change is done. */ oldestActiveXID = PrescanPreparedTransactions(NULL, NULL); + oldestActiveXID = PrescanFdwXacts(oldestActiveXID); /* * Consider whether we need to assign a new timeline ID. @@ -7754,6 +7764,9 @@ StartupXLOG(void) /* Reload shared-memory state for prepared transactions */ RecoverPreparedTransactions(); + /* Load all foreign transaction entries from disk to memory */ + RecoverFdwXacts(); + /* * Shutdown the recovery environment. 
This must occur after * RecoverPreparedTransactions(), see notes for lock_twophase_recover() @@ -9029,6 +9042,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags) CheckPointReplicationOrigin(); /* We deliberately delay 2PC checkpointing as long as possible */ CheckPointTwoPhase(checkPointRedo); + CheckPointFdwXacts(checkPointRedo); } /* @@ -9462,8 +9476,10 @@ XLogReportParameters(void) max_worker_processes != ControlFile->max_worker_processes || max_wal_senders != ControlFile->max_wal_senders || max_prepared_xacts != ControlFile->max_prepared_xacts || + max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts || max_locks_per_xact != ControlFile->max_locks_per_xact || - track_commit_timestamp != ControlFile->track_commit_timestamp) + track_commit_timestamp != ControlFile->track_commit_timestamp || + max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts) { /* * The change in number of backend slots doesn't need to be WAL-logged @@ -9481,6 +9497,7 @@ XLogReportParameters(void) xlrec.max_worker_processes = max_worker_processes; xlrec.max_wal_senders = max_wal_senders; xlrec.max_prepared_xacts = max_prepared_xacts; + xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts; xlrec.max_locks_per_xact = max_locks_per_xact; xlrec.wal_level = wal_level; xlrec.wal_log_hints = wal_log_hints; @@ -9497,6 +9514,7 @@ XLogReportParameters(void) ControlFile->max_worker_processes = max_worker_processes; ControlFile->max_wal_senders = max_wal_senders; ControlFile->max_prepared_xacts = max_prepared_xacts; + ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts; ControlFile->max_locks_per_xact = max_locks_per_xact; ControlFile->wal_level = wal_level; ControlFile->wal_log_hints = wal_log_hints; @@ -9702,6 +9720,7 @@ xlog_redo(XLogReaderState *record) RunningTransactionsData running; oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids); + oldestActiveXID = PrescanFdwXacts(oldestActiveXID); /* * Construct a RunningTransactions 
snapshot representing a shut @@ -9901,6 +9920,7 @@ xlog_redo(XLogReaderState *record) ControlFile->max_worker_processes = xlrec.max_worker_processes; ControlFile->max_wal_senders = xlrec.max_wal_senders; ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts; + ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts; ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact; ControlFile->wal_level = xlrec.wal_level; ControlFile->wal_log_hints = xlrec.wal_log_hints; diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index f7800f01a6..b4c1cce1f0 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -332,6 +332,9 @@ CREATE VIEW pg_prepared_xacts AS CREATE VIEW pg_prepared_statements AS SELECT * FROM pg_prepared_statement() AS P; +CREATE VIEW pg_foreign_xacts AS + SELECT * FROM pg_foreign_xacts() AS F; + CREATE VIEW pg_seclabels AS SELECT l.objoid, l.classoid, l.objsubid, @@ -818,6 +821,14 @@ CREATE VIEW pg_stat_subscription AS LEFT JOIN pg_stat_get_subscription(NULL) st ON (st.subid = su.oid); +CREATE VIEW pg_stat_foreign_xact AS + SELECT + r.pid, + r.dbid, + r.last_resolved_time + FROM pg_stat_get_foreign_xact() r + WHERE r.pid IS NOT NULL; + CREATE VIEW pg_stat_ssl AS SELECT S.pid, diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 42a147b67d..e3caef7ef9 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2857,8 +2857,14 @@ CopyFrom(CopyState cstate) if (resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) + { + /* Remember the transaction modifies data on a foreign server*/ + RegisterFdwXactByRelId(RelationGetRelid(resultRelInfo->ri_RelationDesc), + true); + resultRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, resultRelInfo); + } /* Prepare to catch AFTER triggers. 
*/ AfterTriggerBeginQuery(); diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c index 766c9f95c8..43bbe8356d 100644 --- a/src/backend/commands/foreigncmds.c +++ b/src/backend/commands/foreigncmds.c @@ -13,6 +13,8 @@ */ #include "postgres.h" +#include "access/fdwxact.h" +#include "access/heapam.h" #include "access/htup_details.h" #include "access/reloptions.h" #include "access/table.h" @@ -1101,6 +1103,18 @@ RemoveForeignServerById(Oid srvId) if (!HeapTupleIsValid(tp)) elog(ERROR, "cache lookup failed for foreign server %u", srvId); + /* + * If there is a foreign prepared transaction with this foreign server, + * dropping it might result in a dangling prepared transaction. + */ + if (fdwxact_exists(MyDatabaseId, srvId, InvalidOid)) + { + Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp); + ereport(WARNING, + (errmsg("server \"%s\" has unresolved prepared transactions on it", + NameStr(srvForm->srvname)))); + } + CatalogTupleDelete(rel, &tp->t_self); ReleaseSysCache(tp); @@ -1419,6 +1433,15 @@ RemoveUserMapping(DropUserMappingStmt *stmt) user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername); + /* + * If there is a foreign prepared transaction with this user mapping, + * dropping it might result in a dangling prepared transaction. + */ + if (fdwxact_exists(MyDatabaseId, srv->serverid, useId)) + ereport(WARNING, + (errmsg("server \"%s\" has unresolved prepared transactions for user \"%s\"", + srv->servername, MappingUserName(useId)))); + /* * Do the deletion */ @@ -1572,6 +1595,13 @@ ImportForeignSchema(ImportForeignSchemaStmt *stmt) errmsg("foreign-data wrapper \"%s\" does not support IMPORT FOREIGN SCHEMA", fdw->fdwname))); + /* + * Remember that the transaction accesses a foreign server. Normally during + * ImportForeignSchema we don't modify data on foreign servers, so register + * the server as not modified. 
+ */ + RegisterFdwXactByServerId(server->serverid, false); + /* Call FDW to get a list of commands */ cmd_list = fdw_routine->ImportForeignSchema(stmt, server->serverid); diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index d23f292cb0..690717c34e 100644 --- a/src/backend/executor/execPartition.c +++ b/src/backend/executor/execPartition.c @@ -13,6 +13,7 @@ */ #include "postgres.h" +#include "access/fdwxact.h" #include "access/table.h" #include "access/tableam.h" #include "catalog/partition.h" @@ -944,7 +945,14 @@ ExecInitRoutingInfo(ModifyTableState *mtstate, */ if (partRelInfo->ri_FdwRoutine != NULL && partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) + { + Relation child = partRelInfo->ri_RelationDesc; + + /* Remember that the transaction modifies data on a foreign server */ + RegisterFdwXactByRelId(RelationGetRelid(child), true); + partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo); + } partRelInfo->ri_PartitionInfo = partrouteinfo; partRelInfo->ri_CopyMultiInsertBuffer = NULL; diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c index 52af1dac5c..3ac56d1678 100644 --- a/src/backend/executor/nodeForeignscan.c +++ b/src/backend/executor/nodeForeignscan.c @@ -22,6 +22,8 @@ */ #include "postgres.h" +#include "access/fdwxact.h" +#include "access/xact.h" #include "executor/executor.h" #include "executor/nodeForeignscan.h" #include "foreign/fdwapi.h" @@ -224,9 +226,31 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags) * Tell the FDW to initialize the scan. */ if (node->operation != CMD_SELECT) + { + RangeTblEntry *rte; + + rte = exec_rt_fetch(estate->es_result_relation_info->ri_RangeTableIndex, + estate); + + /* Remember that the transaction modifies data on a foreign server */ + RegisterFdwXactByRelId(rte->relid, true); + fdwroutine->BeginDirectModify(scanstate, eflags); + } else + { + RangeTblEntry *rte; + int rtindex = (scanrelid > 0) ?
+ scanrelid : + bms_next_member(node->fs_relids, -1); + + rte = exec_rt_fetch(rtindex, estate); + + /* Remember that the transaction accesses a foreign server */ + RegisterFdwXactByRelId(rte->relid, false); + fdwroutine->BeginForeignScan(scanstate, eflags); + } return scanstate; } diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index cd91f9c8a8..c1ab3d829a 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -37,6 +37,7 @@ #include "postgres.h" +#include "access/fdwxact.h" #include "access/heapam.h" #include "access/htup_details.h" #include "access/tableam.h" @@ -47,6 +48,7 @@ #include "executor/executor.h" #include "executor/nodeModifyTable.h" #include "foreign/fdwapi.h" +#include "foreign/foreign.h" #include "miscadmin.h" #include "nodes/nodeFuncs.h" #include "rewrite/rewriteHandler.h" @@ -549,6 +551,10 @@ ExecInsert(ModifyTableState *mtstate, NULL, specToken); + /* Make note that we've written to a non-temporary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + /* insert index entries for tuple */ recheckIndexes = ExecInsertIndexTuples(slot, estate, true, &specConflict, @@ -777,6 +783,10 @@ ldelete:; &tmfd, changingPart); + /* Make note that we've written to a non-temporary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + switch (result) { case TM_SelfModified: @@ -1323,6 +1333,10 @@ lreplace:; true /* wait for commit */ , &tmfd, &lockmode, &update_indexes); + /* Make note that we've written to a non-temporary relation */ + if (RelationNeedsWAL(resultRelationDesc)) + MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL; + switch (result) { case TM_SelfModified: @@ -2382,6 +2396,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags) resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL) { List *fdw_private = (List *) list_nth(node->fdwPrivLists, i); + Oid relid =
RelationGetRelid(resultRelInfo->ri_RelationDesc); + + /* Remember that the transaction modifies data on a foreign server */ + RegisterFdwXactByRelId(relid, true); resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate, resultRelInfo, diff --git a/src/backend/foreign/foreign.c b/src/backend/foreign/foreign.c index c917ec40ff..0b17505aac 100644 --- a/src/backend/foreign/foreign.c +++ b/src/backend/foreign/foreign.c @@ -187,6 +187,49 @@ GetForeignServerByName(const char *srvname, bool missing_ok) return GetForeignServer(serverid); } +/* + * GetUserMappingByOid - look up the user mapping by user mapping oid. + * + * If the mapping's userid is invalid, we set it to the current userid. + */ +UserMapping * +GetUserMappingByOid(Oid umid) +{ + Datum datum; + HeapTuple tp; + UserMapping *um; + bool isnull; + Form_pg_user_mapping tableform; + + tp = SearchSysCache1(USERMAPPINGOID, + ObjectIdGetDatum(umid)); + + if (!HeapTupleIsValid(tp)) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_OBJECT), + errmsg("user mapping not found for %u", umid))); + + tableform = (Form_pg_user_mapping) GETSTRUCT(tp); + um = (UserMapping *) palloc(sizeof(UserMapping)); + um->umid = umid; + um->userid = OidIsValid(tableform->umuser) ? + tableform->umuser : GetUserId(); + um->serverid = tableform->umserver; + + /* Extract the umoptions */ + datum = SysCacheGetAttr(USERMAPPINGUSERSERVER, + tp, + Anum_pg_user_mapping_umoptions, + &isnull); + if (isnull) + um->options = NIL; + else + um->options = untransformRelOptions(datum); + + ReleaseSysCache(tp); + + return um; +} /* * GetUserMapping - look up the user mapping.
@@ -328,6 +371,20 @@ GetFdwRoutine(Oid fdwhandler) elog(ERROR, "foreign-data wrapper handler function %u did not return an FdwRoutine struct", fdwhandler); + /* Sanity check for transaction management callbacks */ + if ((routine->CommitForeignTransaction && + !routine->RollbackForeignTransaction) || + (!routine->CommitForeignTransaction && + routine->RollbackForeignTransaction)) + elog(ERROR, + "foreign-data wrapper must support both commit and rollback routines or neither"); + + if (routine->PrepareForeignTransaction && + (!routine->CommitForeignTransaction || + !routine->RollbackForeignTransaction)) + elog(ERROR, + "foreign-data wrapper that supports prepare routine must support both commit and rollback routines"); + return routine; } diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c index 5f8a007e73..0a8890a984 100644 --- a/src/backend/postmaster/bgworker.c +++ b/src/backend/postmaster/bgworker.c @@ -14,6 +14,8 @@ #include <unistd.h> +#include "access/fdwxact_launcher.h" +#include "access/fdwxact_resolver.h" #include "access/parallel.h" #include "libpq/pqsignal.h" #include "miscadmin.h" @@ -129,6 +131,12 @@ static const struct }, { "ApplyWorkerMain", ApplyWorkerMain + }, + { + "FdwXactResolverMain", FdwXactResolverMain + }, + { + "FdwXactLauncherMain", FdwXactLauncherMain } }; diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index fabcf31de8..0d3932c2cf 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3650,6 +3650,12 @@ pgstat_get_wait_activity(WaitEventActivity w) case WAIT_EVENT_CHECKPOINTER_MAIN: event_name = "CheckpointerMain"; break; + case WAIT_EVENT_FDWXACT_RESOLVER_MAIN: + event_name = "FdwXactResolverMain"; + break; + case WAIT_EVENT_FDWXACT_LAUNCHER_MAIN: + event_name = "FdwXactLauncherMain"; + break; case WAIT_EVENT_LOGICAL_APPLY_MAIN: event_name = "LogicalApplyMain"; break; @@ -3853,6 +3859,12 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_SYNC_REP: event_name = "SyncRep"; break; + case WAIT_EVENT_FDWXACT: + event_name = "FdwXact"; + break; + case WAIT_EVENT_FDWXACT_RESOLUTION: + event_name = "FdwXactResolution"; + break; /* no default case, so that compiler will warn */ } @@ -4068,6 +4079,15 @@ pgstat_get_wait_io(WaitEventIO w) case WAIT_EVENT_TWOPHASE_FILE_WRITE: event_name = "TwophaseFileWrite"; break; + case WAIT_EVENT_FDWXACT_FILE_WRITE: + event_name = "FdwXactFileWrite"; + break; + case WAIT_EVENT_FDWXACT_FILE_READ: + event_name = "FdwXactFileRead"; + break; + case WAIT_EVENT_FDWXACT_FILE_SYNC: + event_name = "FdwXactFileSync"; + break; case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ: event_name = "WALSenderTimelineHistoryRead"; break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 9ff2832c00..f92be8387d 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -93,6 +93,8 @@ #include <pthread.h> #endif +#include "access/fdwxact_resolver.h" +#include "access/fdwxact_launcher.h" #include "access/transam.h" #include "access/xlog.h" #include "bootstrap/bootstrap.h" @@ -909,6 +911,10 @@ PostmasterMain(int argc, char *argv[]) ereport(ERROR, (errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\""))); + if (max_prepared_foreign_xacts > 0 && max_foreign_xact_resolvers == 0) + ereport(ERROR, + (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires max_foreign_transaction_resolvers > 0"))); + /* * Other one-time internal sanity checks can go here, if they are fast. * (Put any slow processing further down, after postmaster.pid creation.) @@ -984,12 +990,13 @@ PostmasterMain(int argc, char *argv[]) #endif /* - * Register the apply launcher. Since it registers a background worker, - * it needs to be called before InitializeMaxBackends(), and it's probably - * a good idea to call it before any modules had chance to take the - * background worker slots.
+ * Register the apply launcher and foreign transaction launcher. Since + * these register background workers, they need to be called before + * InitializeMaxBackends(), and it's probably a good idea to call them + * before any modules have a chance to take the background worker slots. */ ApplyLauncherRegister(); + FdwXactLauncherRegister(); /* * process any libraries that should be preloaded at postmaster start diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index bc532d027b..6269f384af 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -151,6 +151,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor case RM_COMMIT_TS_ID: case RM_REPLORIGIN_ID: case RM_GENERIC_ID: + case RM_FDWXACT_ID: /* just deal with xid, and done */ ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record), buf.origptr); diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 4829953ee6..6bde7a735a 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -16,6 +16,8 @@ #include "access/clog.h" #include "access/commit_ts.h" +#include "access/fdwxact.h" +#include "access/fdwxact_launcher.h" #include "access/heapam.h" #include "access/multixact.h" #include "access/nbtree.h" @@ -147,6 +149,8 @@ CreateSharedMemoryAndSemaphores(void) size = add_size(size, BTreeShmemSize()); size = add_size(size, SyncScanShmemSize()); size = add_size(size, AsyncShmemSize()); + size = add_size(size, FdwXactShmemSize()); + size = add_size(size, FdwXactRslvShmemSize()); #ifdef EXEC_BACKEND size = add_size(size, ShmemBackendArraySize()); #endif @@ -263,6 +267,8 @@ CreateSharedMemoryAndSemaphores(void) BTreeShmemInit(); SyncScanShmemInit(); AsyncShmemInit(); + FdwXactShmemInit(); + FdwXactRslvShmemInit(); #ifdef EXEC_BACKEND diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index 13bcbe77de..020eb76b6a
100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -93,6 +93,8 @@ typedef struct ProcArrayStruct TransactionId replication_slot_xmin; /* oldest catalog xmin of any replication slot */ TransactionId replication_slot_catalog_xmin; + /* local transaction id of oldest unresolved distributed transaction */ + TransactionId fdwxact_unresolved_xmin; /* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */ int pgprocnos[FLEXIBLE_ARRAY_MEMBER]; @@ -248,6 +250,7 @@ CreateSharedProcArray(void) procArray->lastOverflowedXid = InvalidTransactionId; procArray->replication_slot_xmin = InvalidTransactionId; procArray->replication_slot_catalog_xmin = InvalidTransactionId; + procArray->fdwxact_unresolved_xmin = InvalidTransactionId; } allProcs = ProcGlobal->allProcs; @@ -1312,6 +1315,7 @@ GetOldestXmin(Relation rel, int flags) TransactionId replication_slot_xmin = InvalidTransactionId; TransactionId replication_slot_catalog_xmin = InvalidTransactionId; + TransactionId fdwxact_unresolved_xmin = InvalidTransactionId; /* * If we're not computing a relation specific limit, or if a shared @@ -1377,6 +1381,7 @@ GetOldestXmin(Relation rel, int flags) */ replication_slot_xmin = procArray->replication_slot_xmin; replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin; + fdwxact_unresolved_xmin = procArray->fdwxact_unresolved_xmin; if (RecoveryInProgress()) { @@ -1426,6 +1431,15 @@ GetOldestXmin(Relation rel, int flags) NormalTransactionIdPrecedes(replication_slot_xmin, result)) result = replication_slot_xmin; + /* + * Check whether there are unresolved distributed transactions + * requiring an older xmin.
+ */ + if (!(flags & PROCARRAY_FDWXACT_XMIN) && + TransactionIdIsValid(fdwxact_unresolved_xmin) && + NormalTransactionIdPrecedes(fdwxact_unresolved_xmin, result)) + result = fdwxact_unresolved_xmin; + /* * After locks have been released and vacuum_defer_cleanup_age has been * applied, check whether we need to back up further to make logical @@ -3128,6 +3142,38 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin, LWLockRelease(ProcArrayLock); } +/* + * ProcArraySetFdwXactUnresolvedXmin + * + * Install a limit on future computations of the xmin horizon to prevent + * vacuum and clog truncation from removing transactions still needed for + * resolving distributed transactions. + */ +void +ProcArraySetFdwXactUnresolvedXmin(TransactionId xmin) +{ + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + procArray->fdwxact_unresolved_xmin = xmin; + LWLockRelease(ProcArrayLock); +} + +/* + * ProcArrayGetFdwXactUnresolvedXmin + * + * Return the current unresolved xmin limit. + */ +TransactionId +ProcArrayGetFdwXactUnresolvedXmin(void) +{ + TransactionId xmin; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + xmin = procArray->fdwxact_unresolved_xmin; + LWLockRelease(ProcArrayLock); + + return xmin; +} #define XidCacheRemove(i) \ do { \ diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt index db47843229..adb276370c 100644 --- a/src/backend/storage/lmgr/lwlocknames.txt +++ b/src/backend/storage/lmgr/lwlocknames.txt @@ -49,3 +49,6 @@ MultiXactTruncationLock 41 OldSnapshotTimeMapLock 42 LogicalRepWorkerLock 43 CLogTruncationLock 44 +FdwXactLock 45 +FdwXactResolverLock 46 +FdwXactResolutionLock 47 diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c index fff0628e58..af5e418a03 100644 --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -35,6 +35,7 @@ #include <unistd.h> #include <sys/time.h> +#include "access/fdwxact.h" #include "access/transam.h" #include "access/twophase.h" #include "access/xact.h" @@
-421,6 +422,10 @@ InitProcess(void) MyProc->syncRepState = SYNC_REP_NOT_WAITING; SHMQueueElemInit(&(MyProc->syncRepLinks)); + /* Initialize fields for fdw xact */ + MyProc->fdwXactState = FDWXACT_NOT_WAITING; + SHMQueueElemInit(&(MyProc->fdwXactLinks)); + /* Initialize fields for group XID clearing. */ MyProc->procArrayGroupMember = false; MyProc->procArrayGroupMemberXid = InvalidTransactionId; @@ -822,6 +827,9 @@ ProcKill(int code, Datum arg) /* Make sure we're out of the sync rep lists */ SyncRepCleanupAtProcExit(); + /* Make sure we're out of the fdwxact lists */ + FdwXactCleanupAtProcExit(); + #ifdef USE_ASSERT_CHECKING { int i; diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 3b85e48333..a0f8498862 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -36,6 +36,8 @@ #include "rusagestub.h" #endif +#include "access/fdwxact_resolver.h" +#include "access/fdwxact_launcher.h" #include "access/parallel.h" #include "access/printtup.h" #include "access/xact.h" @@ -3029,6 +3031,18 @@ ProcessInterrupts(void) */ proc_exit(1); } + else if (IsFdwXactResolver()) + ereport(FATAL, + (errcode(ERRCODE_ADMIN_SHUTDOWN), + errmsg("terminating foreign transaction resolver due to administrator command"))); + else if (IsFdwXactLauncher()) + { + /* + * The foreign transaction launcher can be stopped at any time. + * Use exit status 1 so the background worker is restarted. 
+ */ + proc_exit(1); + } else if (RecoveryConflictPending && RecoveryConflictRetryable) { pgstat_report_recovery_conflict(RecoveryConflictReason); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index ba74bf9f7d..d38c33b64c 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -27,6 +27,7 @@ #endif #include "access/commit_ts.h" +#include "access/fdwxact.h" #include "access/gin.h" #include "access/rmgr.h" #include "access/tableam.h" @@ -399,6 +400,25 @@ static const struct config_enum_entry synchronous_commit_options[] = { {NULL, 0, false} }; +/* + * Although only "required", "prefer", and "disabled" are documented, + * we accept all the likely variants of "on" and "off". + */ +static const struct config_enum_entry foreign_twophase_commit_options[] = { + {"required", FOREIGN_TWOPHASE_COMMIT_REQUIRED, false}, + {"prefer", FOREIGN_TWOPHASE_COMMIT_PREFER, false}, + {"disabled", FOREIGN_TWOPHASE_COMMIT_DISABLED, false}, + {"on", FOREIGN_TWOPHASE_COMMIT_REQUIRED, false}, + {"off", FOREIGN_TWOPHASE_COMMIT_DISABLED, false}, + {"true", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true}, + {"false", FOREIGN_TWOPHASE_COMMIT_DISABLED, true}, + {"yes", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true}, + {"no", FOREIGN_TWOPHASE_COMMIT_DISABLED, true}, + {"1", FOREIGN_TWOPHASE_COMMIT_REQUIRED, true}, + {"0", FOREIGN_TWOPHASE_COMMIT_DISABLED, true}, + {NULL, 0, false} +}; + /* * Although only "on", "off", "try" are documented, we accept all the likely * variants of "on" and "off". 
@@ -725,6 +745,12 @@ const char *const config_group_names[] = gettext_noop("Client Connection Defaults / Other Defaults"), /* LOCK_MANAGEMENT */ gettext_noop("Lock Management"), + /* FDWXACT */ + gettext_noop("Foreign Transaction Management"), + /* FDWXACT_SETTINGS */ + gettext_noop("Foreign Transaction Management / Settings"), + /* FDWXACT_RESOLVER */ + gettext_noop("Foreign Transaction Management / Resolver"), /* COMPAT_OPTIONS */ gettext_noop("Version and Platform Compatibility"), /* COMPAT_OPTIONS_PREVIOUS */ @@ -2370,6 +2396,52 @@ static struct config_int ConfigureNamesInt[] = NULL, NULL, NULL }, + /* + * See also CheckRequiredParameterValues() if this parameter changes + */ + { + {"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."), + NULL + }, + &max_prepared_foreign_xacts, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"foreign_transaction_resolver_timeout", PGC_SIGHUP, FDWXACT_RESOLVER, + gettext_noop("Sets the maximum time to wait for foreign transaction resolution."), + NULL, + GUC_UNIT_MS + }, + &foreign_xact_resolver_timeout, + 60 * 1000, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"max_foreign_transaction_resolvers", PGC_POSTMASTER, RESOURCES_MEM, + gettext_noop("Maximum number of foreign transaction resolution processes."), + NULL + }, + &max_foreign_xact_resolvers, + 0, 0, INT_MAX, + NULL, NULL, NULL + }, + + { + {"foreign_transaction_resolution_retry_interval", PGC_SIGHUP, FDWXACT_RESOLVER, + gettext_noop("Sets the time to wait before retrying to resolve foreign transaction " + "after a failed attempt."), + NULL, + GUC_UNIT_MS + }, + &foreign_xact_resolution_retry_interval, + 5000, 1, INT_MAX, + NULL, NULL, NULL + }, + #ifdef LOCK_DEBUG { {"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS, @@ -4413,6 +4485,16 @@ static struct config_enum ConfigureNamesEnum[] = NULL, assign_synchronous_commit, NULL }, + { + 
{"foreign_twophase_commit", PGC_USERSET, FDWXACT_SETTINGS, + gettext_noop("Use of foreign twophase commit for the current transaction."), + NULL + }, + &foreign_twophase_commit, + FOREIGN_TWOPHASE_COMMIT_DISABLED, foreign_twophase_commit_options, + check_foreign_twophase_commit, NULL, NULL + }, + { {"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING, gettext_noop("Allows archiving of WAL files using archive_command."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 9541879c1f..22e014aecd 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -125,6 +125,8 @@ #temp_buffers = 8MB # min 800kB #max_prepared_transactions = 0 # zero disables the feature # (change requires restart) +#max_prepared_foreign_transactions = 0 # zero disables the feature + # (change requires restart) # Caution: it is not advisable to set max_prepared_transactions nonzero unless # you actively intend to use prepared transactions. 
#work_mem = 4MB # min 64kB @@ -341,6 +343,20 @@ #max_sync_workers_per_subscription = 2 # taken from max_logical_replication_workers +#------------------------------------------------------------------------------ +# FOREIGN TRANSACTION +#------------------------------------------------------------------------------ + +#foreign_twophase_commit = off + +#max_foreign_transaction_resolvers = 0 # max number of resolver processes + # (change requires restart) +#foreign_transaction_resolver_timeout = 60s # in milliseconds; 0 disables +#foreign_transaction_resolution_retry_interval = 5s # time to wait before + # retrying to resolve + # foreign transactions + # after a failed attempt + #------------------------------------------------------------------------------ # QUERY TUNING #------------------------------------------------------------------------------ diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d index f08a49c9dd..dd8878025b 100644 --- a/src/backend/utils/probes.d +++ b/src/backend/utils/probes.d @@ -81,6 +81,8 @@ provider postgresql { probe multixact__checkpoint__done(bool); probe twophase__checkpoint__start(); probe twophase__checkpoint__done(); + probe fdwxact__checkpoint__start(); + probe fdwxact__checkpoint__done(); probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 1f6d8939be..49dc5a519f 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -210,6 +210,7 @@ static const char *const subdirs[] = { "pg_snapshots", "pg_subtrans", "pg_twophase", + "pg_fdwxact", "pg_multixact", "pg_multixact/members", "pg_multixact/offsets", diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index 19e21ab491..9ae3bfe4dd 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -301,6 +301,8
@@ main(int argc, char *argv[]) ControlFile->max_wal_senders); printf(_("max_prepared_xacts setting: %d\n"), ControlFile->max_prepared_xacts); + printf(_("max_prepared_foreign_transactions setting: %d\n"), + ControlFile->max_prepared_foreign_xacts); printf(_("max_locks_per_xact setting: %d\n"), ControlFile->max_locks_per_xact); printf(_("track_commit_timestamp setting: %s\n"), diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 2e286f6339..c5ee22132e 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -710,6 +710,7 @@ GuessControlValues(void) ControlFile.max_wal_senders = 10; ControlFile.max_worker_processes = 8; ControlFile.max_prepared_xacts = 0; + ControlFile.max_prepared_foreign_xacts = 0; ControlFile.max_locks_per_xact = 64; ControlFile.maxAlign = MAXIMUM_ALIGNOF; @@ -914,6 +915,7 @@ RewriteControlFile(void) ControlFile.max_wal_senders = 10; ControlFile.max_worker_processes = 8; ControlFile.max_prepared_xacts = 0; + ControlFile.max_prepared_foreign_xacts = 0; ControlFile.max_locks_per_xact = 64; /* The control file gets flushed here. 
*/ diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c new file mode 120000 index 0000000000..ce8c21880c --- /dev/null +++ b/src/bin/pg_waldump/fdwxactdesc.c @@ -0,0 +1 @@ +../../../src/backend/access/rmgrdesc/fdwxactdesc.c \ No newline at end of file diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c index 852d8ca4b1..b616cea347 100644 --- a/src/bin/pg_waldump/rmgrdesc.c +++ b/src/bin/pg_waldump/rmgrdesc.c @@ -11,6 +11,7 @@ #include "access/brin_xlog.h" #include "access/clog.h" #include "access/commit_ts.h" +#include "access/fdwxact_xlog.h" #include "access/generic_xlog.h" #include "access/ginxlog.h" #include "access/gistxlog.h" diff --git a/src/include/access/fdwxact.h b/src/include/access/fdwxact.h new file mode 100644 index 0000000000..147d41c708 --- /dev/null +++ b/src/include/access/fdwxact.h @@ -0,0 +1,165 @@ +/* + * fdwxact.h + * + * PostgreSQL global transaction manager + * + * Portions Copyright (c) 2018, PostgreSQL Global Development Group + * + * src/include/access/fdwxact.h + */ +#ifndef FDWXACT_H +#define FDWXACT_H + +#include "access/fdwxact_xlog.h" +#include "access/xlogreader.h" +#include "foreign/foreign.h" +#include "lib/stringinfo.h" +#include "miscadmin.h" +#include "nodes/pg_list.h" +#include "nodes/execnodes.h" +#include "storage/backendid.h" +#include "storage/proc.h" +#include "storage/shmem.h" +#include "utils/guc.h" +#include "utils/timeout.h" +#include "utils/timestamp.h" + +/* fdwXactState */ +#define FDWXACT_NOT_WAITING 0 +#define FDWXACT_WAITING 1 +#define FDWXACT_WAIT_COMPLETE 2 + +/* Flag passed to FDW transaction management APIs */ +#define FDWXACT_FLAG_ONEPHASE 0x01 /* transaction can commit/rollback + without preparation */ + +/* Enum for foreign_twophase_commit parameter */ +typedef enum +{ + FOREIGN_TWOPHASE_COMMIT_DISABLED, /* disable foreign twophase commit */ + FOREIGN_TWOPHASE_COMMIT_PREFER, /* use twophase commit where available */ + FOREIGN_TWOPHASE_COMMIT_REQUIRED /* 
all foreign servers have to support + twophase commit */ } ForeignTwophaseCommitLevel; + +/* Enum to track the status of a foreign transaction */ +typedef enum +{ + FDWXACT_STATUS_INVALID, + FDWXACT_STATUS_INITIAL, + FDWXACT_STATUS_PREPARING, /* foreign transaction is being prepared */ + FDWXACT_STATUS_PREPARED, /* foreign transaction is prepared */ + FDWXACT_STATUS_COMMITTING, /* foreign prepared transaction is to + * be committed */ + FDWXACT_STATUS_ABORTING, /* foreign prepared transaction is to be + * aborted */ + FDWXACT_STATUS_RESOLVED +} FdwXactStatus; + +typedef struct FdwXactData *FdwXact; + +/* + * Shared memory state of a single foreign transaction. + */ +typedef struct FdwXactData +{ + FdwXact fdwxact_free_next; /* Next free FdwXact entry */ + + Oid dbid; /* database oid where to find foreign server + * and user mapping */ + TransactionId local_xid; /* XID of local transaction */ + Oid serverid; /* foreign server where transaction takes + * place */ + Oid userid; /* user who initiated the foreign + * transaction */ + Oid umid; + bool indoubt; /* Is an in-doubt transaction? */ + slock_t mutex; /* Protect the above fields */ + + /* The status of the foreign transaction, protected by FdwXactLock */ + FdwXactStatus status; + /* + * Note that we need to keep track of two LSNs for each FdwXact. We keep + * track of the start LSN because this is the address we must use to read + * state data back from WAL when committing a FdwXact. We keep track of + * the end LSN because that is the LSN we need to wait for prior to + * commit. + */ + XLogRecPtr insert_start_lsn; /* XLOG offset where inserting this entry starts */ + XLogRecPtr insert_end_lsn; /* XLOG offset where inserting this entry ends */ + + bool valid; /* has the entry been completed and written to file?
*/ + BackendId held_by; /* backend that is holding this entry */ + bool ondisk; /* true if prepare state file is on disk */ + bool inredo; /* true if entry was added via xlog_redo */ + + char fdwxact_id[FDWXACT_ID_MAX_LEN]; /* prepared transaction identifier */ +} FdwXactData; + +/* + * Shared memory layout for maintaining foreign prepared transaction entries. + * Adding or removing an FdwXact entry needs to hold FdwXactLock in exclusive + * mode, and iterating fdwXacts needs it in shared mode. + */ +typedef struct +{ + /* Head of linked list of free FdwXactData structs */ + FdwXact free_fdwxacts; + + /* Number of valid foreign transaction entries */ + int num_fdwxacts; + + /* Up to max_prepared_foreign_xacts entries in the array */ + FdwXact fdwxacts[FLEXIBLE_ARRAY_MEMBER]; /* Variable length array */ +} FdwXactCtlData; + +/* Pointer to the shared memory holding the foreign transaction data */ +extern FdwXactCtlData *FdwXactCtl; + +/* State data for foreign transaction resolution, passed to FDW callbacks */ +typedef struct FdwXactRslvState +{ + /* Foreign transaction information */ + char *fdwxact_id; + + ForeignServer *server; + UserMapping *usermapping; + + int flags; /* OR of FDWXACT_FLAG_xx flags */ +} FdwXactRslvState; + +/* GUC parameters */ +extern int max_prepared_foreign_xacts; +extern int max_foreign_xact_resolvers; +extern int foreign_xact_resolution_retry_interval; +extern int foreign_xact_resolver_timeout; +extern int foreign_twophase_commit; + +/* Function declarations */ +extern Size FdwXactShmemSize(void); +extern void FdwXactShmemInit(void); +extern void restoreFdwXactData(void); +extern TransactionId PrescanFdwXacts(TransactionId oldestActiveXid); +extern void RecoverFdwXacts(void); +extern void AtEOXact_FdwXacts(bool is_commit); +extern void AtPrepare_FdwXacts(void); +extern bool fdwxact_exists(Oid dboid, Oid serverid, Oid userid); +extern void CheckPointFdwXacts(XLogRecPtr redo_horizon); +extern bool FdwTwoPhaseNeeded(void); +extern void PreCommit_FdwXacts(void);
+extern void KnownFdwXactRecreateFiles(XLogRecPtr redo_horizon); +extern void FdwXactWaitToBeResolved(TransactionId wait_xid, bool commit); +extern bool FdwXactIsForeignTwophaseCommitRequired(void); +extern void FdwXactResolveTransactionAndReleaseWaiter(Oid dbid, TransactionId xid, + PGPROC *waiter); +extern bool FdwXactResolveInDoubtTransactions(Oid dbid); +extern PGPROC *FdwXactGetWaiter(TimestampTz *nextResolutionTs_p, TransactionId *waitXid_p); +extern void FdwXactCleanupAtProcExit(void); +extern void RegisterFdwXactByRelId(Oid relid, bool modified); +extern void RegisterFdwXactByServerId(Oid serverid, bool modified); +extern void FdwXactMarkForeignServerAccessed(Oid relid, bool modified); +extern bool check_foreign_twophase_commit(int *newval, void **extra, + GucSource source); +extern bool FdwXactWaiterExists(Oid dbid); + +#endif /* FDWXACT_H */ diff --git a/src/include/access/fdwxact_launcher.h b/src/include/access/fdwxact_launcher.h new file mode 100644 index 0000000000..dd0f5d16ff --- /dev/null +++ b/src/include/access/fdwxact_launcher.h @@ -0,0 +1,29 @@ +/*------------------------------------------------------------------------- + * + * fdwxact_launcher.h + * PostgreSQL foreign transaction launcher definitions + * + * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * src/include/access/fdwxact_launcher.h + * + *------------------------------------------------------------------------- + */ + +#ifndef FDWXACT_LAUNCHER_H +#define FDWXACT_LAUNCHER_H + +#include "access/fdwxact.h" + +extern void FdwXactLauncherRegister(void); +extern void FdwXactLauncherMain(Datum main_arg); +extern void FdwXactLauncherRequestToLaunch(void); +extern void FdwXactLauncherRequestToLaunchForRetry(void); +extern void FdwXactLaunchOrWakeupResolver(void); +extern Size FdwXactRslvShmemSize(void); +extern void FdwXactRslvShmemInit(void); +extern bool IsFdwXactLauncher(void); + + +#endif /* FDWXACT_LAUNCHER_H */ diff --git 
a/src/include/access/fdwxact_resolver.h b/src/include/access/fdwxact_resolver.h new file mode 100644 index 0000000000..2607654024 --- /dev/null +++ b/src/include/access/fdwxact_resolver.h @@ -0,0 +1,23 @@ +/*------------------------------------------------------------------------- + * + * fdwxact_resolver.h + * PostgreSQL foreign transaction resolver definitions + * + * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * src/include/access/fdwxact_resolver.h + * + *------------------------------------------------------------------------- + */ +#ifndef FDWXACT_RESOLVER_H +#define FDWXACT_RESOLVER_H + +#include "access/fdwxact.h" + +extern void FdwXactResolverMain(Datum main_arg); +extern bool IsFdwXactResolver(void); + +extern int foreign_xact_resolver_timeout; + +#endif /* FDWXACT_RESOLVER_H */ diff --git a/src/include/access/fdwxact_xlog.h b/src/include/access/fdwxact_xlog.h new file mode 100644 index 0000000000..39ca66beef --- /dev/null +++ b/src/include/access/fdwxact_xlog.h @@ -0,0 +1,54 @@ +/*------------------------------------------------------------------------- + * + * fdwxact_xlog.h + * Foreign transaction XLOG definitions. 
+ * + * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * src/include/access/fdwxact_xlog.h + * + *------------------------------------------------------------------------- + */ +#ifndef FDWXACT_XLOG_H +#define FDWXACT_XLOG_H + +#include "access/xlogreader.h" +#include "lib/stringinfo.h" + +/* Info types for logs related to FDW transactions */ +#define XLOG_FDWXACT_INSERT 0x00 +#define XLOG_FDWXACT_REMOVE 0x10 + +/* Maximum length of the prepared transaction id, borrowed from twophase.c */ +#define FDWXACT_ID_MAX_LEN 200 + +/* + * On-disk file structure, also used for WAL + */ +typedef struct +{ + TransactionId local_xid; + Oid dbid; /* database oid where to find foreign server + * and user mapping */ + Oid serverid; /* foreign server where transaction takes + * place */ + Oid userid; /* user who initiated the foreign transaction */ + Oid umid; + char fdwxact_id[FDWXACT_ID_MAX_LEN]; /* foreign txn prepare id */ +} FdwXactOnDiskData; + +typedef struct xl_fdwxact_remove +{ + TransactionId xid; + Oid serverid; + Oid userid; + Oid dbid; + bool force; +} xl_fdwxact_remove; + +extern void fdwxact_redo(XLogReaderState *record); +extern void fdwxact_desc(StringInfo buf, XLogReaderState *record); +extern const char *fdwxact_identify(uint8 info); + +#endif /* FDWXACT_XLOG_H */ diff --git a/src/include/access/resolver_internal.h b/src/include/access/resolver_internal.h new file mode 100644 index 0000000000..55fc970b69 --- /dev/null +++ b/src/include/access/resolver_internal.h @@ -0,0 +1,66 @@ +/*------------------------------------------------------------------------- + * + * resolver_internal.h + * Internal headers shared by fdwxact resolvers. 
+ * + * Portions Copyright (c) 2019, PostgreSQL Global Development Group + * + * src/include/access/resolver_internal.h + * + *------------------------------------------------------------------------- + */ + +#ifndef RESOLVER_INTERNAL_H +#define RESOLVER_INTERNAL_H + +#include "storage/latch.h" +#include "storage/shmem.h" +#include "storage/spin.h" +#include "utils/timestamp.h" + +/* + * Each foreign transaction resolver has a FdwXactResolver struct in + * shared memory. This struct is protected by FdwXactResolverLaunchLock. + */ +typedef struct FdwXactResolver +{ + pid_t pid; /* this resolver's PID, or 0 if not active */ + Oid dbid; /* database oid */ + + /* Indicates if this slot is used or free */ + bool in_use; + + /* Stats */ + TimestampTz last_resolved_time; + + /* Protect shared variables shown above */ + slock_t mutex; + + /* + * Pointer to the resolver's latch. Used by backends to wake up this + * resolver when it has work to do. NULL if the resolver isn't active. + */ + Latch *latch; +} FdwXactResolver; + +/* There is one FdwXactRslvCtlData struct for the whole database cluster */ +typedef struct FdwXactRslvCtlData +{ + /* Foreign transaction resolution queue. 
Protected by FdwXactLock */ + SHM_QUEUE fdwxact_queue; + + /* Supervisor process and latch */ + pid_t launcher_pid; + Latch *launcher_latch; + + FdwXactResolver resolvers[FLEXIBLE_ARRAY_MEMBER]; +} FdwXactRslvCtlData; +#define SizeOfFdwXactRslvCtlData \ + (offsetof(FdwXactRslvCtlData, resolvers) + sizeof(FdwXactResolver)) + +extern FdwXactRslvCtlData *FdwXactRslvCtl; + +extern FdwXactResolver *MyFdwXactResolver; +extern FdwXactRslvCtlData *FdwXactRslvCtl; + +#endif /* RESOLVER_INTERNAL_H */ diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h index 3c0db2ccf5..5798b4cd99 100644 --- a/src/include/access/rmgrlist.h +++ b/src/include/access/rmgrlist.h @@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL) PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask) PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL) +PG_RMGR(RM_FDWXACT_ID, "Foreign Transactions", fdwxact_redo, fdwxact_desc, fdwxact_identify, NULL, NULL, NULL) diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h index 02b5315c43..e8c094d708 100644 --- a/src/include/access/twophase.h +++ b/src/include/access/twophase.h @@ -36,6 +36,7 @@ extern void PostPrepare_Twophase(void); extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid, bool lock_held); extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid, bool lock_held); +extern bool TwoPhaseExists(TransactionId xid); extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid, TimestampTz prepared_at, diff --git a/src/include/access/xact.h b/src/include/access/xact.h index cb5c4935d2..a75e6998f0 100644 --- a/src/include/access/xact.h +++ b/src/include/access/xact.h @@ -108,6 +108,13 @@ extern int MyXactFlags; */ #define 
XACT_FLAGS_WROTENONTEMPREL (1U << 2) +/* + * XACT_FLAGS_FDWNOPREPARE - set when we wrote data on a foreign table whose + * server isn't capable of two-phase + * commit. + */ +#define XACT_FLAGS_FDWNOPREPARE (1U << 3) + /* * start- and end-of-transaction callbacks for dynamically loaded modules */ diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index e295dc65fb..d1ce20242f 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -232,6 +232,7 @@ typedef struct xl_parameter_change int max_worker_processes; int max_wal_senders; int max_prepared_xacts; + int max_prepared_foreign_xacts; int max_locks_per_xact; int wal_level; bool wal_log_hints; diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index cf7d4485e9..f2174a0208 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -179,6 +179,7 @@ typedef struct ControlFileData int max_worker_processes; int max_wal_senders; int max_prepared_xacts; + int max_prepared_foreign_xacts; int max_locks_per_xact; bool track_commit_timestamp; diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index ac8f64b219..1072c38aa6 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -5184,6 +5184,13 @@ proargmodes => '{i,o,o,o,o,o,o,o,o}', proargnames => '{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}', prosrc => 'pg_stat_get_subscription' }, +{ oid => '9705', descr => 'statistics: information about foreign transaction resolver', + proname => 'pg_stat_get_foreign_xact', proisstrict => 'f', provolatile => 's', + proparallel => 'r', prorettype => 'record', proargtypes => '', + proallargtypes => '{oid,oid,timestamptz}', + proargmodes => '{o,o,o}', + proargnames => '{pid,dbid,last_resolved_time}', + prosrc => 'pg_stat_get_foreign_xact' }, { oid => '2026', descr => 'statistics: 
current backend PID', proname => 'pg_backend_pid', provolatile => 's', proparallel => 'r', prorettype => 'int4', proargtypes => '', prosrc => 'pg_backend_pid' }, @@ -5897,6 +5904,24 @@ proargnames => '{type,object_names,object_args,classid,objid,objsubid}', prosrc => 'pg_get_object_address' }, +{ oid => '9706', descr => 'view foreign transactions', + proname => 'pg_foreign_xacts', prorows => '1000', proretset => 't', + provolatile => 'v', prorettype => 'record', proargtypes => '', + proallargtypes => '{oid,xid,oid,oid,text,bool,text}', + proargmodes => '{o,o,o,o,o,o,o}', + proargnames => '{dbid,xid,serverid,userid,status,in_doubt,identifier}', + prosrc => 'pg_foreign_xacts' }, +{ oid => '9707', descr => 'remove foreign transaction without resolution', + proname => 'pg_remove_foreign_xact', provolatile => 'v', prorettype => 'bool', + proargtypes => 'xid oid oid', + proargnames => '{xid,serverid,userid}', + prosrc => 'pg_remove_foreign_xact' }, +{ oid => '9708', descr => 'resolve one foreign transaction', + proname => 'pg_resolve_foreign_xact', provolatile => 'v', prorettype => 'bool', + proargtypes => 'xid oid oid', + proargnames => '{xid,serverid,userid}', + prosrc => 'pg_resolve_foreign_xact' }, + { oid => '2079', descr => 'is table visible in search path?', proname => 'pg_table_is_visible', procost => '10', provolatile => 's', prorettype => 'bool', proargtypes => 'oid', prosrc => 'pg_table_is_visible' }, @@ -6015,6 +6040,10 @@ { oid => '2851', descr => 'wal filename, given a wal location', proname => 'pg_walfile_name', prorettype => 'text', proargtypes => 'pg_lsn', prosrc => 'pg_walfile_name' }, +{ oid => '9709', + descr => 'stop a foreign transaction resolver process running on the given database', + proname => 'pg_stop_foreign_xact_resolver', provolatile => 'v', prorettype => 'bool', + proargtypes => 'oid', prosrc => 'pg_stop_foreign_xact_resolver'}, { oid => '3165', descr => 'difference in bytes, given two wal locations', proname => 'pg_wal_lsn_diff', 
prorettype => 'numeric', diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h index 822686033e..c7b33d72ec 100644 --- a/src/include/foreign/fdwapi.h +++ b/src/include/foreign/fdwapi.h @@ -12,6 +12,7 @@ #ifndef FDWAPI_H #define FDWAPI_H +#include "access/fdwxact.h" #include "access/parallel.h" #include "nodes/execnodes.h" #include "nodes/pathnodes.h" @@ -169,6 +170,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root, typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root, List *fdw_private, RelOptInfo *child_rel); +typedef void (*PrepareForeignTransaction_function) (FdwXactRslvState *frstate); +typedef void (*CommitForeignTransaction_function) (FdwXactRslvState *frstate); +typedef void (*RollbackForeignTransaction_function) (FdwXactRslvState *frstate); +typedef char *(*GetPrepareId_function) (TransactionId xid, Oid serverid, + Oid userid, int *prep_id_len); /* * FdwRoutine is the struct returned by a foreign-data wrapper's handler @@ -236,6 +242,12 @@ typedef struct FdwRoutine /* Support functions for IMPORT FOREIGN SCHEMA */ ImportForeignSchema_function ImportForeignSchema; + /* Support functions for transaction management */ + PrepareForeignTransaction_function PrepareForeignTransaction; + CommitForeignTransaction_function CommitForeignTransaction; + RollbackForeignTransaction_function RollbackForeignTransaction; + GetPrepareId_function GetPrepareId; + /* Support functions for parallelism under Gather node */ IsForeignScanParallelSafe_function IsForeignScanParallelSafe; EstimateDSMForeignScan_function EstimateDSMForeignScan; diff --git a/src/include/foreign/foreign.h b/src/include/foreign/foreign.h index 4de157c19c..91c2276915 100644 --- a/src/include/foreign/foreign.h +++ b/src/include/foreign/foreign.h @@ -69,6 +69,7 @@ extern ForeignServer *GetForeignServerExtended(Oid serverid, bits16 flags); extern ForeignServer *GetForeignServerByName(const char *name, bool missing_ok); extern UserMapping 
*GetUserMapping(Oid userid, Oid serverid); +extern UserMapping *GetUserMappingByOid(Oid umid); extern ForeignDataWrapper *GetForeignDataWrapper(Oid fdwid); extern ForeignDataWrapper *GetForeignDataWrapperExtended(Oid fdwid, bits16 flags); diff --git a/src/include/pgstat.h b/src/include/pgstat.h index fe076d823d..d82d8f7abc 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -776,6 +776,8 @@ typedef enum WAIT_EVENT_BGWRITER_HIBERNATE, WAIT_EVENT_BGWRITER_MAIN, WAIT_EVENT_CHECKPOINTER_MAIN, + WAIT_EVENT_FDWXACT_RESOLVER_MAIN, + WAIT_EVENT_FDWXACT_LAUNCHER_MAIN, WAIT_EVENT_LOGICAL_APPLY_MAIN, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN, WAIT_EVENT_PGSTAT_MAIN, @@ -853,7 +855,9 @@ typedef enum WAIT_EVENT_REPLICATION_ORIGIN_DROP, WAIT_EVENT_REPLICATION_SLOT_DROP, WAIT_EVENT_SAFE_SNAPSHOT, - WAIT_EVENT_SYNC_REP + WAIT_EVENT_SYNC_REP, + WAIT_EVENT_FDWXACT, + WAIT_EVENT_FDWXACT_RESOLUTION } WaitEventIPC; /* ---------- @@ -933,6 +937,9 @@ typedef enum WAIT_EVENT_TWOPHASE_FILE_READ, WAIT_EVENT_TWOPHASE_FILE_SYNC, WAIT_EVENT_TWOPHASE_FILE_WRITE, + WAIT_EVENT_FDWXACT_FILE_READ, + WAIT_EVENT_FDWXACT_FILE_WRITE, + WAIT_EVENT_FDWXACT_FILE_SYNC, WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ, WAIT_EVENT_WAL_BOOTSTRAP_SYNC, WAIT_EVENT_WAL_BOOTSTRAP_WRITE, diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h index 281e1db725..c802201193 100644 --- a/src/include/storage/proc.h +++ b/src/include/storage/proc.h @@ -16,6 +16,7 @@ #include "access/clog.h" #include "access/xlogdefs.h" +#include "datatype/timestamp.h" #include "lib/ilist.h" #include "storage/latch.h" #include "storage/lock.h" @@ -152,6 +153,16 @@ struct PGPROC int syncRepState; /* wait state for sync rep */ SHM_QUEUE syncRepLinks; /* list link if process is in syncrep queue */ + /* + * Info to allow us to wait for foreign transaction to be resolved, if + * needed. 
+ */ + TransactionId fdwXactWaitXid; /* waiting for foreign transaction involved with + * this transaction id to be resolved */ + int fdwXactState; /* wait state for foreign transaction resolution */ + SHM_QUEUE fdwXactLinks; /* list link if process is in queue */ + TimestampTz fdwXactNextResolutionTs; + /* * All PROCLOCK objects for locks held or awaited by this backend are * linked into one of these lists, according to the partition number of diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h index 8f67b860e7..deb293c1a9 100644 --- a/src/include/storage/procarray.h +++ b/src/include/storage/procarray.h @@ -36,6 +36,8 @@ #define PROCARRAY_SLOTS_XMIN 0x20 /* replication slot xmin, * catalog_xmin */ +#define PROCARRAY_FDWXACT_XMIN 0x40 /* unresolved distributed + transaction xmin */ /* * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching * PGXACT->vacuumFlags. Other flags are used for different purposes and @@ -125,4 +127,7 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin, extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin, TransactionId *catalog_xmin); + +extern void ProcArraySetFdwXactUnresolvedXmin(TransactionId xmin); +extern TransactionId ProcArrayGetFdwXactUnresolvedXmin(void); #endif /* PROCARRAY_H */ diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h index d68976fafa..d5fec50969 100644 --- a/src/include/utils/guc_tables.h +++ b/src/include/utils/guc_tables.h @@ -96,6 +96,9 @@ enum config_group CLIENT_CONN_PRELOAD, CLIENT_CONN_OTHER, LOCK_MANAGEMENT, + FDWXACT, + FDWXACT_SETTINGS, + FDWXACT_RESOLVER, COMPAT_OPTIONS, COMPAT_OPTIONS_PREVIOUS, COMPAT_OPTIONS_CLIENT, diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out index c9cc569404..ed229d5a67 100644 --- a/src/test/regress/expected/rules.out +++ b/src/test/regress/expected/rules.out @@ -1341,6 +1341,14 @@ pg_file_settings| SELECT a.sourcefile, a.applied, a.error FROM 
pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error); +pg_foreign_xacts| SELECT f.dbid, + f.xid, + f.serverid, + f.userid, + f.status, + f.in_doubt, + f.identifier + FROM pg_foreign_xacts() f(dbid, xid, serverid, userid, status, in_doubt, identifier); pg_group| SELECT pg_authid.rolname AS groname, pg_authid.oid AS grosysid, ARRAY( SELECT pg_auth_members.member @@ -1841,6 +1849,11 @@ pg_stat_database_conflicts| SELECT d.oid AS datid, pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin, pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock FROM pg_database d; +pg_stat_foreign_xact| SELECT r.pid, + r.dbid, + r.last_resolved_time + FROM pg_stat_get_foreign_xact() r(pid, dbid, last_resolved_time) + WHERE (r.pid IS NOT NULL); pg_stat_gssapi| SELECT s.pid, s.gss_auth AS gss_authenticated, s.gss_princ AS principal, -- 2.23.0 From 3363abd531595233fb59e0ab6078a011ab8060e9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 5 Dec 2019 17:01:08 +0900 Subject: [PATCH v26 3/5] Documentation update. 
Original Author: Masahiko Sawada <sawada.mshk@gmail.com> --- doc/src/sgml/catalogs.sgml | 145 +++++++++++++ doc/src/sgml/config.sgml | 146 ++++++++++++- doc/src/sgml/distributed-transaction.sgml | 158 +++++++++++++++ doc/src/sgml/fdwhandler.sgml | 236 ++++++++++++++++++++++ doc/src/sgml/filelist.sgml | 1 + doc/src/sgml/func.sgml | 89 ++++++++ doc/src/sgml/monitoring.sgml | 60 ++++++ doc/src/sgml/postgres.sgml | 1 + doc/src/sgml/storage.sgml | 6 + 9 files changed, 841 insertions(+), 1 deletion(-) create mode 100644 doc/src/sgml/distributed-transaction.sgml diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 55694c4368..1b720da03d 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -8267,6 +8267,11 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <entry>open cursors</entry> </row> + <row> + <entry><link linkend="view-pg-foreign-xacts"><structname>pg_foreign_xacts</structname></link></entry> + <entry>foreign transactions</entry> + </row> + <row> <entry><link linkend="view-pg-file-settings"><structname>pg_file_settings</structname></link></entry> <entry>summary of configuration file contents</entry> @@ -9712,6 +9717,146 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx </sect1> + <sect1 id="view-pg-foreign-xacts"> + <title><structname>pg_foreign_xacts</structname></title> + + <indexterm zone="view-pg-foreign-xacts"> + <primary>pg_foreign_xacts</primary> + </indexterm> + + <para> + The view <structname>pg_foreign_xacts</structname> displays + information about foreign transactions that are opened on + foreign servers for atomic distributed transaction commit (see + <xref linkend="atomic-commit"/> for details). + </para> + + <para> + <structname>pg_foreign_xacts</structname> contains one row per foreign + transaction. An entry is removed when the foreign transaction is + committed or rolled back. 
+ </para> + + <table> + <title><structname>pg_foreign_xacts</structname> Columns</title> + + <tgroup cols="4"> + <thead> + <row> + <entry>Name</entry> + <entry>Type</entry> + <entry>References</entry> + <entry>Description</entry> + </row> + </thead> + <tbody> + <row> + <entry><structfield>dbid</structfield></entry> + <entry><type>oid</type></entry> + <entry><literal><link linkend="catalog-pg-database"><structname>pg_database</structname></link>.oid</literal></entry> + <entry> + OID of the database in which the foreign transaction resides + </entry> + </row> + <row> + <entry><structfield>xid</structfield></entry> + <entry><type>xid</type></entry> + <entry></entry> + <entry> + Numeric identifier of the local transaction with which this foreign + transaction is associated + </entry> + </row> + <row> + <entry><structfield>serverid</structfield></entry> + <entry><type>oid</type></entry> + <entry><literal><link linkend="catalog-pg-foreign-server"><structname>pg_foreign_server</structname></link>.oid</literal></entry> + <entry> + The OID of the foreign server on which the foreign transaction is prepared + </entry> + </row> + <row> + <entry><structfield>userid</structfield></entry> + <entry><type>oid</type></entry> + <entry><literal><link linkend="view-pg-user"><structname>pg_user</structname></link>.oid</literal></entry> + <entry> + The OID of the user that prepared this foreign transaction. + </entry> + </row> + <row> + <entry><structfield>status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + <entry> + Status of the foreign transaction. Possible values are: + <itemizedlist> + <listitem> + <para> + <literal>initial</literal> : Initial status. + </para> + </listitem> + <listitem> + <para> + <literal>preparing</literal> : This foreign transaction is being prepared. + </para> + </listitem> + <listitem> + <para> + <literal>prepared</literal> : This foreign transaction has been prepared. 
+ </para> + </listitem> + <listitem> + <para> + <literal>committing</literal> : This foreign transaction is being committed. + </para> + </listitem> + <listitem> + <para> + <literal>aborting</literal> : This foreign transaction is being aborted. + </para> + </listitem> + <listitem> + <para> + <literal>resolved</literal> : This foreign transaction has been resolved. + </para> + </listitem> + </itemizedlist> + </entry> + </row> + <row> + <entry><structfield>in_doubt</structfield></entry> + <entry><type>boolean</type></entry> + <entry></entry> + <entry> + If <literal>true</literal>, this foreign transaction is in in-doubt state and + needs to be resolved by calling the + <function>pg_resolve_foreign_xact</function> function. + </entry> + </row> + <row> + <entry><structfield>identifier</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + <entry> + The identifier of the prepared foreign transaction. + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + When the <structname>pg_foreign_xacts</structname> view is accessed, the + internal transaction manager data structures are momentarily locked, and + a copy is made for the view to display. This ensures that the + view produces a consistent set of results, while not blocking + normal operations longer than necessary. Nonetheless + there could be some impact on database performance if this view is + frequently accessed. 
+ </para> + + </sect1> + <sect1 id="view-pg-publication-tables"> <title><structname>pg_publication_tables</structname></title> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 53ac14490a..69778750f3 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4378,7 +4378,6 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class=" </variablelist> </sect2> - </sect1> <sect1 id="runtime-config-query"> @@ -8818,6 +8817,151 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' </variablelist> </sect1> + <sect1 id="runtime-config-distributed-transaction"> + <title>Distributed Transaction Management</title> + + <sect2 id="runtime-config-distributed-transaction-settings"> + <title>Settings</title> + <variablelist> + + <varlistentry id="guc-foreign-twophase-commit" xreflabel="foreign_twophase_commit"> + <term><varname>foreign_twophase_commit</varname> (<type>enum</type>) + <indexterm> + <primary><varname>foreign_twophase_commit</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies whether transaction commit will wait for all involved foreign + transactions to be resolved before the command returns a "success" + indication to the client. Valid values are <literal>required</literal>, + <literal>prefer</literal> and <literal>disabled</literal>. The default + setting is <literal>disabled</literal>. A setting of + <literal>disabled</literal> does not use the two-phase commit protocol to + commit or roll back distributed transactions. When set to + <literal>required</literal>, the distributed transaction strictly + requires that all servers written to can use the two-phase commit protocol. + That is, the distributed transaction cannot commit if even one server + does not support the transaction management callback routines + (described in <xref linkend="fdw-callbacks-transaction-managements"/>). 
+ When set to <literal>prefer</literal>, the distributed transaction uses the + two-phase commit protocol only on servers where it is available, and + commits directly on the others. Note that with <literal>disabled</literal> or + <literal>prefer</literal> there is a risk of inconsistency among the + servers involved in the distributed transaction if a foreign server + crashes while the distributed transaction is being committed. + </para> + + <para> + Both <varname>max_prepared_foreign_transactions</varname> and + <varname>max_foreign_transaction_resolvers</varname> must be set to a + non-zero value to set this parameter to either <literal>required</literal> or + <literal>prefer</literal>. + </para> + + <para> + This parameter can be changed at any time; the behavior for any one + transaction is determined by the setting in effect when it commits. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions"> + <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_prepared_foreign_transactions</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Sets the maximum number of foreign transactions that can be prepared + simultaneously. A single local transaction can give rise to multiple + foreign transactions. If <literal>N</literal> local transactions each + span <literal>K</literal> foreign servers, this value needs to be at least + <literal>N * K</literal>, not just <literal>N</literal>. + This parameter can only be set at server start. + </para> + <para> + When running a standby server, you must set this parameter to a value + equal to or higher than on the master server. Otherwise, queries + will not be allowed in the standby server. 
+ </para> + </listitem> + </varlistentry> + + </variablelist> + </sect2> + + <sect2 id="runtime-config-foreign-transaction-resolver"> + <title>Foreign Transaction Resolvers</title> + + <para> + These settings control the behavior of a foreign transaction resolver. + </para> + + <variablelist> + <varlistentry id="guc-max-foreign-transaction-resolvers" xreflabel="max_foreign_transaction_resolvers"> + <term><varname>max_foreign_transaction_resolvers</varname> (<type>int</type>) + <indexterm> + <primary><varname>max_foreign_transaction_resolvers</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies the maximum number of foreign transaction resolution workers. A foreign transaction + resolver is responsible for foreign transaction resolution on one database. + </para> + <para> + Foreign transaction resolution workers are taken from the pool defined by + <varname>max_worker_processes</varname>. + </para> + <para> + The default value is 0. + </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-foreign-transaction-resolution-retry-interval" xreflabel="foreign_transaction_resolution_retry_interval"> + <term><varname>foreign_transaction_resolution_retry_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>foreign_transaction_resolution_retry_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies how long the foreign transaction resolver should wait after the last resolution + fails before retrying to resolve foreign transactions. This parameter can only be set in the + <filename>postgresql.conf</filename> file or on the server command line. + </para> + <para> + The default value is 10 seconds. 
+ </para> + </listitem> + </varlistentry> + + <varlistentry id="guc-foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout"> + <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>) + <indexterm> + <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Terminates foreign transaction resolver processes that have had no foreign + transactions to resolve for longer than the specified number of milliseconds. + A value of zero disables the timeout mechanism, meaning a resolver stays connected to one + database until stopped manually. This parameter can only be set in the + <filename>postgresql.conf</filename> file or on the server command line. + </para> + <para> + The default value is 60 seconds. + </para> + </listitem> + </varlistentry> + </variablelist> + </sect2> + </sect1> + <sect1 id="runtime-config-compatible"> <title>Version and Platform Compatibility</title> diff --git a/doc/src/sgml/distributed-transaction.sgml b/doc/src/sgml/distributed-transaction.sgml new file mode 100644 index 0000000000..350b1afe68 --- /dev/null +++ b/doc/src/sgml/distributed-transaction.sgml @@ -0,0 +1,158 @@ +<!-- doc/src/sgml/distributed-transaction.sgml --> + +<chapter id="distributed-transaction"> + <title>Distributed Transaction</title> + + <para> + A distributed transaction is a transaction in which two or more network hosts + are involved. <productname>PostgreSQL</productname>'s global transaction + manager supports distributed transactions that access foreign servers using + Foreign Data Wrappers. The global transaction manager is responsible for + managing transactions on foreign servers. + </para> + + <sect1 id="atomic-commit"> + <title>Atomic Commit</title> + + <para> + Atomic commit of a distributed transaction is an operation that applies a set + of changes as a single operation globally. 
This guarantees all-or-nothing + results for the changes on all remote hosts involved. + <productname>PostgreSQL</productname> provides a way to perform read-write + transactions with foreign resources using foreign data wrappers. + Using <productname>PostgreSQL</productname>'s atomic commit ensures that + all changes on foreign servers end in either commit or rollback using the + transaction callback routines + (see <xref linkend="fdw-callbacks-transaction-managements"/>). + </para> + + <sect2> + <title>Atomic Commit Using Two-phase Commit Protocol</title> + + <para> + To achieve commit among all foreign servers atomically, + <productname>PostgreSQL</productname> employs the two-phase commit protocol, + which is a type of atomic commitment protocol (ACP). + A <productname>PostgreSQL</productname> server that receives the SQL is called the + <firstterm>coordinator node</firstterm>, which is responsible for coordinating + all the participating transactions. Using the two-phase commit protocol, the commit + sequence of a distributed transaction performs the following steps. + <orderedlist> + <listitem> + <para> + Prepare all transactions on foreign servers. + </para> + </listitem> + <listitem> + <para> + Commit locally. + </para> + </listitem> + <listitem> + <para> + Resolve all prepared transactions on foreign servers. + </para> + </listitem> + </orderedlist> + + </para> + + <para> + At the first step, the <productname>PostgreSQL</productname> distributed + transaction manager prepares all transactions on the foreign servers if + two-phase commit is required. Two-phase commit is required when the + transaction modifies data on two or more servers including the local server + itself and <xref linkend="guc-foreign-twophase-commit"/> is + <literal>required</literal> or <literal>prefer</literal>. If all preparations + on foreign servers succeed, it goes to the next step. 
If any failure happens + in this step, <productname>PostgreSQL</productname> switches to rollback and + rolls back all transactions on both local and foreign servers. + </para> + + <para> + At the local commit step, <productname>PostgreSQL</productname> commits the + transaction locally. If any failure happens in this step, + <productname>PostgreSQL</productname> switches to rollback and rolls back all + transactions on both local and foreign servers. + </para> + + <para> + At the final step, prepared transactions are resolved by a foreign transaction + resolver process. + </para> + </sect2> + + <sect2 id="atomic-commit-transaction-resolution"> + <title>Foreign Transaction Resolver Processes</title> + + <para> + Foreign transaction resolver processes are auxiliary processes that are + responsible for foreign transaction resolution. They commit or roll back all + prepared transactions on foreign servers if the coordinator received agreement + messages from all foreign servers during the first step. + </para> + + <para> + One foreign transaction resolver is responsible for transaction resolutions + on one database on the coordinator side. On failure during resolution, it + retries resolution at intervals of + <varname>foreign_transaction_resolution_retry_interval</varname>. + </para> + + <note> + <para> + While a foreign transaction resolver process is connected to a database, that + database cannot be dropped. To drop the database, call the + <function>pg_stop_foreign_xact_resolver</function> function before dropping + the database. + </para> + </note> + </sect2> + + <sect2 id="atomic-commit-in-doubt-transaction"> + <title>Manual Resolution of In-Doubt Transactions</title> + + <para> + The atomic commit mechanism ensures that all foreign servers either commit + or roll back using the two-phase commit protocol. 
However, distributed transactions + become <firstterm>in-doubt</firstterm> in three cases: when the foreign + server crashed or became unreachable while the foreign transaction was being + prepared, when the coordinator node crashed while either preparing or + resolving the distributed transaction, and when the user canceled the query. You can + check for in-doubt transactions in the <xref linkend="pg-stat-foreign-xact-view"/> + view. These foreign transactions need to be resolved by using the + <function>pg_resolve_foreign_xact</function> function. + <productname>PostgreSQL</productname> doesn't have facilities to automatically + resolve in-doubt transactions. This behavior might change in a future release. + </para> + </sect2> + + <sect2 id="atomic-commit-monitoring"> + <title>Monitoring</title> + <para> + The monitoring information about foreign transaction resolvers is visible in + the <link linkend="pg-stat-foreign-xact-view"><literal>pg_stat_foreign_xact</literal></link> + view. This view contains one row for every foreign transaction resolver worker. + </para> + </sect2> + + <sect2> + <title>Configuration Settings</title> + + <para> + Atomic commit requires several configuration options to be set. + </para> + + <para> + On the coordinator side, <xref linkend="guc-max-prepared-foreign-transactions"/> and + <xref linkend="guc-max-foreign-transaction-resolvers"/> must be set to non-zero values. + Additionally, <varname>max_worker_processes</varname> may need to be adjusted to + accommodate foreign transaction resolver workers, at least + (<varname>max_foreign_transaction_resolvers</varname> + <literal>1</literal>). + Note that some extensions and parallel queries also take worker slots from + <varname>max_worker_processes</varname>. 
+ </para> + + </sect2> + </sect1> +</chapter> diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml index 6587678af2..dd0358ef22 100644 --- a/doc/src/sgml/fdwhandler.sgml +++ b/doc/src/sgml/fdwhandler.sgml @@ -1415,6 +1415,127 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private, </para> </sect2> + <sect2 id="fdw-callbacks-transaction-managements"> + <title>FDW Routines for Transaction Management</title> + + <para> + Transaction management callbacks are used to commit, roll back, and + prepare foreign transactions. If an FDW wishes its foreign + transactions to be managed by <productname>PostgreSQL</productname>'s global + transaction manager, it must provide both + <function>CommitForeignTransaction</function> and + <function>RollbackForeignTransaction</function>. In addition, if an FDW + wishes to support <firstterm>atomic commit</firstterm> (as described in + <xref linkend="fdw-transaction-managements"/>), it must provide + <function>PrepareForeignTransaction</function> as well and can optionally + provide the <function>GetPrepareId</function> callback. + </para> + + <para> +<programlisting> +void +PrepareForeignTransaction(FdwXactRslvState *frstate); +</programlisting> + Prepare the transaction on the foreign server. This function is called at the + pre-commit phase of the local transaction if foreign two-phase commit is + required. This function is used only for distributed transaction management + (see <xref linkend="distributed-transaction"/>). + </para> + + <para> + Note that this callback function is always executed by backend processes. + </para> + <para> +<programlisting> +bool +CommitForeignTransaction(FdwXactRslvState *frstate); +</programlisting> + Commit the foreign transaction. This function is called either at + the pre-commit phase of the local transaction if the transaction + can be committed in one phase, or at the post-commit phase if + two-phase commit is required. 
If <literal>frstate->flag</literal> has + the flag <literal>FDW_XACT_FLAG_ONEPHASE</literal>, the transaction + can be committed in one phase; otherwise this function must commit the prepared + transaction identified by <literal>frstate->fdwxact_id</literal>. + </para> + + <para> + The foreign transaction identified by <literal>frstate->fdwxact_id</literal> + might not exist on the foreign servers. This can happen when, for instance, + the <productname>PostgreSQL</productname> server crashed while preparing or + committing the foreign transaction. Therefore, this function needs to + tolerate the undefined object error + (<literal>ERRCODE_UNDEFINED_OBJECT</literal>) rather than raising an error. + </para> + + <para> + Note that in all cases except for calls via the <function>pg_resolve_fdwxact</function> + SQL function, this callback function is executed by foreign transaction + resolver processes. + </para> + <para> +<programlisting> +bool +RollbackForeignTransaction(FdwXactRslvState *frstate); +</programlisting> + Roll back the foreign transaction. This function is called at + the end of the local transaction after it has been rolled back locally. Foreign + transactions are rolled back when the user requests a rollback or when + any error occurs during the transaction. This function must tolerate + being called recursively if any error occurs while rolling back the foreign + transaction, so you would need to track recursion and prevent infinite + recursion. If <literal>frstate->flag</literal> has the flag + <literal>FDW_XACT_FLAG_ONEPHASE</literal>, the transaction can be rolled + back in one phase; otherwise this function must roll back the prepared + transaction identified by <literal>frstate->fdwxact_id</literal>. + </para> + + <para> + The foreign transaction identified by <literal>frstate->fdwxact_id</literal> + might not exist on the foreign servers. 
This can happen when, for instance, + the <productname>PostgreSQL</productname> server crashed while preparing or + committing the foreign transaction. Therefore, this function needs to + tolerate the undefined object error + (<literal>ERRCODE_UNDEFINED_OBJECT</literal>) rather than raising an error. + </para> + + <para> + Note that in all cases except for calls via the <function>pg_resolve_fdwxact</function> + SQL function, this callback function is executed by foreign transaction + resolver processes. + </para> + <para> +<programlisting> +char * +GetPrepareId(TransactionId xid, Oid serverid, Oid userid, int *prep_id_len); +</programlisting> + Return a null-terminated string that represents the prepared transaction + identifier, setting its length in <varname>*prep_id_len</varname>. + This optional function is called during executor startup, once per + foreign server. Note that the transaction identifier must be a string literal, + less than <symbol>NAMEDATALEN</symbol> bytes long, and must not be the same + as any other concurrent prepared transaction id. If this callback routine + is not provided, <productname>PostgreSQL</productname>'s distributed + transaction manager generates a unique identifier in the form of + <literal>fx_<random value up to 2<superscript>31</superscript>>_<server oid>_<user oid></literal>. + </para> + + <para> + Note that this callback function is always executed by backend processes. + </para> + + <note> + <para> + The functions <function>PrepareForeignTransaction</function>, + <function>CommitForeignTransaction</function> and + <function>RollbackForeignTransaction</function> are called + outside of a valid transaction state, so please note that + you cannot use functions that rely on the system catalog cache, + such as the Foreign Data Wrapper helper functions described in + <xref linkend="fdw-helpers"/>. 
+ </para> + </note> + </sect2> </sect1> <sect1 id="fdw-helpers"> @@ -1894,4 +2015,119 @@ GetForeignServerByName(const char *name, bool missing_ok); </sect1> + <sect1 id="fdw-transaction-managements"> + <title>Transaction Management for Foreign Data Wrappers</title> + <para> + If an FDW's remote server supports transactions, it is usually worthwhile for the + FDW to manage the transactions opened on the foreign server. The FDW callback + functions <literal>CommitForeignTransaction</literal>, + <literal>RollbackForeignTransaction</literal> and + <literal>PrepareForeignTransaction</literal> are used to manage transactions + and must fit into the workings of + <productname>PostgreSQL</productname> transaction processing. + </para> + + <para> + The information in <literal>FdwXactRslvState</literal> can be used to get + information about the foreign server being processed, such as the server name + and the OIDs of the server, user and user mapping. The <literal>flags</literal> + field contains flag bits describing the foreign transaction state for + transaction management. + </para> + + <sect2 id="fdw-transaction-commit-rollback"> + <title>Commit and Rollback of a Single Foreign Transaction</title> + <para> + The FDW callback functions <literal>CommitForeignTransaction</literal> + and <literal>RollbackForeignTransaction</literal> can be used to commit + and roll back the foreign transaction. During transaction processing, the core + transaction manager calls the <literal>CommitForeignTransaction</literal> function + in the pre-commit phase and the + <literal>RollbackForeignTransaction</literal> function in the post-rollback + phase. 
+ </para> + </sect2> + + <sect2 id="fdw-transaction-distributed-transaction-commit"> + <title>Atomic Commit and Rollback of a Distributed Transaction</title> + <para> + In addition to simple commit and rollback of foreign transactions as described in + <xref linkend="fdw-transaction-commit-rollback"/>, the + <productname>PostgreSQL</productname> global transaction manager enables + distributed transactions to commit and roll back atomically among all foreign + servers, which is known as atomic commit in the literature. To achieve atomic + commit, <productname>PostgreSQL</productname> employs the two-phase commit + protocol, which is a type of atomic commitment protocol. Every FDW that wishes + to support the two-phase commit protocol is required to provide the FDW callback + function <function>PrepareForeignTransaction</function> and optionally + <function>GetPrepareId</function>, in addition to + <function>CommitForeignTransaction</function> and + <function>RollbackForeignTransaction</function> + (see <xref linkend="fdw-callbacks-transaction-managements"/> for details). + </para> + + <para> + An example of a distributed transaction is as follows: +<programlisting> +BEGIN; +UPDATE ft1 SET col = 'a'; +UPDATE ft2 SET col = 'b'; +COMMIT; +</programlisting> + ft1 and ft2 are foreign tables on different foreign servers, possibly using different + Foreign Data Wrappers. + </para> + + <para> + When the core executor accesses the foreign servers, each foreign server whose FDW + supports the transaction management callback routines is registered as a participant. + During registration, <function>GetPrepareId</function> is called, if provided, to + generate a unique transaction identifier. + </para> + + <para> + During the pre-commit phase of the local transaction, the foreign transaction manager + persists the foreign transaction information to disk and WAL, and then + prepares all foreign transactions by calling + <function>PrepareForeignTransaction</function> if the two-phase commit protocol + is required. 
Two-phase commit is required when the transaction modified data + on more than one server including the local server itself and the user requests + foreign two-phase commit (see <xref linkend="guc-foreign-twophase-commit"/>). + </para> + + <para> + <productname>PostgreSQL</productname> can commit locally and go to the next + step if and only if all foreign transactions are prepared successfully. + If any failure happens or the user requests cancellation during preparation, + the distributed transaction manager switches over to rollback and calls + <function>RollbackForeignTransaction</function>. + </para> + + <para> + Note that when <literal>(frstate->flags & FDWXACT_FLAG_ONEPHASE)</literal> + is true, both the <literal>CommitForeignTransaction</literal> function and + the <literal>RollbackForeignTransaction</literal> function should commit or + roll back directly, rather than processing prepared transactions. This can + happen when two-phase commit is not required or the foreign server was not + modified within the transaction. + </para> + + <para> + Once all foreign transactions are prepared, the core transaction manager commits + locally. After that, the transaction commit waits for all prepared foreign + transactions to be committed; once all prepared foreign + transactions are resolved, the transaction commit completes. + </para> + + <para> + One foreign transaction resolver process is responsible for foreign + transaction resolution on a database. The foreign transaction resolver process + calls either <function>CommitForeignTransaction</function> or + <function>RollbackForeignTransaction</function> to resolve the foreign + transaction identified by <literal>frstate->fdwxact_id</literal>. If it fails + to resolve, the resolver process will exit with an error message. The foreign + transaction launcher will launch the resolver process again at the + <xref linkend="guc-foreign-transaction-resolution-rety-interval"/> interval. 
+ </para> + </sect2> + </sect1> </chapter> diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index 3da2365ea9..80a87fa5d1 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -48,6 +48,7 @@ <!ENTITY wal SYSTEM "wal.sgml"> <!ENTITY logical-replication SYSTEM "logical-replication.sgml"> <!ENTITY jit SYSTEM "jit.sgml"> +<!ENTITY distributed-transaction SYSTEM "distributed-transaction.sgml"> <!-- programmer's guide --> <!ENTITY bgworker SYSTEM "bgworker.sgml"> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 57a1539506..b9a918b9ee 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -22355,6 +22355,95 @@ SELECT (pg_stat_file('filename')).modification; </sect2> + <sect2 id="functions-foreign-transaction"> + <title>Foreign Transaction Management Functions</title> + + <indexterm> + <primary>pg_resolve_foreign_xact</primary> + </indexterm> + <indexterm> + <primary>pg_remove_foreign_xact</primary> + </indexterm> + + <para> + <xref linkend="functions-fdw-transaction-control-table"/> shows the functions + available for foreign transaction management. + These functions cannot be executed during recovery. Use of these functions + is restricted to superusers. + </para> + + <table id="functions-fdw-transaction-control-table"> + <title>Foreign Transaction Management Functions</title> + <tgroup cols="3"> + <thead> + <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row> + </thead> + + <tbody> + <row> + <entry> + <literal><function>pg_resolve_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal> + </entry> + <entry><type>bool</type></entry> + <entry> + Resolve a foreign transaction. This function searches for the foreign + transaction matching the arguments and resolves it. 
Once the foreign + transaction is resolved successfully, this function removes the + corresponding entry from <xref linkend="view-pg-foreign-xacts"/>. + This function won't resolve a foreign transaction which is currently being + processed. + </entry> + </row> + <row> + <entry> + <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal> + </entry> + <entry><type>void</type></entry> + <entry> + This function works the same as <function>pg_resolve_foreign_xact</function> + except that it removes the foreign transaction entry without resolution. + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + The function shown in <xref linkend="functions-fdwxact-resolver-control-table"/> + controls the foreign transaction resolvers. + </para> + + <table id="functions-fdwxact-resolver-control-table"> + <title>Foreign Transaction Resolver Control Functions</title> + <tgroup cols="3"> + <thead> + <row> + <entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry> + </row> + </thead> + + <tbody> + <row> + <entry> + <literal><function>pg_stop_fdwxact_resolver(<parameter>dbid</parameter> <type>oid</type>)</function></literal> + </entry> + <entry><type>bool</type></entry> + <entry> + Stop the foreign transaction resolver running on the given database. + This function is useful for stopping a resolver process on a database + that you want to drop. + </entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + <function>pg_stop_fdwxact_resolver</function> is useful before + dropping a database to which a foreign transaction resolver is connected. 
+ </para> + + </sect2> </sect1> <sect1 id="functions-trigger"> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index a3c5f86b7e..65938e81ca 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -368,6 +368,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser </entry> </row> + <row> + <entry><structname>pg_stat_foreign_xact</structname><indexterm><primary>pg_stat_fdw_xact_resolver</primary></indexterm></entry> + <entry>One row per foreign transaction resolver process, showing statistics about + foreign transaction resolution. See <xref linkend="pg-stat-foreign-xact-view"/> for + details. + </entry> + </row> + </tbody> </tgroup> </table> @@ -1236,6 +1244,18 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser <entry><literal>CheckpointerMain</literal></entry> <entry>Waiting in main loop of checkpointer process.</entry> </row> + <row> + <entry><literal>FdwXactLauncherMain</literal></entry> + <entry>Waiting in main loop of foreign transaction resolution launcher process.</entry> + </row> + <row> + <entry><literal>FdwXactResolverMain</literal></entry> + <entry>Waiting in main loop of foreign transaction resolution worker process.</entry> + </row> + <row> + <entry><literal>LogicalLauncherMain</literal></entry> + <entry>Waiting in main loop of logical launcher process.</entry> + </row> <row> <entry><literal>LogicalApplyMain</literal></entry> <entry>Waiting in main loop of logical apply process.</entry> @@ -1459,6 +1479,10 @@ postgres 27093 0.0 0.0 30096 2752 ? 
Ss 11:34 0:00 postgres: ser <entry><literal>SyncRep</literal></entry> <entry>Waiting for confirmation from remote server during synchronous replication.</entry> </row> + <row> + <entry><literal>FdwXactResolution</literal></entry> + <entry>Waiting for all foreign transaction participants to be resolved during atomic commit among foreign servers.</entry> + </row> <row> <entry morerows="2"><literal>Timeout</literal></entry> <entry><literal>BaseBackupThrottle</literal></entry> @@ -2359,6 +2383,42 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i connection. </para> + <table id="pg-stat-foreign-xact-view" xreflabel="pg_stat_foreign_xact"> + <title><structname>pg_stat_foreign_xact</structname> View</title> + <tgroup cols="3"> + <thead> + <row> + <entry>Column</entry> + <entry>Type</entry> + <entry>Description</entry> + </row> + </thead> + + <tbody> + <row> + <entry><structfield>pid</structfield></entry> + <entry><type>integer</type></entry> + <entry>Process ID of a foreign transaction resolver process</entry> + </row> + <row> + <entry><structfield>dbid</structfield></entry> + <entry><type>oid</type></entry> + <entry>OID of the database to which the foreign transaction resolver is connected</entry> + </row> + <row> + <entry><structfield>last_resolved_time</structfield></entry> + <entry><type>timestamp with time zone</type></entry> + <entry>Time at which the process last resolved a foreign transaction</entry> + </row> + </tbody> + </tgroup> + </table> + + <para> + The <structname>pg_stat_foreign_xact</structname> view will contain one + row per foreign transaction resolver process, showing the state of resolution + of foreign transactions. 
+ </para> <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver"> <title><structname>pg_stat_archiver</structname> View</title> diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml index e59cba7997..dee3f72f7e 100644 --- a/doc/src/sgml/postgres.sgml +++ b/doc/src/sgml/postgres.sgml @@ -163,6 +163,7 @@ &wal; &logical-replication; &jit; + &distributed-transaction; ®ress; </part> diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml index 1c19e863d2..3f4c806ed1 100644 --- a/doc/src/sgml/storage.sgml +++ b/doc/src/sgml/storage.sgml @@ -83,6 +83,12 @@ Item subsystem</entry> </row> +<row> + <entry><filename>pg_fdwxact</filename></entry> + <entry>Subdirectory containing files used by the distributed transaction + manager subsystem</entry> +</row> + <row> <entry><filename>pg_logical</filename></entry> <entry>Subdirectory containing status data for logical decoding</entry> -- 2.23.0 From 84f81fdcb2bd823e34edba79c81c29871d7906fb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Thu, 5 Dec 2019 17:01:15 +0900 Subject: [PATCH v26 4/5] postgres_fdw supports atomic commit APIs. 
Original Author: Masahiko Sawada <sawada.mshk@gmail.com> --- contrib/postgres_fdw/Makefile | 7 +- contrib/postgres_fdw/connection.c | 604 +++++++++++------- .../postgres_fdw/expected/postgres_fdw.out | 265 +++++++- contrib/postgres_fdw/fdwxact.conf | 3 + contrib/postgres_fdw/postgres_fdw.c | 21 +- contrib/postgres_fdw/postgres_fdw.h | 7 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 120 +++- doc/src/sgml/postgres-fdw.sgml | 45 ++ 8 files changed, 822 insertions(+), 250 deletions(-) create mode 100644 contrib/postgres_fdw/fdwxact.conf diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile index ee8a80a392..91fa6e39fc 100644 --- a/contrib/postgres_fdw/Makefile +++ b/contrib/postgres_fdw/Makefile @@ -16,7 +16,7 @@ SHLIB_LINK_INTERNAL = $(libpq) EXTENSION = postgres_fdw DATA = postgres_fdw--1.0.sql -REGRESS = postgres_fdw +REGRESSCHECK = postgres_fdw ifdef USE_PGXS PG_CONFIG = pg_config @@ -29,3 +29,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +check: + $(pg_regress_check) \ + --temp-config $(top_srcdir)/contrib/postgres_fdw/fdwxact.conf \ + $(REGRESSCHECK) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 27b86a03f8..0b07e6c5cc 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -1,7 +1,7 @@ /*------------------------------------------------------------------------- * * connection.c - * Connection management functions for postgres_fdw + * Connection and transaction management functions for postgres_fdw * * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group * @@ -12,6 +12,7 @@ */ #include "postgres.h" +#include "access/fdwxact.h" #include "access/htup_details.h" #include "access/xact.h" #include "catalog/pg_user_mapping.h" @@ -54,6 +55,7 @@ typedef struct ConnCacheEntry bool have_error; /* have any subxacts aborted in this xact? 
*/ bool changing_xact_state; /* xact state change in process */ bool invalidated; /* true if reconnect is pending */ + bool xact_got_connection; uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ } ConnCacheEntry; @@ -67,17 +69,13 @@ static HTAB *ConnectionHash = NULL; static unsigned int cursor_number = 0; static unsigned int prep_stmt_number = 0; -/* tracks whether any work is needed in callback functions */ -static bool xact_got_connection = false; - /* prototypes of private functions */ static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user); static void disconnect_pg_server(ConnCacheEntry *entry); static void check_conn_params(const char **keywords, const char **values, UserMapping *user); static void configure_remote_session(PGconn *conn); static void do_sql_command(PGconn *conn, const char *sql); -static void begin_remote_xact(ConnCacheEntry *entry); -static void pgfdw_xact_callback(XactEvent event, void *arg); +static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid); static void pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid, SubTransactionId parentSubid, @@ -89,24 +87,26 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query, bool ignore_errors); static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime, PGresult **result); - +static void pgfdw_end_prepared_xact(ConnCacheEntry *entry, char *fdwxact_id, + bool is_commit); +static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry); +static ConnCacheEntry *GetConnectionState(Oid umid, bool will_prep_stmt, + bool start_transaction); +static ConnCacheEntry *GetConnectionCacheEntry(Oid umid); /* - * Get a PGconn which can be used to execute queries on the remote PostgreSQL - * server with the user's authorization. 
A new connection is established - if we don't already have a suitable one, and a transaction is opened at - the right subtransaction nesting depth if we didn't do that already. - * - * will_prep_stmt must be true if caller intends to create any prepared - * statements. Since those don't go away automatically at transaction end - * (not even on error), we need this flag to cue manual cleanup. + * Get connection cache entry. Unlike the GetConnectionState function, this function + * doesn't establish a new connection if there isn't one yet. */ -PGconn * -GetConnection(UserMapping *user, bool will_prep_stmt) +static ConnCacheEntry * +GetConnectionCacheEntry(Oid umid) { - bool found; ConnCacheEntry *entry; - ConnCacheKey key; + ConnCacheKey key; + bool found; + + /* Create hash key for the entry. Assume no pad bytes in key struct */ + key = umid; /* First time through, initialize connection cache hashtable */ if (ConnectionHash == NULL) @@ -126,7 +126,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt) * Register some callback functions that manage connection cleanup. * This should be done just once in each backend. */ - RegisterXactCallback(pgfdw_xact_callback, NULL); RegisterSubXactCallback(pgfdw_subxact_callback, NULL); CacheRegisterSyscacheCallback(FOREIGNSERVEROID, pgfdw_inval_callback, (Datum) 0); @@ -134,12 +133,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt) pgfdw_inval_callback, (Datum) 0); } - /* Set flag that we did GetConnection during the current transaction */ - xact_got_connection = true; - - /* Create hash key for the entry. Assume no pad bytes in key struct */ - key = user->umid; - /* * Find or create cached entry for requested connection. 
*/ @@ -153,6 +146,21 @@ GetConnection(UserMapping *user, bool will_prep_stmt) entry->conn = NULL; } + return entry; +} + +/* + * This function gets the connection cache entry, establishes a connection + * to the foreign server if there is none, and starts a new transaction + * if 'start_transaction' is true. + */ +static ConnCacheEntry * +GetConnectionState(Oid umid, bool will_prep_stmt, bool start_transaction) +{ + ConnCacheEntry *entry; + + entry = GetConnectionCacheEntry(umid); + /* Reject further use of connections which failed abort cleanup. */ pgfdw_reject_incomplete_xact_state_change(entry); @@ -180,6 +188,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) */ if (entry->conn == NULL) { + UserMapping *user = GetUserMappingByOid(umid); ForeignServer *server = GetForeignServer(user->serverid); /* Reset all transient state fields, to be sure all are clean */ @@ -188,6 +197,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt) entry->have_error = false; entry->changing_xact_state = false; entry->invalidated = false; + entry->xact_got_connection = false; entry->server_hashvalue = GetSysCacheHashValue1(FOREIGNSERVEROID, ObjectIdGetDatum(server->serverid)); @@ -198,6 +208,15 @@ GetConnection(UserMapping *user, bool will_prep_stmt) /* Now try to make the connection */ entry->conn = connect_pg_server(server, user); + Assert(entry->conn); + + if (!entry->conn) + { + elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed", + server->servername); + return NULL; + } + elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)", entry->conn, server->servername, user->umid, user->userid); } @@ -205,11 +224,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt) /* * Start a new transaction or subtransaction if needed. 
*/ - begin_remote_xact(entry); + if (start_transaction) + { + UserMapping *user = GetUserMappingByOid(umid); + + begin_remote_xact(entry, user->serverid, user->userid); + + /* Set flag that we did GetConnection during the current transaction */ + entry->xact_got_connection = true; + } /* Remember if caller will prepare statements */ entry->have_prep_stmt |= will_prep_stmt; + return entry; +} + +/* + * Get a PGconn which can be used to execute queries on the remote PostgreSQL + * server with the user's authorization. A new connection is established + * if we don't already have a suitable one, and a transaction is opened at + * the right subtransaction nesting depth if we didn't do that already. + * + * will_prep_stmt must be true if caller intends to create any prepared + * statements. Since those don't go away automatically at transaction end + * (not even on error), we need this flag to cue manual cleanup. + */ +PGconn * +GetConnection(Oid umid, bool will_prep_stmt, bool start_transaction) +{ + ConnCacheEntry *entry; + + entry = GetConnectionState(umid, will_prep_stmt, start_transaction); + return entry->conn; } @@ -412,7 +459,7 @@ do_sql_command(PGconn *conn, const char *sql) * control which remote queries share a snapshot. */ static void -begin_remote_xact(ConnCacheEntry *entry) +begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid) { int curlevel = GetCurrentTransactionNestLevel(); @@ -639,193 +686,6 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn, PG_END_TRY(); } -/* - * pgfdw_xact_callback --- cleanup at main-transaction end. - */ -static void -pgfdw_xact_callback(XactEvent event, void *arg) -{ - HASH_SEQ_STATUS scan; - ConnCacheEntry *entry; - - /* Quick exit if no connections were touched in this transaction. */ - if (!xact_got_connection) - return; - - /* - * Scan all connection cache entries to find open remote transactions, and - * close them. 
- */ - hash_seq_init(&scan, ConnectionHash); - while ((entry = (ConnCacheEntry *) hash_seq_search(&scan))) - { - PGresult *res; - - /* Ignore cache entry if no open connection right now */ - if (entry->conn == NULL) - continue; - - /* If it has an open remote transaction, try to close it */ - if (entry->xact_depth > 0) - { - bool abort_cleanup_failure = false; - - elog(DEBUG3, "closing remote transaction on connection %p", - entry->conn); - - switch (event) - { - case XACT_EVENT_PARALLEL_PRE_COMMIT: - case XACT_EVENT_PRE_COMMIT: - - /* - * If abort cleanup previously failed for this connection, - * we can't issue any more commands against it. - */ - pgfdw_reject_incomplete_xact_state_change(entry); - - /* Commit all remote transactions during pre-commit */ - entry->changing_xact_state = true; - do_sql_command(entry->conn, "COMMIT TRANSACTION"); - entry->changing_xact_state = false; - - /* - * If there were any errors in subtransactions, and we - * made prepared statements, do a DEALLOCATE ALL to make - * sure we get rid of all prepared statements. This is - * annoying and not terribly bulletproof, but it's - * probably not worth trying harder. - * - * DEALLOCATE ALL only exists in 8.3 and later, so this - * constrains how old a server postgres_fdw can - * communicate with. We intentionally ignore errors in - * the DEALLOCATE, so that we can hobble along to some - * extent with older servers (leaking prepared statements - * as we go; but we don't really support update operations - * pre-8.3 anyway). - */ - if (entry->have_prep_stmt && entry->have_error) - { - res = PQexec(entry->conn, "DEALLOCATE ALL"); - PQclear(res); - } - entry->have_prep_stmt = false; - entry->have_error = false; - break; - case XACT_EVENT_PRE_PREPARE: - - /* - * We disallow any remote transactions, since it's not - * very reasonable to hold them open until the prepared - * transaction is committed. For the moment, throw error - * unconditionally; later we might allow read-only cases. 
- * Note that the error will cause us to come right back - * here with event == XACT_EVENT_ABORT, so we'll clean up - * the connection state at that point. - */ - ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot PREPARE a transaction that has operated on postgres_fdw foreign tables"))); - break; - case XACT_EVENT_PARALLEL_COMMIT: - case XACT_EVENT_COMMIT: - case XACT_EVENT_PREPARE: - /* Pre-commit should have closed the open transaction */ - elog(ERROR, "missed cleaning up connection during pre-commit"); - break; - case XACT_EVENT_PARALLEL_ABORT: - case XACT_EVENT_ABORT: - - /* - * Don't try to clean up the connection if we're already - * in error recursion trouble. - */ - if (in_error_recursion_trouble()) - entry->changing_xact_state = true; - - /* - * If connection is already unsalvageable, don't touch it - * further. - */ - if (entry->changing_xact_state) - break; - - /* - * Mark this connection as in the process of changing - * transaction state. - */ - entry->changing_xact_state = true; - - /* Assume we might have lost track of prepared statements */ - entry->have_error = true; - - /* - * If a command has been submitted to the remote server by - * using an asynchronous execution function, the command - * might not have yet completed. Check to see if a - * command is still being processed by the remote server, - * and if so, request cancellation of the command. - */ - if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE && - !pgfdw_cancel_query(entry->conn)) - { - /* Unable to cancel running query. */ - abort_cleanup_failure = true; - } - else if (!pgfdw_exec_cleanup_query(entry->conn, - "ABORT TRANSACTION", - false)) - { - /* Unable to abort remote transaction. */ - abort_cleanup_failure = true; - } - else if (entry->have_prep_stmt && entry->have_error && - !pgfdw_exec_cleanup_query(entry->conn, - "DEALLOCATE ALL", - true)) - { - /* Trouble clearing prepared statements. 
*/ - abort_cleanup_failure = true; - } - else - { - entry->have_prep_stmt = false; - entry->have_error = false; - } - - /* Disarm changing_xact_state if it all worked. */ - entry->changing_xact_state = abort_cleanup_failure; - break; - } - } - - /* Reset state to show we're out of a transaction */ - entry->xact_depth = 0; - - /* - * If the connection isn't in a good idle state, discard it to - * recover. Next GetConnection will open a new connection. - */ - if (PQstatus(entry->conn) != CONNECTION_OK || - PQtransactionStatus(entry->conn) != PQTRANS_IDLE || - entry->changing_xact_state) - { - elog(DEBUG3, "discarding connection %p", entry->conn); - disconnect_pg_server(entry); - } - } - - /* - * Regardless of the event type, we can now mark ourselves as out of the - * transaction. (Note: if we are here during PRE_COMMIT or PRE_PREPARE, - * this saves a useless scan of the hashtable during COMMIT or PREPARE.) - */ - xact_got_connection = false; - - /* Also reset cursor numbering for next transaction */ - cursor_number = 0; -} - /* * pgfdw_subxact_callback --- cleanup at subtransaction end. */ @@ -842,10 +702,6 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid, event == SUBXACT_EVENT_ABORT_SUB)) return; - /* Quick exit if no connections were touched in this transaction. */ - if (!xact_got_connection) - return; - /* * Scan all connection cache entries to find open remote subtransactions * of the current level, and close them. @@ -856,6 +712,10 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid, { char sql[100]; + /* Quick exit if no connections were touched in this transaction. */ + if (!entry->xact_got_connection) + continue; + /* * We only care about connections with open remote subtransactions of * the current level. @@ -1190,3 +1050,309 @@ exit: ; *result = last_res; return timed_out; } + +/* + * Prepare a transaction on foreign server. 
+ */
+void
+postgresPrepareForeignTransaction(FdwXactRslvState *state)
+{
+	ConnCacheEntry *entry = NULL;
+	PGresult   *res;
+	StringInfo	command;
+
+	/* The transaction should have started already; get the cache entry */
+	entry = GetConnectionCacheEntry(state->usermapping->umid);
+
+	Assert(entry->xact_got_connection && entry->conn);
+
+	pgfdw_reject_incomplete_xact_state_change(entry);
+
+	command = makeStringInfo();
+	appendStringInfo(command, "PREPARE TRANSACTION '%s'", state->fdwxact_id);
+
+	/* Prepare the foreign transaction */
+	entry->changing_xact_state = true;
+	res = pgfdw_exec_query(entry->conn, command->data);
+	entry->changing_xact_state = false;
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		ereport(ERROR, (errmsg("could not prepare transaction on server %s with ID %s",
+							   state->server->servername, state->fdwxact_id)));
+
+	elog(DEBUG1, "prepared foreign transaction on server %s with ID %s",
+		 state->server->servername, state->fdwxact_id);
+
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
+	}
+
+	pgfdw_cleanup_after_transaction(entry);
+}
+
+/*
+ * Commit a transaction or a prepared transaction on the foreign server. If
+ * state->flags contains FDWXACT_FLAG_ONEPHASE, this function commits the
+ * foreign transaction without preparation; otherwise it commits the prepared
+ * transaction.
+ */
+void
+postgresCommitForeignTransaction(FdwXactRslvState *state)
+{
+	ConnCacheEntry *entry = NULL;
+	bool		is_onephase = (state->flags & FDWXACT_FLAG_ONEPHASE) != 0;
+	PGresult   *res;
+
+	if (!is_onephase)
+	{
+		/*
+		 * In the two-phase commit case, the foreign transaction has been
+		 * prepared and closed, so we might not have a connection to it.
+		 * Get a connection but don't start a transaction.
+		 */
+		entry = GetConnectionState(state->usermapping->umid, false, false);
+
+		/* COMMIT PREPARED the transaction */
+		pgfdw_end_prepared_xact(entry, state->fdwxact_id, true);
+		return;
+	}
+
+	/*
+	 * In the simple commit case, we must have a connection to the foreign
+	 * server because the foreign transaction is not closed yet.  Get the
+	 * connection entry from the cache.
+	 */
+	entry = GetConnectionCacheEntry(state->usermapping->umid);
+	Assert(entry);
+
+	if (!entry->conn || !entry->xact_got_connection)
+		return;
+
+	/*
+	 * If abort cleanup previously failed for this connection, we can't issue
+	 * any more commands against it.
+	 */
+	pgfdw_reject_incomplete_xact_state_change(entry);
+
+	entry->changing_xact_state = true;
+	res = pgfdw_exec_query(entry->conn, "COMMIT TRANSACTION");
+	entry->changing_xact_state = false;
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		ereport(ERROR, (errmsg("could not commit transaction on server %s",
+							   state->server->servername)));
+
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements.  This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.  We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go; but
+	 * we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
+	}
+
+	/* Cleanup transaction status */
+	pgfdw_cleanup_after_transaction(entry);
+}
+
+/*
+ * Rollback a transaction on the foreign server.
As with the commit case, if state->flags
+ * contains FDWXACT_FLAG_ONEPHASE this function rolls back the foreign
+ * transaction without preparation; otherwise it rolls back the prepared
+ * transaction.  This function must tolerate being called recursively,
+ * as an error can occur while aborting.
+ */
+void
+postgresRollbackForeignTransaction(FdwXactRslvState *state)
+{
+	bool		is_onephase = (state->flags & FDWXACT_FLAG_ONEPHASE) != 0;
+	ConnCacheEntry *entry = NULL;
+	bool		abort_cleanup_failure = false;
+
+	if (!is_onephase)
+	{
+		/*
+		 * In the two-phase commit case, the foreign transaction has been
+		 * prepared and closed, so we might not have a connection to it.
+		 * Get a connection but don't start a transaction.
+		 */
+		entry = GetConnectionState(state->usermapping->umid, false, false);
+
+		/* ROLLBACK PREPARED the transaction */
+		pgfdw_end_prepared_xact(entry, state->fdwxact_id, false);
+		return;
+	}
+
+	/*
+	 * In the simple rollback case, we must have a connection to the foreign
+	 * server because the foreign transaction is not closed yet.  Get the
+	 * connection entry from the cache.
+	 */
+	entry = GetConnectionCacheEntry(state->usermapping->umid);
+	Assert(entry);
+
+	/*
+	 * If the transaction failed before establishing a connection or starting
+	 * a transaction, just clean up the connection entry.
+	 */
+	if (!entry->conn || !entry->xact_got_connection)
+	{
+		pgfdw_cleanup_after_transaction(entry);
+		return;
+	}
+
+	/*
+	 * Don't try to clean up the connection if we're already in error
+	 * recursion trouble.
+	 */
+	if (in_error_recursion_trouble())
+		entry->changing_xact_state = true;
+
+	/*
+	 * If the connection hasn't started a transaction yet or is already
+	 * unsalvageable, do only the cleanup and don't touch it further.
+	 */
+	if (entry->changing_xact_state || !entry->xact_got_connection)
+	{
+		pgfdw_cleanup_after_transaction(entry);
+		return;
+	}
+
+	/*
+	 * Mark this connection as in the process of changing transaction
+	 * state.
+ */ + entry->changing_xact_state = true; + + /* Assume we might have lost track of prepared statements */ + entry->have_error = true; + + /* + * If a command has been submitted to the remote server by + * using an asynchronous execution function, the command + * might not have yet completed. Check to see if a + * command is still being processed by the remote server, + * and if so, request cancellation of the command. + */ + if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE && + !pgfdw_cancel_query(entry->conn)) + { + /* Unable to cancel running query. */ + abort_cleanup_failure = true; + } + else if (!pgfdw_exec_cleanup_query(entry->conn, + "ABORT TRANSACTION", + false)) + { + /* Unable to abort remote transaction. */ + abort_cleanup_failure = true; + } + else if (entry->have_prep_stmt && entry->have_error && + !pgfdw_exec_cleanup_query(entry->conn, + "DEALLOCATE ALL", + true)) + { + /* Trouble clearing prepared statements. */ + abort_cleanup_failure = true; + } + + /* Disarm changing_xact_state if it all worked. */ + entry->changing_xact_state = abort_cleanup_failure; + + /* Cleanup transaction status */ + pgfdw_cleanup_after_transaction(entry); + + return; +} + +/* + * Commit or rollback prepared transaction on the foreign server. + */ +static void +pgfdw_end_prepared_xact(ConnCacheEntry *entry, char *fdwxact_id, bool is_commit) +{ + StringInfo command; + PGresult *res; + + command = makeStringInfo(); + appendStringInfo(command, "%s PREPARED '%s'", + is_commit ? 
"COMMIT" : "ROLLBACK",
+					 fdwxact_id);
+
+	res = pgfdw_exec_query(entry->conn, command->data);
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	{
+		int			sqlstate;
+		char	   *diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+
+		if (diag_sqlstate)
+		{
+			sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+									 diag_sqlstate[1],
+									 diag_sqlstate[2],
+									 diag_sqlstate[3],
+									 diag_sqlstate[4]);
+		}
+		else
+			sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+		/*
+		 * As the core global transaction manager notes, it's possible that
+		 * the given foreign transaction doesn't exist on the foreign server,
+		 * so we should accept an UNDEFINED_OBJECT error.
+		 */
+		if (sqlstate != ERRCODE_UNDEFINED_OBJECT)
+			pgfdw_report_error(ERROR, res, entry->conn, false, command->data);
+	}
+
+	elog(DEBUG1, "%s prepared foreign transaction with ID %s",
+		 is_commit ? "commit" : "rollback",
+		 fdwxact_id);
+
+	/* Cleanup transaction status */
+	pgfdw_cleanup_after_transaction(entry);
+}
+
+/* Cleanup at main-transaction end */
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	entry->xact_got_connection = false;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover.  Next GetConnection will open a new connection.
+ */ + if (PQstatus(entry->conn) != CONNECTION_OK || + PQtransactionStatus(entry->conn) != PQTRANS_IDLE || + entry->changing_xact_state) + { + elog(DEBUG3, "discarding connection %p", entry->conn); + disconnect_pg_server(entry); + } + + entry->changing_xact_state = false; + + /* Also reset cursor numbering for next transaction */ + cursor_number = 0; +} diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 48282ab151..0ee91a49ac 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -13,12 +13,17 @@ DO $d$ OPTIONS (dbname '$$||current_database()||$$', port '$$||current_setting('port')||$$' )$$; + EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw + OPTIONS (dbname '$$||current_database()||$$', + port '$$||current_setting('port')||$$' + )$$; END; $d$; CREATE USER MAPPING FOR public SERVER testserver1 OPTIONS (user 'value', password 'value'); CREATE USER MAPPING FOR CURRENT_USER SERVER loopback; CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2; +CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3; -- =================================================================== -- create objects used through FDW loopback server -- =================================================================== @@ -52,6 +57,13 @@ CREATE TABLE "S 1"."T 4" ( c3 text, CONSTRAINT t4_pkey PRIMARY KEY (c1) ); +CREATE TABLE "S 1"."T 5" ( + c1 int NOT NULL +); +CREATE TABLE "S 1"."T 6" ( + c1 int NOT NULL, + CONSTRAINT t6_pkey PRIMARY KEY (c1) +); -- Disable autovacuum for these tables to avoid unexpected effects of that ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false'); ALTER TABLE "S 1"."T 2" SET (autovacuum_enabled = 'false'); @@ -87,6 +99,7 @@ ANALYZE "S 1"."T 1"; ANALYZE "S 1"."T 2"; ANALYZE "S 1"."T 3"; ANALYZE "S 1"."T 4"; +ANALYZE "S 1"."T 5"; -- =================================================================== -- create foreign tables 
-- =================================================================== @@ -129,6 +142,12 @@ CREATE FOREIGN TABLE ft6 ( c2 int NOT NULL, c3 text ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4'); +CREATE FOREIGN TABLE ft7_2pc ( + c1 int NOT NULL +) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5'); +CREATE FOREIGN TABLE ft8_2pc ( + c1 int NOT NULL +) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5'); -- =================================================================== -- tests for validator -- =================================================================== @@ -179,15 +198,17 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1'); ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1'); ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1'); \det+ - List of foreign tables - Schema | Table | Server | FDW options | Description ---------+-------+-----------+---------------------------------------+------------- - public | ft1 | loopback | (schema_name 'S 1', table_name 'T 1') | - public | ft2 | loopback | (schema_name 'S 1', table_name 'T 1') | - public | ft4 | loopback | (schema_name 'S 1', table_name 'T 3') | - public | ft5 | loopback | (schema_name 'S 1', table_name 'T 4') | - public | ft6 | loopback2 | (schema_name 'S 1', table_name 'T 4') | -(5 rows) + List of foreign tables + Schema | Table | Server | FDW options | Description +--------+---------+-----------+---------------------------------------+------------- + public | ft1 | loopback | (schema_name 'S 1', table_name 'T 1') | + public | ft2 | loopback | (schema_name 'S 1', table_name 'T 1') | + public | ft4 | loopback | (schema_name 'S 1', table_name 'T 3') | + public | ft5 | loopback | (schema_name 'S 1', table_name 'T 4') | + public | ft6 | loopback2 | (schema_name 'S 1', table_name 'T 4') | + public | ft7_2pc | loopback | (schema_name 'S 1', table_name 'T 5') | + public | ft8_2pc | loopback2 | (schema_name 'S 1', 
table_name 'T 5') | +(7 rows) -- Test that alteration of server options causes reconnection -- Remote's errors might be non-English, so hide them to ensure stable results @@ -8781,16 +8802,226 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 -- Clean-up RESET enable_partitionwise_aggregate; --- Two-phase transactions are not supported. + +-- =================================================================== +-- test distributed atomic commit across foreign servers +-- =================================================================== +-- Enable atomic commit +SET foreign_twophase_commit TO 'required'; +-- Modify single foreign server and then commit and rollback. +BEGIN; +INSERT INTO ft7_2pc VALUES(1); +COMMIT; +SELECT * FROM ft7_2pc; + c1 +---- + 1 +(1 row) + +BEGIN; +INSERT INTO ft7_2pc VALUES(1); +ROLLBACK; +SELECT * FROM ft7_2pc; + c1 +---- + 1 +(1 row) + +-- Modify two servers then commit and rollback. This requires to use 2PC. +BEGIN; +INSERT INTO ft7_2pc VALUES(2); +INSERT INTO ft8_2pc VALUES(2); +COMMIT; +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 +(3 rows) + +BEGIN; +INSERT INTO ft7_2pc VALUES(2); +INSERT INTO ft8_2pc VALUES(2); +ROLLBACK; +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 +(3 rows) + +-- Modify both local data and 2PC-capable server then commit and rollback. +-- This also requires to use 2PC. +BEGIN; +INSERT INTO ft7_2pc VALUES(3); +INSERT INTO "S 1"."T 6" VALUES (3); +COMMIT; +SELECT * FROM ft7_2pc; + c1 +---- + 1 + 2 + 2 + 3 +(4 rows) + +SELECT * FROM "S 1"."T 6"; + c1 +---- + 3 +(1 row) + BEGIN; -SELECT count(*) FROM ft1; +INSERT INTO ft7_2pc VALUES(3); +INSERT INTO "S 1"."T 6" VALUES (3); +ERROR: duplicate key value violates unique constraint "t6_pkey" +DETAIL: Key (c1)=(3) already exists. +ROLLBACK; +SELECT * FROM ft7_2pc; + c1 +---- + 1 + 2 + 2 + 3 +(4 rows) + +SELECT * FROM "S 1"."T 6"; + c1 +---- + 3 +(1 row) + +-- Modify foreign server and raise an error. No data changed. 
+BEGIN; +INSERT INTO ft7_2pc VALUES(4); +INSERT INTO ft8_2pc VALUES(NULL); -- violation +ERROR: null value in column "c1" violates not-null constraint +DETAIL: Failing row contains (null). +CONTEXT: remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1) +ROLLBACK; +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 + 3 +(4 rows) + +BEGIN; +INSERT INTO ft7_2pc VALUES (5); +INSERT INTO ft8_2pc VALUES (5); +SAVEPOINT S1; +INSERT INTO ft7_2pc VALUES (6); +INSERT INTO ft8_2pc VALUES (6); +ROLLBACK TO S1; +COMMIT; +SELECT * FROM ft7_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 +(6 rows) + +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 +(6 rows) + +RELEASE SAVEPOINT S1; +ERROR: RELEASE SAVEPOINT can only be used in transaction blocks +-- When set to 'disabled', we can commit it +SET foreign_twophase_commit TO 'disabled'; +BEGIN; +INSERT INTO ft7_2pc VALUES(8); +INSERT INTO ft8_2pc VALUES(8); +COMMIT; -- success +SELECT * FROM ft7_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 + 8 + 8 +(8 rows) + +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 + 8 + 8 +(8 rows) + +SET foreign_twophase_commit TO 'required'; +-- Commit and rollback foreign transactions that are part of +-- prepare transaction. 
+BEGIN; +INSERT INTO ft7_2pc VALUES(9); +INSERT INTO ft8_2pc VALUES(9); +PREPARE TRANSACTION 'gx1'; +COMMIT PREPARED 'gx1'; +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 + 8 + 8 + 9 + 9 +(10 rows) + +BEGIN; +INSERT INTO ft7_2pc VALUES(9); +INSERT INTO ft8_2pc VALUES(9); +PREPARE TRANSACTION 'gx1'; +ROLLBACK PREPARED 'gx1'; +SELECT * FROM ft8_2pc; + c1 +---- + 1 + 2 + 2 + 3 + 5 + 5 + 8 + 8 + 9 + 9 +(10 rows) + +-- No entry remained +SELECT count(*) FROM pg_foreign_xacts; count ------- - 822 + 0 (1 row) --- error here -PREPARE TRANSACTION 'fdw_tpc'; -ERROR: cannot PREPARE a transaction that has operated on postgres_fdw foreign tables -ROLLBACK; -WARNING: there is no transaction in progress diff --git a/contrib/postgres_fdw/fdwxact.conf b/contrib/postgres_fdw/fdwxact.conf new file mode 100644 index 0000000000..3fdbf93cdb --- /dev/null +++ b/contrib/postgres_fdw/fdwxact.conf @@ -0,0 +1,3 @@ +max_prepared_transactions = 3 +max_prepared_foreign_transactions = 3 +max_foreign_transaction_resolvers = 2 diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index bdc21b36d1..9c63f0aa3b 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -14,6 +14,7 @@ #include <limits.h> +#include "access/fdwxact.h" #include "access/htup_details.h" #include "access/sysattr.h" #include "access/table.h" @@ -504,7 +505,6 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo, const PgFdwRelationInfo *fpinfo_o, const PgFdwRelationInfo *fpinfo_i); - /* * Foreign-data wrapper handler function: return a struct with pointers * to my callback routines. 
@@ -558,6 +558,11 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for foreign transactions */ + routine->PrepareForeignTransaction = postgresPrepareForeignTransaction; + routine->CommitForeignTransaction = postgresCommitForeignTransaction; + routine->RollbackForeignTransaction = postgresRollbackForeignTransaction; + PG_RETURN_POINTER(routine); } @@ -1434,7 +1439,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - fsstate->conn = GetConnection(user, false); + fsstate->conn = GetConnection(user->umid, false, true); /* Assign a unique ID for my cursor */ fsstate->cursor_number = GetCursorNumber(fsstate->conn); @@ -2372,7 +2377,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags) * Get connection to the foreign server. Connection manager will * establish new connection if necessary. */ - dmstate->conn = GetConnection(user, false); + dmstate->conn = GetConnection(user->umid, false, true); /* Update the foreign-join-related fields. */ if (fsplan->scan.scanrelid == 0) @@ -2746,7 +2751,7 @@ estimate_path_cost_size(PlannerInfo *root, false, &retrieved_attrs, NULL); /* Get the remote estimate */ - conn = GetConnection(fpinfo->user, false); + conn = GetConnection(fpinfo->user->umid, false, true); get_remote_estimate(sql.data, conn, &rows, &width, &startup_cost, &total_cost); ReleaseConnection(conn); @@ -3566,7 +3571,7 @@ create_foreign_modify(EState *estate, user = GetUserMapping(userid, table->serverid); /* Open connection; report that we'll create a prepared statement. */ - fmstate->conn = GetConnection(user, true); + fmstate->conn = GetConnection(user->umid, true, true); fmstate->p_name = NULL; /* prepared statement not made yet */ /* Set up remote query information. 
*/ @@ -4441,7 +4446,7 @@ postgresAnalyzeForeignTable(Relation relation, */ table = GetForeignTable(RelationGetRelid(relation)); user = GetUserMapping(relation->rd_rel->relowner, table->serverid); - conn = GetConnection(user, false); + conn = GetConnection(user->umid, false, true); /* * Construct command to get page count for relation. @@ -4527,7 +4532,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel, table = GetForeignTable(RelationGetRelid(relation)); server = GetForeignServer(table->serverid); user = GetUserMapping(relation->rd_rel->relowner, table->serverid); - conn = GetConnection(user, false); + conn = GetConnection(user->umid, false, true); /* * Construct cursor that retrieves whole rows from remote. @@ -4755,7 +4760,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid) */ server = GetForeignServer(serverOid); mapping = GetUserMapping(GetUserId(), server->serverid); - conn = GetConnection(mapping, false); + conn = GetConnection(mapping->umid, false, true); /* Don't attempt to import collation if remote server hasn't got it */ if (PQserverVersion(conn) < 90100) diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index ea052872c3..d7ba45c8d2 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -13,6 +13,7 @@ #ifndef POSTGRES_FDW_H #define POSTGRES_FDW_H +#include "access/fdwxact.h" #include "foreign/foreign.h" #include "lib/stringinfo.h" #include "libpq-fe.h" @@ -129,7 +130,7 @@ extern int set_transmission_modes(void); extern void reset_transmission_modes(int nestlevel); /* in connection.c */ -extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt); +extern PGconn *GetConnection(Oid umid, bool will_prep_stmt, bool start_transaction); extern void ReleaseConnection(PGconn *conn); extern unsigned int GetCursorNumber(PGconn *conn); extern unsigned int GetPrepStmtNumber(PGconn *conn); @@ -137,6 +138,9 @@ extern PGresult 
*pgfdw_get_result(PGconn *conn, const char *query); extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query); extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn, bool clear, const char *sql); +extern void postgresPrepareForeignTransaction(FdwXactRslvState *state); +extern void postgresCommitForeignTransaction(FdwXactRslvState *state); +extern void postgresRollbackForeignTransaction(FdwXactRslvState *state); /* in option.c */ extern int ExtractConnectionOptions(List *defelems, @@ -203,6 +207,7 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, bool is_subquery, List **retrieved_attrs, List **params_list); extern const char *get_jointype_name(JoinType jointype); +extern bool server_uses_twophase_commit(ForeignServer *server); /* in shippable.c */ extern bool is_builtin(Oid objectId); diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql index 1c5c37b783..572077c57c 100644 --- a/contrib/postgres_fdw/sql/postgres_fdw.sql +++ b/contrib/postgres_fdw/sql/postgres_fdw.sql @@ -15,6 +15,10 @@ DO $d$ OPTIONS (dbname '$$||current_database()||$$', port '$$||current_setting('port')||$$' )$$; + EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw + OPTIONS (dbname '$$||current_database()||$$', + port '$$||current_setting('port')||$$' + )$$; END; $d$; @@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1 OPTIONS (user 'value', password 'value'); CREATE USER MAPPING FOR CURRENT_USER SERVER loopback; CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2; +CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3; -- =================================================================== -- create objects used through FDW loopback server @@ -56,6 +61,14 @@ CREATE TABLE "S 1"."T 4" ( c3 text, CONSTRAINT t4_pkey PRIMARY KEY (c1) ); +CREATE TABLE "S 1"."T 5" ( + c1 int NOT NULL +); + +CREATE TABLE "S 1"."T 6" ( + c1 int NOT NULL, + CONSTRAINT t6_pkey PRIMARY KEY (c1) 
+); -- Disable autovacuum for these tables to avoid unexpected effects of that ALTER TABLE "S 1"."T 1" SET (autovacuum_enabled = 'false'); @@ -94,6 +107,7 @@ ANALYZE "S 1"."T 1"; ANALYZE "S 1"."T 2"; ANALYZE "S 1"."T 3"; ANALYZE "S 1"."T 4"; +ANALYZE "S 1"."T 5"; -- =================================================================== -- create foreign tables @@ -142,6 +156,15 @@ CREATE FOREIGN TABLE ft6 ( c3 text ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4'); +CREATE FOREIGN TABLE ft7_2pc ( + c1 int NOT NULL +) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5'); + +CREATE FOREIGN TABLE ft8_2pc ( + c1 int NOT NULL +) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5'); + + -- =================================================================== -- tests for validator -- =================================================================== @@ -2480,9 +2503,98 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 -- Clean-up RESET enable_partitionwise_aggregate; --- Two-phase transactions are not supported. +-- =================================================================== +-- test distributed atomic commit across foreign servers +-- =================================================================== + +-- Enable atomic commit +SET foreign_twophase_commit TO 'required'; + +-- Modify single foreign server and then commit and rollback. +BEGIN; +INSERT INTO ft7_2pc VALUES(1); +COMMIT; +SELECT * FROM ft7_2pc; + BEGIN; -SELECT count(*) FROM ft1; --- error here -PREPARE TRANSACTION 'fdw_tpc'; +INSERT INTO ft7_2pc VALUES(1); ROLLBACK; +SELECT * FROM ft7_2pc; + +-- Modify two servers then commit and rollback. This requires to use 2PC. 
+BEGIN; +INSERT INTO ft7_2pc VALUES(2); +INSERT INTO ft8_2pc VALUES(2); +COMMIT; +SELECT * FROM ft8_2pc; + +BEGIN; +INSERT INTO ft7_2pc VALUES(2); +INSERT INTO ft8_2pc VALUES(2); +ROLLBACK; +SELECT * FROM ft8_2pc; + +-- Modify both local data and 2PC-capable server then commit and rollback. +-- This also requires to use 2PC. +BEGIN; +INSERT INTO ft7_2pc VALUES(3); +INSERT INTO "S 1"."T 6" VALUES (3); +COMMIT; +SELECT * FROM ft7_2pc; +SELECT * FROM "S 1"."T 6"; + +BEGIN; +INSERT INTO ft7_2pc VALUES(3); +INSERT INTO "S 1"."T 6" VALUES (3); +ROLLBACK; +SELECT * FROM ft7_2pc; +SELECT * FROM "S 1"."T 6"; + +-- Modify foreign server and raise an error. No data changed. +BEGIN; +INSERT INTO ft7_2pc VALUES(4); +INSERT INTO ft8_2pc VALUES(NULL); -- violation +ROLLBACK; +SELECT * FROM ft8_2pc; + +BEGIN; +INSERT INTO ft7_2pc VALUES (5); +INSERT INTO ft8_2pc VALUES (5); +SAVEPOINT S1; +INSERT INTO ft7_2pc VALUES (6); +INSERT INTO ft8_2pc VALUES (6); +ROLLBACK TO S1; +COMMIT; +SELECT * FROM ft7_2pc; +SELECT * FROM ft8_2pc; +RELEASE SAVEPOINT S1; + +-- When set to 'disabled', we can commit it +SET foreign_twophase_commit TO 'disabled'; +BEGIN; +INSERT INTO ft7_2pc VALUES(8); +INSERT INTO ft8_2pc VALUES(8); +COMMIT; -- success +SELECT * FROM ft7_2pc; +SELECT * FROM ft8_2pc; + +SET foreign_twophase_commit TO 'required'; + +-- Commit and rollback foreign transactions that are part of +-- prepare transaction. 
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+
+BEGIN;
+INSERT INTO ft7_2pc VALUES(9);
+INSERT INTO ft8_2pc VALUES(9);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_2pc;
+
+-- No entry remained
+SELECT count(*) FROM pg_foreign_xacts;
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 1d4bafd9f0..362f7be9e3 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -441,6 +441,43 @@
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers, each
+    remote transaction is committed or aborted independently; some remote
+    transactions may fail to commit while others commit successfully.
+    This behavior can be overridden using the following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when committing a transaction.  This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"/> must be set to
+       a nonzero value on the local server, and
+       <xref linkend="guc-max-prepared-transactions"/> must be set to a
+       nonzero value on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
 </sect2>
 
 <sect2>
@@ -468,6 +505,14 @@ managed by creating corresponding remote savepoints.
   </para>
+
+  <para>
+   <filename>postgres_fdw</filename> uses the two-phase commit protocol
+   during transaction commit or abort when atomic commit of the
+   distributed transaction (see <xref linkend="atomic-commit"/>) is
+   required. The remote server should therefore have
+   <xref linkend="guc-max-prepared-transactions"/> set to at least 1 so
+   that it can prepare the remote transaction.
+  </para>
+
   <para>
    The remote transaction uses <literal>SERIALIZABLE</literal>
    isolation level when the local transaction has
    <literal>SERIALIZABLE</literal>
-- 
2.23.0

From 639d9156323594430ec4b2217a95bfcf08195e9d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Thu, 5 Dec 2019 17:01:26 +0900
Subject: [PATCH v26 5/5] Add regression tests for atomic commit.

Original Author: Masahiko Sawada <sawada.mshk@gmail.com>
---
 src/test/recovery/Makefile         |   2 +-
 src/test/recovery/t/016_fdwxact.pl | 175 +++++++++++++++++++++++++++++
 src/test/regress/pg_regress.c      |  13 ++-
 3 files changed, 185 insertions(+), 5 deletions(-)
 create mode 100644 src/test/recovery/t/016_fdwxact.pl

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index e66e69521f..b17429f501 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pageinspect contrib/postgres_fdw

 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/016_fdwxact.pl b/src/test/recovery/t/016_fdwxact.pl
new file mode 100644
index 0000000000..9af9bb81dc
--- /dev/null
+++ b/src/test/recovery/t/016_fdwxact.pl
@@ -0,0 +1,175 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
+max_foreign_transaction_resolvers = 2
+foreign_transaction_resolver_timeout = 0
+foreign_transaction_resolution_retry_interval = 5s
+foreign_twophase_commit = on
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs2->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign servers on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE EXTENSION postgres_fdw
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port');
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port');
+));
+
+# Create user mapping on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+));
+
+# Create tables on foreign nodes and import them to the master node
+$node_fs1->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 (c int);
+));
+$node_fs2->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 (c int);
+));
+$node_master->safe_psql('postgres', qq(
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE l_table (c int);
+));
+
+# Switch to synchronous replication
+$node_master->safe_psql('postgres', qq(
+ALTER SYSTEM SET synchronous_standby_names ='*';
+));
+$node_master->reload;
+
+my $result;
+
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node. Check if we can commit and rollback the foreign transactions
+# after the normal recovery.
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->stop;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after recovery');
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node immediately. Check if we can commit and rollback the foreign
+# transactions after the crash recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (3);
+INSERT INTO t2 VALUES (3);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (4);
+INSERT INTO t2 VALUES (4);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the crash recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after crash recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after crash recovery');
+
+#
+# Commit transaction involving foreign servers and shutdown the master node
+# immediately before checkpoint. Check that WAL replay cleans up
+# its shared memory state and releases locks while replaying transaction commit.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (5);
+INSERT INTO t2 VALUES (5);
+COMMIT;
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', qq(
+SELECT count(*) FROM pg_foreign_xacts;
+));
+is($result, 0, "Cleanup of shared memory state for foreign transactions");
+
+#
+# Check if the standby node can process prepared foreign transaction
+# after promotion.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (6);
+INSERT INTO t2 VALUES (6);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (7);
+INSERT INTO t2 VALUES (7);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', qq(COMMIT PREPARED 'gxid1';));
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', qq(ROLLBACK PREPARED 'gxid2';));
+is($result, 0, 'Rollback foreign transaction after promotion');
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 297b8fbd6f..82a1e7d541 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2336,9 +2336,12 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
	 * Adjust the default postgresql.conf for regression testing. The user
	 * can specify a file to be appended; in any case we expand logging
	 * and set max_prepared_transactions to enable testing of prepared
-	 * xacts. (Note: to reduce the probability of unexpected shmmax
-	 * failures, don't set max_prepared_transactions any higher than
-	 * actually needed by the prepared_xacts regression test.)
+	 * xacts. We also set max_prepared_foreign_transactions and
+	 * max_foreign_transaction_resolvers to enable testing of transaction
+	 * involving multiple foreign servers. (Note: to reduce the probability
+	 * of unexpected shmmax failures, don't set max_prepared_transactions
+	 * any higher than actually needed by the prepared_xacts regression
+	 * test.)
	 */
	snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
	pg_conf = fopen(buf, "a");
@@ -2353,7 +2356,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
	fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
	fputs("log_lock_waits = on\n", pg_conf);
	fputs("log_temp_files = 128kB\n", pg_conf);
-	fputs("max_prepared_transactions = 2\n", pg_conf);
+	fputs("max_prepared_transactions = 3\n", pg_conf);
+	fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
+	fputs("max_foreign_transaction_resolvers = 2\n", pg_conf);

	for (sl = temp_configs; sl != NULL; sl = sl->next)
	{
-- 
2.23.0
On Fri, 6 Dec 2019 at 17:33, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Hello.
>
> This is the rebased (and a bit fixed) version of the patch. This
> applies on the master HEAD and passes all provided tests.
>
> I took over this work from Sawada-san. I'll begin with reviewing the
> current patch.
>

The previous patch set no longer applies cleanly to the current HEAD.
I've updated and slightly modified the code.

This patch set has been marked as Waiting on Author for a long time,
but the correct status now is Needs Review. The patch was actually
updated to incorporate all review comments, but it was not rebased
actively.

The mail[1] I posted before would be helpful for understanding the
current patch design, and there are a README in the patch and a wiki
page[2].

I've marked this as Needs Review.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDn98axH1bEoMnte%2BS7WWR%3DnsmOpjz1WGH-NvJi4aLu3Q%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I just had a quick look at the 0001 and 0002 patches; here are a few suggestions.
patch: v27-0001:
Typo: s/non-temprary/non-temporary
----
patch: v27-0002: (Note: the left-hand number is the line number in the v27-0002 patch):
138 +PostgreSQL's the global transaction manager (GTM), as a distributed transaction
139 +participant The registered foreign transactions are tracked until the end of
Full stop "." is missing after "participant"
174 +API Contract With Transaction Management Callback Functions
Can we just say "Transaction Management Callback Functions"? TBH, I am not sure that I understand this title.
203 +processing foreign transaction (i.g. preparing, committing or aborting) the
Do you mean "i.e." instead of "i.g."?
269 + * RollbackForeignTransactionAPI. Registered participant servers are identified
Add a space between RollbackForeignTransaction and API.
292 + * automatically so must be processed manually using by pg_resovle_fdwxact()
Do you mean pg_resolve_foreign_xact() here?
320 + * the foreign transaction is authorized to update the fields from its own
321 + * one.
322 +
323 + * Therefore, before doing PREPARE, COMMIT PREPARED or ROLLBACK PREPARED a
Please add asterisk '*' on line#322.
816 +static void
817 +FdwXactPrepareForeignTransactions(void)
818 +{
819 + ListCell *lcell;
Let's have this variable name as "lc" like elsewhere.
1036 + ereport(ERROR, (errmsg("could not insert a foreign transaction entry"),
1037 + errdetail("duplicate entry with transaction id %u, serverid %u, userid %u",
1038 + xid, serverid, userid)));
1039 + }
Incorrect formatting.
1166 +/*
1167 + * Return true and set FdwXactAtomicCommitReady to true if the current transaction
Do you mean ForeignTwophaseCommitIsRequired instead of FdwXactAtomicCommitReady?
3529 +
3530 +/*
3531 + * FdwXactLauncherRegister
3532 + * Register a background worker running the foreign transaction
3533 + * launcher.
3534 + */
This prolog style is not consistent with the other functions in the file.
And here are a few typos:
s/conssitent/consistent
s/consisnts/consist
s/Foriegn/Foreign
s/tranascation/transaction
s/itselft/itself
s/rolbacked/rollbacked
s/trasaction/transaction
s/transactio/transaction
s/automically/automatically
s/CommitForeignTransaciton/CommitForeignTransaction
s/Similary/Similarly
s/FDWACT_/FDWXACT_
s/dink/disk
s/requried/required
s/trasactions/transactions
s/prepread/prepared
s/preapred/prepared
s/beging/being
s/gxact/xact
s/in-dbout/in-doubt
s/respecitively/respectively
s/transction/transaction
s/idenetifier/identifier
s/identifer/identifier
s/checkpoint'S/checkpoint's
s/fo/of
s/transcation/transaction
s/trasanction/transaction
s/non-temprary/non-temporary
s/resovler_internal.h/resolver_internal.h
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
Hi Sawada San,

I have a couple of comments on "v27-0002-Support-atomic-commit-among-multiple-foreign-ser.patch":

1- As part of the XLogReadRecord refactoring commit, the signature of
XLogReadRecord was changed, so the call to XLogReadRecord() needs a small
adjustment, i.e. in the function
XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len):
...
- record = XLogReadRecord(xlogreader, lsn, &errormsg);
+ XLogBeginRead(xlogreader, lsn);
+ record = XLogReadRecord(xlogreader, &errormsg);

2- In the register_fdwxact(..) function you are setting the
XACT_FLAGS_FDWNOPREPARE transaction flag when the register request comes in
for a foreign server that does not support two-phase commit, regardless of
the value of the 'bool modified' argument. And later, in
PreCommit_FdwXacts(), you error out when "foreign_twophase_commit" is set to
'required' just by looking at the XACT_FLAGS_FDWNOPREPARE flag, which I
think is not correct.

There is a possibility that the transaction might have only read from the
foreign servers that are not capable of handling transactions or two-phase
commit, while all other servers where we require an atomic commit are
capable of doing so.

If I am not missing something obvious here, then IMHO the
XACT_FLAGS_FDWNOPREPARE flag should only be set when the transaction
management/two-phase functionality is not available and the "modified"
argument is true in register_fdwxact().

Thanks

Best regards
Muhammad Usama
Highgo Software (Canada/China/Pakistan)

The new status of this patch is: Waiting on Author
On Tue, 11 Feb 2020 at 12:42, amul sul <sulamul@gmail.com> wrote:
>
> Hi Sawada san,
>
> I just had a quick look to 0001 and 0002 patch here is the few suggestions.
>
> [...]

Thank you for reviewing the patch! I've incorporated all comments in a
local branch.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 19 Feb 2020 at 07:55, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 11 Feb 2020 at 12:42, amul sul <sulamul@gmail.com> wrote:
> >
> > [...]
>
> Thank you for reviewing the patch! I've incorporated all comments in
> local branch.

Attached the updated version patch sets that incorporated review
comments from Amul and Muhammad.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Transactions involving multiple postgres foreignservers, take 2
On Tue, 18 Feb 2020 at 00:40, Muhammad Usama <m.usama@gmail.com> wrote:
>
> Hi Sawada San,
>
> I have a couple of comments on "v27-0002-Support-atomic-commit-among-multiple-foreign-ser.patch"
>
> [...]

Thank you for reviewing this patch! Your comments are incorporated in
the latest patch set I recently sent[1].

[1] https://www.postgresql.org/message-id/CA%2Bfd4k5ZcDvoiY_5c-mF1oDACS5nUWS7ppoiOwjCOnM%2BgrJO-Q%40mail.gmail.com

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Sawada San,

I have been further reviewing and testing the transaction involving
multiple server patches. Overall the patches are working as expected,
bar a few important exceptions. So, as discussed over the call, I have
fixed the issues I found during the testing and also rebased the patches
with the current head of the master branch. Can you please have a look
at the attached updated patches?

Below is the list of changes I have made on top of the V18 patches.

1- In register_fdwxact(), as we are just storing the callback function
pointers from FdwRoutine in the fdw_part structure, I think we can avoid
calling GetFdwRoutineByServerId() in TopMemoryContext.
So I have moved the MemoryContextSwitch to TopMemoryContext after the
GetFdwRoutineByServerId() call.

2- If the PrepareForeignTransaction functionality is not present in some
FDW, then during the registration process we should only set the
XACT_FLAGS_FDWNOPREPARE transaction flag if the modified flag is also
set for that server, as for a server that has not done any data
modification within the transaction we do not do two-phase commit
anyway.

3- I have moved foreign_twophase_commit in the sample file after
max_foreign_transaction_resolvers, because the default value of
max_foreign_transaction_resolvers is 0 and enabling
foreign_twophase_commit produces an error with the default
configuration-parameter positioning in postgresql.conf.
Also, the foreign_twophase_commit configuration was missing the comments
about allowed values in the sample config file.

4- Setting ForeignTwophaseCommitIsRequired in
is_foreign_twophase_commit_required() function does not seem to be the
correct place.
6- In prefer mode, we commit the transaction in single-phase if the server does not support two-phase commit.
So I have modified the flow a little bit and instead of doing a one-phase commit right away
7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
BEGIN
INSERT 0 1
INSERT 0 1
postgres=*# PREPARE TRANSACTION 'local_prepared';
PREPARE TRANSACTION
-------+-----+----------+--------+----------+----------+----------------------------
12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
(2 rows)
-- Now commit the prepared transaction
postgres=# COMMIT PREPARED 'local_prepared';
COMMIT PREPARED
-- Foreign prepared transactions associated with 'local_prepared' are not resolved
postgres=#
-------+-----+----------+--------+----------+----------+----------------------------
12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
(2 rows)
So to fix this in case of the two-phase transaction, the function checks the existence
9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
10- incorporated set_ps_display() signature change.
Attachment
On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
> Hi Sawada San,
>
> I have been further reviewing and testing the transaction involving multiple server patches.
> Overall the patches are working as expected bar a few important exceptions.
> So as discussed over the call I have fixed the issues I found during the testing
> and also rebased the patches with the current head of the master branch.
> So can you please have a look at the attached updated patches.
Thank you for reviewing and updating the patch!
>
> Below is the list of changes I have made on top of V18 patches.
>
> 1- In register_fdwxact(), As we are just storing the callback function pointers from
> FdwRoutine in fdw_part structure, So I think we can avoid calling
> GetFdwRoutineByServerId() in TopMemoryContext.
> So I have moved the MemoryContextSwitch to TopMemoryContext after the
> GetFdwRoutineByServerId() call.
Agreed.
>
>
> 2- If PrepareForeignTransaction functionality is not present in some FDW then
> during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> transaction flag if the modified flag is also set for that server. As for the server that has
> not done any data modification within the transaction we do not do two-phase commit anyway.
Agreed.
>
> 3- I have moved the foreign_twophase_commit in sample file after
> max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers
> is 0 and enabling the foreign_twophase_commit produces an error with default
> configuration parameter positioning in postgresql.conf
> Also, foreign_twophase_commit configuration was missing the comments
> about allowed values in the sample config file.
Sounds good. Agreed.
>
> 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> function does not seem to be the correct place. The reason being, even when
> is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> to true, we could still end up not using the two-phase commit in the case when some server does
> not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> function after doing the prepare transaction.
Agreed.
>
> 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> the two-phase commit. But instead of doing the single-phase commit right away,
> IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> on servers that support the two-phase. Since an error during a "PREPARE" stage would
> rollback the transaction and in that case, we would end up with committed transactions on
> the server that lacks the support of the two-phase commit.
When an error occurs before the local commit, a 2pc-unsupported
server could be rolled back or committed depending on the error
timing. On the other hand, all 2pc-supported servers are always rolled
back when an error occurs before the local commit. Therefore, even if
we change the order of COMMIT and PREPARE, it is still possible that
we will end up committing on the 2pc-unsupported servers while
rolling back the others, including 2pc-supported servers.
I guess the motivation of your change is that, since errors are likely
to happen while executing PREPARE on foreign servers, we can minimize
the possibility of rolling back 2pc-unsupported servers by deferring
their commit as much as possible. Is that right?
> So I have modified the flow a little bit and instead of doing a one-phase commit right away
> the servers that do not support a two-phase commit is added to another list and that list is
> processed after once we have successfully prepared all the transactions on two-phase supported
> foreign servers. Although this technique is also not bulletproof, still it is better than doing
> the one-phase commits before doing the PREPAREs.
Hmm, the current logic seems complex. Maybe we can just reverse the
order of COMMIT and PREPARE: do PREPARE on all 2pc-supported and
modified servers first, and then do COMMIT on the others?
>
> Also, I think we can improve on this one by throwing an error even in PREFER
> mode if there is more than one server that had data modified within the transaction
> and lacks the two-phase commit support.
>
IIUC the concept of PREFER mode is that the transaction uses 2pc only
for 2pc-supported servers. IOW, even if the transaction modifies data
on a 2pc-unsupported server we can proceed with the commit in PREFER
mode, which we cannot in REQUIRED mode. What is the motivation of your
above idea?
> 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> memory if fdw_part is removed from the list
I think at the end of the transaction we free the entries of the
FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
need to do that in PreCommit_FdwXacts()?
>
> 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
> (FdwXactParticipants == NIL). The problem with that was in the case of
> "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
> effectively the foreign prepared transactions(if any) associated with locally
> prepared transactions were never getting resolved automatically.
>
>
> postgres=# BEGIN;
> BEGIN
> INSERT INTO test_local VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s1 VALUES ( 2, 'TWO');
> INSERT 0 1
> INSERT INTO test_foreign_s2 VALUES ( 2, 'TWO');
> INSERT 0 1
> postgres=*# PREPARE TRANSACTION 'local_prepared';
> PREPARE TRANSACTION
>
> postgres=# select * from pg_foreign_xacts ;
> dbid | xid | serverid | userid | status | in_doubt | identifier
> -------+-----+----------+--------+----------+----------+----------------------------
> 12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
> 12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
> (2 rows)
>
> -- Now commit the prepared transaction
>
> postgres=# COMMIT PREPARED 'local_prepared';
>
> COMMIT PREPARED
>
> --Foreign prepared transactions associated with 'local_prepared' not resolved
>
> postgres=#
>
> postgres=# select * from pg_foreign_xacts ;
> dbid | xid | serverid | userid | status | in_doubt | identifier
> -------+-----+----------+--------+----------+----------+----------------------------
> 12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
> 12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
> (2 rows)
>
>
> So to fix this in case of the two-phase transaction, the function checks the existence
> of associated foreign prepared transactions before bailing out.
>
Good catch. But looking at your change, we should not accept the case
where FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) ==
false.
if (FdwXactParticipants == NIL)
{
/*
* If we are here because of COMMIT/ROLLBACK PREPARED then the
* FdwXactParticipants list would be empty. So we need to
* see if there are any foreign prepared transactions exists
* for this prepared transaction
*/
if (TwoPhaseExists(wait_xid))
{
List *foreign_trans = NIL;
foreign_trans = get_fdwxacts(MyDatabaseId,
wait_xid, InvalidOid, InvalidOid,
false, false, true);
if (foreign_trans == NIL)
return;
list_free(foreign_trans);
}
}
> 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> that was causing the crash during recovery.
Agreed.
>
> 10- incorporated set_ps_display() signature change.
Thanks.
Regarding other changes you did in v19 patch, I have some comments:
1.
+ ereport(LOG,
+         (errmsg("trying to %s the foreign transaction associated with transaction %u on server %u",
+                 fdwxact->status == FDWXACT_STATUS_COMMITTING ? "COMMIT" : "ABORT",
+                 fdwxact->local_xid, fdwxact->serverid)));
+
Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
2.
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
deleted file mode 120000
index ce8c21880c..0000000000
--- a/src/bin/pg_waldump/fdwxactdesc.c
+++ /dev/null
@@ -1 +0,0 @@
-../../../src/backend/access/rmgrdesc/fdwxactdesc.c
\ No newline at end of file
diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
new file mode 100644
index 0000000000..ce8c21880c
--- /dev/null
+++ b/src/bin/pg_waldump/fdwxactdesc.c
@@ -0,0 +1 @@
+../../../src/backend/access/rmgrdesc/fdwxactdesc.c
We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
3.
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1526,14 +1526,14 @@ postgres 27093 0.0 0.0 30096 2752 ?
Ss 11:34 0:00 postgres: ser
<entry><literal>SafeSnapshot</literal></entry>
<entry>Waiting for a snapshot for a <literal>READ ONLY
DEFERRABLE</literal> transaction.</entry>
</row>
- <row>
- <entry><literal>SyncRep</literal></entry>
- <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
- </row>
<row>
<entry><literal>FdwXactResolution</literal></entry>
<entry>Waiting for all foreign transaction participants to
be resolved during atomic commit among foreign servers.</entry>
</row>
+ <row>
+ <entry><literal>SyncRep</literal></entry>
+ <entry>Waiting for confirmation from remote server during
synchronous replication.</entry>
+ </row>
<row>
<entry morerows="4"><literal>Timeout</literal></entry>
<entry><literal>BaseBackupThrottle</literal></entry>
We need to move the entry of FdwXactResolution to right before
Hash/Batch/Allocating for alphabetical order.
I've incorporated the changes I agreed with into my local branch and
will incorporate the other changes after discussion. I'll also do more
testing and self-review and will submit the latest version of the patch.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Meanwhile, I found a couple of more small issues: one is a missing break
statement in pgstat_get_wait_ipc(), and secondly
fdwxact_relaunch_resolvers() could return an uninitialized value.
I am attaching a small patch for these changes that can be applied on top
of the existing patches.
Regards,
Muhammad Usama
Highgo Software
URL : http://www.highgo.ca
Attachment
On Tue, 28 Apr 2020 at 19:37, Muhammad Usama <m.usama@gmail.com> wrote: > > > > On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >> >> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote: >> > >> > Hi Sawada San, >> > >> > I have been further reviewing and testing the transaction involving multiple server patches. >> > Overall the patches are working as expected bar a few important exceptions. >> > So as discussed over the call I have fixed the issues I found during the testing >> > and also rebased the patches with the current head of the master branch. >> > So can you please have a look at the attached updated patches. >> >> Thank you for reviewing and updating the patch! >> >> > >> > Below is the list of changes I have made on top of V18 patches. >> > >> > 1- In register_fdwxact(), As we are just storing the callback function pointers from >> > FdwRoutine in fdw_part structure, So I think we can avoid calling >> > GetFdwRoutineByServerId() in TopMemoryContext. >> > So I have moved the MemoryContextSwitch to TopMemoryContext after the >> > GetFdwRoutineByServerId() call. >> >> Agreed. >> >> > >> > >> > 2- If PrepareForeignTransaction functionality is not present in some FDW then >> > during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE >> > transaction flag if the modified flag is also set for that server. As for the server that has >> > not done any data modification within the transaction we do not do two-phase commit anyway. >> >> Agreed. 
>> >> > >> > 3- I have moved the foreign_twophase_commit in sample file after >> > max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers >> > is 0 and enabling the foreign_twophase_commit produces an error with default >> > configuration parameter positioning in postgresql.conf >> > Also, foreign_twophase_commit configuration was missing the comments >> > about allowed values in the sample config file. >> >> Sounds good. Agreed. >> >> > >> > 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required() >> > function does not seem to be the correct place. The reason being, even when >> > is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired >> > to true, we could still end up not using the two-phase commit in the case when some server does >> > not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER >> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts() >> > function after doing the prepare transaction. >> >> Agreed. >> >> > >> > 6- In prefer mode, we commit the transaction in single-phase if the server does not support >> > the two-phase commit. But instead of doing the single-phase commit right away, >> > IMHO the better way is to wait until all the two-phase transactions are successfully prepared >> > on servers that support the two-phase. Since an error during a "PREPARE" stage would >> > rollback the transaction and in that case, we would end up with committed transactions on >> > the server that lacks the support of the two-phase commit. >> >> When an error occurred before the local commit, a 2pc-unsupported >> server could be rolled back or committed depending on the error >> timing. On the other hand all 2pc-supported servers are always rolled >> back when an error occurred before the local commit. 
Therefore even if >> we change the order of COMMIT and PREPARE it is still possible that we >> will end up committing the part of 2pc-unsupported servers while >> rolling back others including 2pc-supported servers. >> >> I guess the motivation of your change is that since errors are likely >> to happen during executing PREPARE on foreign servers, we can minimize >> the possibility of rolling back 2pc-unsupported servers by deferring >> the commit of 2pc-unsupported server as much as possible. Is that >> right? > > > Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported servers > after all the PREPAREs are successful is to minimize the chances of partial commits. > And as you mentioned there will still be chances of getting a partial commit even with > this approach but the probability of that would be less than what it is with the > current sequence. > > >> >> >> > So I have modified the flow a little bit and instead of doing a one-phase commit right away >> > the servers that do not support a two-phase commit is added to another list and that list is >> > processed after once we have successfully prepared all the transactions on two-phase supported >> > foreign servers. Although this technique is also not bulletproof, still it is better than doing >> > the one-phase commits before doing the PREPAREs. >> >> Hmm the current logic seems complex. Maybe we can just reverse the >> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and >> modified servers first and then do COMMIT on others? > > > Agreed, seems reasonable. >> >> >> > >> > Also, I think we can improve on this one by throwing an error even in PREFER >> > mode if there is more than one server that had data modified within the transaction >> > and lacks the two-phase commit support. >> > >> >> IIUC the concept of PREFER mode is that the transaction uses 2pc only >> for 2pc-supported servers. 
IOW, even if the transaction modifies on a >> 2pc-unsupported server we can proceed with the commit if in PREFER >> mode, which cannot if in REQUIRED mode. What is the motivation of your >> above idea? > > > I was thinking that we could change the behavior of PREFER mode such that we only allow > to COMMIT the transaction if the transaction needs to do a single-phase commit on one > server only. That way we can ensure that we would never end up with partial commit. > I think it's good to avoid a partial commit by using your idea but if we want to avoid a partial commit we can use the 'required' mode, which requires all participant servers to support 2pc. We throw an error if participant servers include even one 2pc-unsupported server is modified within the transaction. Of course if the participant node is only one 2pc-unsupported server it can use 1pc even in the 'required' mode. > One Idea in this regards would be to switch the local transaction to commit using 2pc > if there is a total of only one foreign server that does not support the 2pc in the transaction, > ensuring that 1-pc commit servers should always be less than or equal to 1. and if there are more > than one foreign server requires 1-pc then we just throw an error. I might be missing your point but I suppose this idea is to do something like the following? 1. prepare the local transaction 2. commit the foreign transaction on 2pc-unsupported server 3. commit the prepared local transaction > > However having said that, I am not 100% sure if its a good or an acceptable Idea, and > I am okay with continuing with the current behavior of PREFER mode if we put it in the > document that this mode can cause a partial commit. There will three types of servers: (a) a server doesn't support any transaction API, (b) a server supports only commit and rollback API and (c) a server supports all APIs (commit, rollback and prepare). 
Currently postgres transaction manager manages only server-(b) and server-(c), adds them to FdwXactParticipants. I'm considering changing the code so that it adds also server-(a) to FdwXactParticipants, in order to track the number of server-(a) involved in the transaction. But it doesn't insert FdwXact entry for it, and manage transactions on these servers. The reason is this; if we want to have the 'required' mode strictly require all participant servers to support 2pc, we should use 2pc when (# of server-(a) + # of server-(b) + # of server-(c)) >= 2. But since currently we just track the modification on a server-(a) by a flag we cannot handle the case where two server-(a) are modified in the transaction. On the other hand, if we don't consider server-(a) the transaction could end up with a partial commit when a server-(a) participates in the transaction. Therefore I'm thinking of the above change so that the transaction manager can ensure that a partial commit doesn't happen in the 'required' mode. What do you think? > >> >> > 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the >> > memory if fdw_part is removed from the list >> >> I think at the end of the transaction we free entries of >> FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we >> need to do that in PreCommit_FdwXacts()? > > > Correct me if I am wrong, The fdw_part structures are created in TopMemoryContext > and if that fdw_part structure is removed from the list at pre_commit stage > (because we did 1-PC COMMIT on it) then it would leak memory. The fdw_part structures are created in TopTransactionContext so these are freed at the end of the transaction. > >> >> > >> > 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds >> > (FdwXactParticipants == NIL). 
>> > The problem with that was in the case of
>> > "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
>> > effectively the foreign prepared transactions (if any) associated with locally
>> > prepared transactions were never getting resolved automatically.
>> >
>> > postgres=# BEGIN;
>> > BEGIN
>> > INSERT INTO test_local VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > INSERT INTO test_foreign_s1 VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > INSERT INTO test_foreign_s2 VALUES ( 2, 'TWO');
>> > INSERT 0 1
>> > postgres=*# PREPARE TRANSACTION 'local_prepared';
>> > PREPARE TRANSACTION
>> >
>> > postgres=# select * from pg_foreign_xacts ;
>> >  dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
>> > -------+-----+----------+--------+----------+----------+----------------------------
>> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
>> > (2 rows)
>> >
>> > -- Now commit the prepared transaction
>> >
>> > postgres=# COMMIT PREPARED 'local_prepared';
>> > COMMIT PREPARED
>> >
>> > -- Foreign prepared transactions associated with 'local_prepared' not resolved
>> >
>> > postgres=# select * from pg_foreign_xacts ;
>> >  dbid  | xid | serverid | userid |  status  | in_doubt |         identifier
>> > -------+-----+----------+--------+----------+----------+----------------------------
>> >  12929 | 515 |    16389 |     10 | prepared | f        | fx_1339567411_515_16389_10
>> >  12929 | 515 |    16391 |     10 | prepared | f        | fx_1963224020_515_16391_10
>> > (2 rows)
>> >
>> > So to fix this in case of the two-phase transaction, the function checks the existence
>> > of associated foreign prepared transactions before bailing out.
>> >
>>
>> Good catch. But looking at your change, we should not accept the case
>> where FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) ==
>> false.
>>
>>     if (FdwXactParticipants == NIL)
>>     {
>>         /*
>>          * If we are here because of COMMIT/ROLLBACK PREPARED then the
>>          * FdwXactParticipants list would be empty. So we need to
>>          * see if there are any foreign prepared transactions exists
>>          * for this prepared transaction
>>          */
>>         if (TwoPhaseExists(wait_xid))
>>         {
>>             List *foreign_trans = NIL;
>>
>>             foreign_trans = get_fdwxacts(MyDatabaseId,
>>                                          wait_xid, InvalidOid, InvalidOid,
>>                                          false, false, true);
>>
>>             if (foreign_trans == NIL)
>>                 return;
>>             list_free(foreign_trans);
>>         }
>>     }
>>
>
> Sorry my bad, its a mistake on my part. we should just return from the function when
> FdwXactParticipants == NULL but TwoPhaseExists(wait_xid) == false.
>
>     if (TwoPhaseExists(wait_xid))
>     {
>         List *foreign_trans = NIL;
>
>         foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid, InvalidOid, InvalidOid,
>                                      false, false, true);
>
>         if (foreign_trans == NIL)
>             return;
>         list_free(foreign_trans);
>     }
>     else
>         return;
>
>>
>> > 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
>> > that was causing the crash during recovery.
>>
>> Agreed.
>>
>> >
>> > 10- incorporated set_ps_display() signature change.
>>
>> Thanks.
>>
>> Regarding other changes you did in v19 patch, I have some comments:
>>
>> 1.
>> +            ereport(LOG,
>> +                    (errmsg("trying to %s the foreign transaction associated with transaction %u on server %u",
>> +                            fdwxact->status == FDWXACT_STATUS_COMMITTING ? "COMMIT" : "ABORT",
>> +                            fdwxact->local_xid, fdwxact->serverid)));
>> +
>>
>> Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
>
>
> That change was not intended to get into the patch file. I had done it during testing to
> quickly get info on which way the transaction is going to be resolved.
>
>>
>> 2.
>> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
>> deleted file mode 120000
>> index ce8c21880c..0000000000
>> --- a/src/bin/pg_waldump/fdwxactdesc.c
>> +++ /dev/null
>> @@ -1 +0,0 @@
>> -../../../src/backend/access/rmgrdesc/fdwxactdesc.c
>> \ No newline at end of file
>> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
>> new file mode 100644
>> index 0000000000..ce8c21880c
>> --- /dev/null
>> +++ b/src/bin/pg_waldump/fdwxactdesc.c
>> @@ -0,0 +1 @@
>> +../../../src/backend/access/rmgrdesc/fdwxactdesc.c
>>
>> We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
>
>
> Again sorry! that was an oversight on my part.
>
>>
>> 3.
>> --- a/doc/src/sgml/monitoring.sgml
>> +++ b/doc/src/sgml/monitoring.sgml
>> @@ -1526,14 +1526,14 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
>>          <entry><literal>SafeSnapshot</literal></entry>
>>          <entry>Waiting for a snapshot for a <literal>READ ONLY
>>          DEFERRABLE</literal> transaction.</entry>
>>         </row>
>> -       <row>
>> -        <entry><literal>SyncRep</literal></entry>
>> -        <entry>Waiting for confirmation from remote server during
>> synchronous replication.</entry>
>> -       </row>
>>         <row>
>>          <entry><literal>FdwXactResolution</literal></entry>
>>          <entry>Waiting for all foreign transaction participants to
>> be resolved during atomic commit among foreign servers.</entry>
>>         </row>
>> +       <row>
>> +        <entry><literal>SyncRep</literal></entry>
>> +        <entry>Waiting for confirmation from remote server during
>> synchronous replication.</entry>
>> +       </row>
>>         <row>
>>          <entry morerows="4"><literal>Timeout</literal></entry>
>>          <entry><literal>BaseBackupThrottle</literal></entry>
>>
>> We need to move the entry of FdwXactResolution to right before
>> Hash/Batch/Allocating for alphabetical order.
>
>
> Agreed!
>>
>>
>> I've incorporated your changes I agreed with to my local branch and
>> will incorporate other changes after discussion.
>> I'll also do more
>> test and self-review and will submit the latest version patch.
>>
>
> Meanwhile, I found a couple of more small issues. One is a missing break statement
> in pgstat_get_wait_ipc(), and secondly fdwxact_relaunch_resolvers()
> could return an uninitialized value.
> I am attaching a small patch for these changes that can be applied on top of the
> existing patches.

Thank you for the patch!

I'm updating the patches because the current behavior in the error case
would not be good. For example, when an error occurs in the prepare
phase, prepared transactions are left as in-doubt transactions, and
these transactions are not handled by the resolver process. That means
a user could need to resolve these transactions manually after every
abort, which is not good. In the abort case, I think that prepared
transactions can be resolved by the backend itself, rather than
leaving them for the resolver. I'll submit the updated patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 30 Apr 2020 at 20:43, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> In abort case, I think that prepared
> transactions can be resolved by the backend itself, rather than
> leaving them for the resolver. I'll submit the updated patch.
>

I've attached the latest version patch set which includes some changes
from the previous version:

* I've added regression tests that cover all types of FDW
implementations. There are three types of FDW: an FDW that doesn't
support any transaction APIs, an FDW that supports only commit and
rollback APIs, and an FDW that supports all (prepare, commit and
rollback) APIs. src/test/module/test_fdwxact contains those FDW
implementations for tests, and tests some cases where a transaction
reads/writes data on various types of foreign servers.

* Also, test_fdwxact has TAP tests that check failure cases. The test
FDW implementation has the ability to inject an error or panic into
the prepare or commit phase. Using it, the TAP tests check whether
distributed transactions can be committed or rolled back even in
failure cases.

* When foreign_twophase_commit = 'required', the transaction commit
fails if the transaction modified data on even one server not
supporting the prepare API. Previously, we used to ignore servers that
don't support any transaction API, but now we check them too, to
strictly require all involved foreign servers to support all
transaction APIs.

* The transaction resolver process resolves in-doubt transactions
automatically.

* Incorporated comments from Muhammad Usama.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 30 Apr 2020 at 20:43, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 28 Apr 2020 at 19:37, Muhammad Usama <m.usama@gmail.com> wrote:
> >
> >
> >
> > On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
> >>
> >> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama <m.usama@gmail.com> wrote:
> >> >
> >> > Hi Sawada San,
> >> >
> >> > I have been further reviewing and testing the transaction involving multiple server patches.
> >> > Overall the patches are working as expected bar a few important exceptions.
> >> > So as discussed over the call I have fixed the issues I found during the testing
> >> > and also rebased the patches with the current head of the master branch.
> >> > So can you please have a look at the attached updated patches.
> >>
> >> Thank you for reviewing and updating the patch!
> >>
> >> >
> >> > Below is the list of changes I have made on top of V18 patches.
> >> >
> >> > 1- In register_fdwxact(), As we are just storing the callback function pointers from
> >> > FdwRoutine in fdw_part structure, So I think we can avoid calling
> >> > GetFdwRoutineByServerId() in TopMemoryContext.
> >> > So I have moved the MemoryContextSwitch to TopMemoryContext after the
> >> > GetFdwRoutineByServerId() call.
> >>
> >> Agreed.
> >>
> >> >
> >> >
> >> > 2- If PrepareForeignTransaction functionality is not present in some FDW then
> >> > during the registration process we should only set the XACT_FLAGS_FDWNOPREPARE
> >> > transaction flag if the modified flag is also set for that server. As for the server that has
> >> > not done any data modification within the transaction we do not do two-phase commit anyway.
> >>
> >> Agreed.
> >>
> >> >
> >> > 3- I have moved the foreign_twophase_commit in sample file after
> >> > max_foreign_transaction_resolvers because the default value of max_foreign_transaction_resolvers
> >> > is 0 and enabling the foreign_twophase_commit produces an error with default
> >> > configuration parameter positioning in postgresql.conf
> >> > Also, foreign_twophase_commit configuration was missing the comments
> >> > about allowed values in the sample config file.
> >>
> >> Sounds good. Agreed.
> >>
> >> >
> >> > 4- Setting ForeignTwophaseCommitIsRequired in is_foreign_twophase_commit_required()
> >> > function does not seem to be the correct place. The reason being, even when
> >> > is_foreign_twophase_commit_required() returns true after setting ForeignTwophaseCommitIsRequired
> >> > to true, we could still end up not using the two-phase commit in the case when some server does
> >> > not support two-phase commit and foreign_twophase_commit is set to FOREIGN_TWOPHASE_COMMIT_PREFER
> >> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to PreCommit_FdwXacts()
> >> > function after doing the prepare transaction.
> >>
> >> Agreed.
> >>
> >> >
> >> > 6- In prefer mode, we commit the transaction in single-phase if the server does not support
> >> > the two-phase commit. But instead of doing the single-phase commit right away,
> >> > IMHO the better way is to wait until all the two-phase transactions are successfully prepared
> >> > on servers that support the two-phase. Since an error during a "PREPARE" stage would
> >> > rollback the transaction and in that case, we would end up with committed transactions on
> >> > the server that lacks the support of the two-phase commit.
> >>
> >> When an error occurred before the local commit, a 2pc-unsupported
> >> server could be rolled back or committed depending on the error
> >> timing. On the other hand all 2pc-supported servers are always rolled
> >> back when an error occurred before the local commit. Therefore even if
> >> we change the order of COMMIT and PREPARE it is still possible that we
> >> will end up committing the part of 2pc-unsupported servers while
> >> rolling back others including 2pc-supported servers.
> >>
> >> I guess the motivation of your change is that since errors are likely
> >> to happen during executing PREPARE on foreign servers, we can minimize
> >> the possibility of rolling back 2pc-unsupported servers by deferring
> >> the commit of 2pc-unsupported server as much as possible. Is that
> >> right?
> >
> >
> > Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported servers
> > after all the PREPAREs are successful is to minimize the chances of partial commits.
> > And as you mentioned there will still be chances of getting a partial commit even with
> > this approach but the probability of that would be less than what it is with the
> > current sequence.
> >
> >
> >>
> >>
> >> > So I have modified the flow a little bit and instead of doing a one-phase commit right away
> >> > the servers that do not support a two-phase commit is added to another list and that list is
> >> > processed after once we have successfully prepared all the transactions on two-phase supported
> >> > foreign servers. Although this technique is also not bulletproof, still it is better than doing
> >> > the one-phase commits before doing the PREPAREs.
> >>
> >> Hmm the current logic seems complex. Maybe we can just reverse the
> >> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
> >> modified servers first and then do COMMIT on others?
> >
> >
> > Agreed, seems reasonable.
> >>
> >>
> >> >
> >> > Also, I think we can improve on this one by throwing an error even in PREFER
> >> > mode if there is more than one server that had data modified within the transaction
> >> > and lacks the two-phase commit support.
> >> >
> >>
> >> IIUC the concept of PREFER mode is that the transaction uses 2pc only
> >> for 2pc-supported servers. IOW, even if the transaction modifies on a
> >> 2pc-unsupported server we can proceed with the commit if in PREFER
> >> mode, which cannot if in REQUIRED mode. What is the motivation of your
> >> above idea?
> >
> >
> > I was thinking that we could change the behavior of PREFER mode such that we only allow
> > to COMMIT the transaction if the transaction needs to do a single-phase commit on one
> > server only. That way we can ensure that we would never end up with partial commit.
> >
>
> I think it's good to avoid a partial commit by using your idea but if
> we want to avoid a partial commit we can use the 'required' mode,
> which requires all participant servers to support 2pc. We throw an
> error if participant servers include even one 2pc-unsupported server
> is modified within the transaction. Of course if the participant node
> is only one 2pc-unsupported server it can use 1pc even in the
> 'required' mode.
>
> > One Idea in this regards would be to switch the local transaction to commit using 2pc
> > if there is a total of only one foreign server that does not support the 2pc in the transaction,
> > ensuring that 1-pc commit servers should always be less than or equal to 1. and if there are more
> > than one foreign server requires 1-pc then we just throw an error.
>
> I might be missing your point but I suppose this idea is to do
> something like the following?
>
> 1. prepare the local transaction
> 2. commit the foreign transaction on 2pc-unsupported server
> 3. commit the prepared local transaction
>
> >
> > However having said that, I am not 100% sure if its a good or an acceptable Idea, and
> > I am okay with continuing with the current behavior of PREFER mode if we put it in the
> > document that this mode can cause a partial commit.
>
> There will three types of servers: (a) a server doesn't support any
> transaction API, (b) a server supports only commit and rollback API
> and (c) a server supports all APIs (commit, rollback and prepare).
> Currently postgres transaction manager manages only server-(b) and
> server-(c), adds them to FdwXactParticipants. I'm considering changing
> the code so that it adds also server-(a) to FdwXactParticipants, in
> order to track the number of server-(a) involved in the transaction.
> But it doesn't insert FdwXact entry for it, and manage transactions on
> these servers.
>
> The reason is this; if we want to have the 'required' mode strictly
> require all participant servers to support 2pc, we should use 2pc when
> (# of server-(a) + # of server-(b) + # of server-(c)) >= 2. But since
> currently we just track the modification on a server-(a) by a flag we
> cannot handle the case where two server-(a) are modified in the
> transaction. On the other hand, if we don't consider server-(a) the
> transaction could end up with a partial commit when a server-(a)
> participates in the transaction. Therefore I'm thinking of the above
> change so that the transaction manager can ensure that a partial
> commit doesn't happen in the 'required' mode. What do you think?
>
> >
> >>
> >> > 7- Added a pfree() and list_free_deep() in PreCommit_FdwXacts() to reclaim the
> >> > memory if fdw_part is removed from the list
> >>
> >> I think at the end of the transaction we free entries of
> >> FdwXactParticipants list and set FdwXactParticipants to NIL. Why do we
> >> need to do that in PreCommit_FdwXacts()?
> >
> >
> > Correct me if I am wrong, The fdw_part structures are created in TopMemoryContext
> > and if that fdw_part structure is removed from the list at pre_commit stage
> > (because we did 1-PC COMMIT on it) then it would leak memory.
>
> The fdw_part structures are created in TopTransactionContext so these
> are freed at the end of the transaction.
>
> >
> >>
> >> >
> >> > 8- The function FdwXactWaitToBeResolved() was bailing out as soon as it finds
> >> > (FdwXactParticipants == NIL). The problem with that was in the case of
> >> > "COMMIT/ROLLBACK PREPARED" we always get FdwXactParticipants = NIL and
> >> > effectively the foreign prepared transactions(if any) associated with locally
> >> > prepared transactions were never getting resolved automatically.
> >> >
> >> >
> >> > postgres=# BEGIN;
> >> > BEGIN
> >> > INSERT INTO test_local VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s1 VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > INSERT INTO test_foreign_s2 VALUES ( 2, 'TWO');
> >> > INSERT 0 1
> >> > postgres=*# PREPARE TRANSACTION 'local_prepared';
> >> > PREPARE TRANSACTION
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid | xid | serverid | userid | status | in_doubt | identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> > 12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
> >> > 12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> > -- Now commit the prepared transaction
> >> >
> >> > postgres=# COMMIT PREPARED 'local_prepared';
> >> >
> >> > COMMIT PREPARED
> >> >
> >> > --Foreign prepared transactions associated with 'local_prepared' not resolved
> >> >
> >> > postgres=#
> >> >
> >> > postgres=# select * from pg_foreign_xacts ;
> >> > dbid | xid | serverid | userid | status | in_doubt | identifier
> >> > -------+-----+----------+--------+----------+----------+----------------------------
> >> > 12929 | 515 | 16389 | 10 | prepared | f | fx_1339567411_515_16389_10
> >> > 12929 | 515 | 16391 | 10 | prepared | f | fx_1963224020_515_16391_10
> >> > (2 rows)
> >> >
> >> >
> >> > So to fix this, in the case of a two-phase transaction the function checks for the existence
> >> > of associated foreign prepared transactions before bailing out.
> >> >
> >>
> >> Good catch. But looking at your change, we should not accept the case
> >> where FdwXactParticipants == NIL but TwoPhaseExists(wait_xid) ==
> >> false.
> >>
> >> if (FdwXactParticipants == NIL)
> >> {
> >>     /*
> >>      * If we are here because of COMMIT/ROLLBACK PREPARED then the
> >>      * FdwXactParticipants list would be empty, so we need to
> >>      * check whether any foreign prepared transactions exist
> >>      * for this prepared transaction.
> >>      */
> >>     if (TwoPhaseExists(wait_xid))
> >>     {
> >>         List *foreign_trans = NIL;
> >>
> >>         foreign_trans = get_fdwxacts(MyDatabaseId,
> >>                                      wait_xid, InvalidOid, InvalidOid,
> >>                                      false, false, true);
> >>
> >>         if (foreign_trans == NIL)
> >>             return;
> >>         list_free(foreign_trans);
> >>     }
> >> }
> >>
> >
> > Sorry, my bad; it was a mistake on my part. We should just return from the function when
> > FdwXactParticipants == NIL but TwoPhaseExists(wait_xid) == false.
> >
> > if (TwoPhaseExists(wait_xid))
> > {
> >     List *foreign_trans = NIL;
> >
> >     foreign_trans = get_fdwxacts(MyDatabaseId, wait_xid, InvalidOid, InvalidOid,
> >                                  false, false, true);
> >
> >     if (foreign_trans == NIL)
> >         return;
> >     list_free(foreign_trans);
> > }
> > else
> >     return;
> >
> >>
> >> > 9- In function XlogReadFdwXactData() XLogBeginRead call was missing before XLogReadRecord()
> >> > that was causing the crash during recovery.
> >>
> >> Agreed.
> >>
> >> >
> >> > 10- incorporated set_ps_display() signature change.
> >>
> >> Thanks.
> >>
> >> Regarding other changes you did in v19 patch, I have some comments:
> >>
> >> 1.
> >> + ereport(LOG,
> >> +         (errmsg("trying to %s the foreign transaction associated with transaction %u on server %u",
> >> +                 fdwxact->status == FDWXACT_STATUS_COMMITTING ? "COMMIT" : "ABORT",
> >> +                 fdwxact->local_xid, fdwxact->serverid)));
> >> +
> >>
> >> Why do we need to emit LOG message in pg_resolve_foreign_xact() SQL function?
> >
> >
> > That change was not intended to get into the patch file. I had done it during testing to
> > quickly get info on which way the transaction is going to be resolved.
> >
> >>
> >> 2.
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> deleted file mode 120000
> >> index ce8c21880c..0000000000
> >> --- a/src/bin/pg_waldump/fdwxactdesc.c
> >> +++ /dev/null
> >> @@ -1 +0,0 @@
> >> -../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >> \ No newline at end of file
> >> diff --git a/src/bin/pg_waldump/fdwxactdesc.c b/src/bin/pg_waldump/fdwxactdesc.c
> >> new file mode 100644
> >> index 0000000000..ce8c21880c
> >> --- /dev/null
> >> +++ b/src/bin/pg_waldump/fdwxactdesc.c
> >> @@ -0,0 +1 @@
> >> +../../../src/backend/access/rmgrdesc/fdwxactdesc.c
> >>
> >> We need to remove src/bin/pg_waldump/fdwxactdesc.c from the patch.
> >
> >
> > Again sorry! that was an oversight on my part.
> >
> >>
> >> 3.
> >> --- a/doc/src/sgml/monitoring.sgml
> >> +++ b/doc/src/sgml/monitoring.sgml
> >> @@ -1526,14 +1526,14 @@ postgres 27093 0.0 0.0 30096 2752 ?
> >> Ss 11:34 0:00 postgres: ser
> >> <entry><literal>SafeSnapshot</literal></entry>
> >> <entry>Waiting for a snapshot for a <literal>READ ONLY
> >> DEFERRABLE</literal> transaction.</entry>
> >> </row>
> >> - <row>
> >> - <entry><literal>SyncRep</literal></entry>
> >> - <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> - </row>
> >> <row>
> >> <entry><literal>FdwXactResolution</literal></entry>
> >> <entry>Waiting for all foreign transaction participants to
> >> be resolved during atomic commit among foreign servers.</entry>
> >> </row>
> >> + <row>
> >> + <entry><literal>SyncRep</literal></entry>
> >> + <entry>Waiting for confirmation from remote server during
> >> synchronous replication.</entry>
> >> + </row>
> >> <row>
> >> <entry morerows="4"><literal>Timeout</literal></entry>
> >> <entry><literal>BaseBackupThrottle</literal></entry>
> >>
> >> We need to move the entry of FdwXactResolution to right before
> >> Hash/Batch/Allocating for alphabetical order.
> >
> >
> > Agreed!
> >>
> >>
> >> I've incorporated the changes of yours that I agreed with into my local branch and
> >> will incorporate the other changes after discussion. I'll also do more
> >> testing and self-review and will submit the latest version patch.
> >>
> >
> > Meanwhile, I found a couple more small issues. One is a missing break statement
> > in pgstat_get_wait_ipc(), and secondly fdwxact_relaunch_resolvers()
> > could return an uninitialized value.
> > I am attaching a small patch with these changes that can be applied on top of the existing
> > patches.
>
> Thank you for the patch!
>
> I'm updating the patches because the current behavior in the error case would
> not be good. For example, when an error occurs in the prepare phase,
> prepared transactions are left as in-doubt transactions, and these
> transactions are not handled by the resolver process. That means a
> user could need to resolve these transactions manually on every abort,
> which is not good. In the abort case, I think that prepared
> transactions can be resolved by the backend itself, rather than
> leaving them for the resolver. I'll submit the updated patch.
>
I've attached the latest version patch set which includes some changes
from the previous version:
* I've added regression tests that cover all types of FDW
implementations. There are three types of FDW: one that doesn't support
any transaction APIs, one that supports only the commit and rollback
APIs, and one that supports all APIs (prepare, commit and rollback).
src/test/module/test_fdwxact contains those FDW implementations for
the tests, and tests some cases where a transaction reads/writes data on
various types of foreign servers.
* Also, test_fdwxact has TAP tests that check failure cases. The test
FDW implementation has the ability to inject an error or panic into the
prepare or commit phase. Using it, the TAP tests check whether distributed
transactions can be committed or rolled back even in failure cases.
* When foreign_twophase_commit = 'required', the transaction commit
fails if the transaction modified data on even one server that doesn't
support the prepare API. Previously, we ignored servers that don't
support any transaction API, but now we check them in order to strictly
require all involved foreign servers to support all transaction APIs.
* The transaction resolver process resolves in-doubt transactions automatically.
* Incorporated comments from Muhammad Usama.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I am attaching a patch that I have generated on top of your V20.
On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
> Hi Sawada,
>
> I have just done some review and testing of the patches and have
> a couple of comments.
Thank you for reviewing!
>
> 1- IMHO the PREPARE TRANSACTION should always use 2PC even
> when the transaction has operated on a single foreign server regardless
> of foreign_twophase_commit setting, and throw an error otherwise when
> 2PC is not available on any of the data-modified servers.
>
> For example, consider the case
>
> BEGIN;
> INSERT INTO ft_2pc_1 VALUES(1);
> PREPARE TRANSACTION 'global_x1';
>
> Here since we are preparing the local transaction so we should also prepare
> the transaction on the foreign server even if the transaction has modified only
> one foreign table.
>
> What do you think?
Good catch, and I agree with you. The transaction should fail if it
opened a transaction on a server that doesn't support 2PC, regardless of
foreign_twophase_commit. And I think we should prepare the transaction
on a foreign server even if it didn't modify any data on it.
>
> Also without this change, the above test case produces an assertion failure
> with your patches.
>
> 2- when deciding if the two-phase commit is required or not in
> FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
> 2PC when we have at least one server capable of doing that.
>
> i.e
>
> For FOREIGN_TWOPHASE_COMMIT_PREFER case in
> checkForeignTwophaseCommitRequired() function I think
> the condition should be
>
> need_twophase_commit = (nserverstwophase >= 1);
> instead of
> need_twophase_commit = (nserverstwophase >= 2);
>
Hmm I might be missing your point but it seems to me that you want to
use two-phase commit even in the case where a transaction modified
data on only one server. Can't we commit distributed transaction
atomically even using one-phase commit in that case?
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> > Hi Sawada,
>> >
>> > I have just done some review and testing of the patches and have
>> > a couple of comments.
>>
>> Thank you for reviewing!
>>
>> >
>> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> > when the transaction has operated on a single foreign server regardless
>> > of foreign_twophase_commit setting, and throw an error otherwise when
>> > 2PC is not available on any of the data-modified servers.
>> >
>> > For example, consider the case
>> >
>> > BEGIN;
>> > INSERT INTO ft_2pc_1 VALUES(1);
>> > PREPARE TRANSACTION 'global_x1';
>> >
>> > Here since we are preparing the local transaction so we should also prepare
>> > the transaction on the foreign server even if the transaction has modified only
>> > one foreign table.
>> >
>> > What do you think?
>>
>> Good catch and I agree with you. The transaction should fail if it
>> opened a transaction on a 2pc-no-support server regardless of
>> foreign_twophase_commit. And I think we should prepare a transaction
>> on a foreign server even if it didn't modify any data on that.
>>
>> >
>> > Also without this change, the above test case produces an assertion failure
>> > with your patches.
>> >
>> > 2- when deciding if the two-phase commit is required or not in
>> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> > 2PC when we have at least one server capable of doing that.
>> >
>> > i.e
>> >
>> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> > checkForeignTwophaseCommitRequired() function I think
>> > the condition should be
>> >
>> > need_twophase_commit = (nserverstwophase >= 1);
>> > instead of
>> > need_twophase_commit = (nserverstwophase >= 2);
>> >
>>
>> Hmm I might be missing your point but it seems to me that you want to
>> use two-phase commit even in the case where a transaction modified
>> data on only one server. Can't we commit distributed transaction
>> atomically even using one-phase commit in that case?
>>
>
> I think you are confusing between nserverstwophase and nserverswritten.
>
> need_twophase_commit = (nserverstwophase >= 1) would mean
> use two-phase commit if at least one server exists in the list that is
> capable of doing 2PC
>
> For the case when the transaction modified data on only one server we
> already exits the function indicating no two-phase required
>
> if (nserverswritten <= 1)
> return false;
>
Thank you for your explanation. If the transaction modified two
servers that don't support 2PC and one server that supports 2PC, I
think we don't want to use 2PC even in the 'prefer' case. Because even
if we use 2PC in that case, it's still possible to hit the atomic
commit problem. For example, if we fail to commit a transaction after
committing other transactions on the servers that don't support 2PC,
we cannot roll back the already-committed transactions.
On the other hand, in the 'prefer' case, if the transaction also modified
the local data, we need to use 2PC even if it modified data on only
one foreign server that supports 2PC. But the current code doesn't
work correctly in that case for now. Probably we also need the following
change:
@@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)

     /* Did we modify the local non-temporary data? */
     if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
+    {
         nserverswritten++;
+        nserverstwophase++;
+    }
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 15 May 2020 at 19:06, Muhammad Usama <m.usama@gmail.com> wrote:
>
>
>
> On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote:
>> >
>> >
>> >
>> > On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >>
>> >> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote:
>> >> >
>> >> >
>> >> > Hi Sawada,
>> >> >
>> >> > I have just done some review and testing of the patches and have
>> >> > a couple of comments.
>> >>
>> >> Thank you for reviewing!
>> >>
>> >> >
>> >> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even
>> >> > when the transaction has operated on a single foreign server regardless
>> >> > of foreign_twophase_commit setting, and throw an error otherwise when
>> >> > 2PC is not available on any of the data-modified servers.
>> >> >
>> >> > For example, consider the case
>> >> >
>> >> > BEGIN;
>> >> > INSERT INTO ft_2pc_1 VALUES(1);
>> >> > PREPARE TRANSACTION 'global_x1';
>> >> >
>> >> > Here since we are preparing the local transaction so we should also prepare
>> >> > the transaction on the foreign server even if the transaction has modified only
>> >> > one foreign table.
>> >> >
>> >> > What do you think?
>> >>
>> >> Good catch and I agree with you. The transaction should fail if it
>> >> opened a transaction on a 2pc-no-support server regardless of
>> >> foreign_twophase_commit. And I think we should prepare a transaction
>> >> on a foreign server even if it didn't modify any data on that.
>> >>
>> >> >
>> >> > Also without this change, the above test case produces an assertion failure
>> >> > with your patches.
>> >> >
>> >> > 2- when deciding if the two-phase commit is required or not in
>> >> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use
>> >> > 2PC when we have at least one server capable of doing that.
>> >> >
>> >> > i.e
>> >> >
>> >> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in
>> >> > checkForeignTwophaseCommitRequired() function I think
>> >> > the condition should be
>> >> >
>> >> > need_twophase_commit = (nserverstwophase >= 1);
>> >> > instead of
>> >> > need_twophase_commit = (nserverstwophase >= 2);
>> >> >
>> >>
>> >> Hmm I might be missing your point but it seems to me that you want to
>> >> use two-phase commit even in the case where a transaction modified
>> >> data on only one server. Can't we commit distributed transaction
>> >> atomically even using one-phase commit in that case?
>> >>
>> >
>> > I think you are confusing between nserverstwophase and nserverswritten.
>> >
>> > need_twophase_commit = (nserverstwophase >= 1) would mean
>> > use two-phase commit if at least one server exists in the list that is
>> > capable of doing 2PC
>> >
>> > For the case when the transaction modified data on only one server we
>> > already exits the function indicating no two-phase required
>> >
>> > if (nserverswritten <= 1)
>> > return false;
>> >
>>
>> Thank you for your explanation. If the transaction modified two
>> servers that don't support 2pc and one server that supports 2pc I
>> think we don't want to use 2pc even in 'prefer' case. Because even if
>> we use 2pc in that case, it's still possible to have the atomic commit
>> problem. For example, if we failed to commit a transaction after
>> committing other transactions on the server that doesn't support 2pc
>> we cannot rollback the already-committed transaction.
>
>
> Yes, that is true, And I think the 'prefer' mode will always have a corner case
> no matter what. But the thing is we can reduce the probability of hitting
> an atomic commit problem by ensuring to use 2PC whenever possible.
>
> For instance as in your example scenario where a transaction modified
> two servers that don't support 2PC and one server that supports it. let us
> analyze both scenarios.
>
> If we use 2PC on the server that supports it, then the probability of hitting
> a problem would be 1/3 = 0.33, because there is only one corner-case
> scenario: failing to commit on the third server.
> As the first server (the 2PC-supporting one) would be using prepared
> transactions, there is no problem there. If the second server (non-2PC)
> fails to commit, there is still no problem, as we can roll back the prepared
> transaction on the first server. The only issue would happen when we fail
> to commit on the third server, because we have already committed
> on the second server and there is no way to undo that.
>
>
> Now consider the other possibility if we do not use the 2PC in that
> case (as you mentioned), then the probability of hitting the problem
> would be 2/3 = 0.66. because now commit failure on either second or
> third server will land us in an atomic-commit-problem.
>
> So, INMO using the 2PC whenever available with 'prefer' mode
> should be the way to go.
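
Usama's 1/3 vs. 2/3 figures can be reproduced with a toy enumeration. This is only a sketch under the assumptions stated in the thread: three written servers, only server 0 supports 2PC, exactly one commit fails, and a failed COMMIT PREPARED is retryable and therefore not counted as a lost transaction.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if the global outcome is atomic (all-or-nothing) when
 * server `failing` is the one whose commit fails.  Server 0 is the only
 * 2PC-capable server; servers 1 and 2 use plain one-phase commit.
 */
static bool
atomic_outcome(bool use_2pc, int failing)
{
    if (use_2pc)
    {
        /* PREPARE on server 0 first, then commit servers 1 and 2 in order */
        bool committed1 = false;
        for (int s = 1; s <= 2; s++)
        {
            if (s == failing)
            {
                /*
                 * Commit failed here: roll back the prepared transaction
                 * on server 0.  The outcome is atomic only if nothing
                 * was committed before this point.
                 */
                return !committed1;
            }
            if (s == 1)
                committed1 = true;
        }
        /* COMMIT PREPARED on server 0 is retryable; treat it as success */
        return true;
    }
    else
    {
        /* No 2PC anywhere: plain commits in order 0, 1, 2.  The outcome
         * is atomic only if the very first commit is the one that fails. */
        return failing == 0;
    }
}

/* Count non-atomic outcomes over the three possible failure positions */
static int
nonatomic_cases(bool use_2pc)
{
    int bad = 0;
    for (int failing = 0; failing < 3; failing++)
        if (!atomic_outcome(use_2pc, failing))
            bad++;
    return bad;
}
```

Under this model, using 2PC on the one capable server yields one bad case out of three, and skipping 2PC yields two, matching the probabilities argued above.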
My understanding of 'prefer' mode is that even if a distributed
transaction modified data on several types of servers, we can ensure
data consistency only among the local server and the foreign servers
that support 2pc. It doesn't ensure anything for other servers that
don't support 2pc. Therefore we use 2pc if the transaction modifies
data on two or more servers among the local node and the foreign
servers that support 2pc.
I understand your argument that using 2pc in that case decreases the
possibility of hitting a problem, but one point we need to consider is
that 2pc has a very high cost. I think most users basically don't want
to use 2pc unless it is necessary. Please note that it might not work
as the user expected, because users cannot specify the commit order and
particular servers might be unstable. I'm not sure that users want to
pay high costs under such conditions. If we want to decrease that
possibility by using 2pc as much as possible, I think it can be yet
another mode so that the user can choose the trade-off.
>
>>
>> On the other hand, in 'prefer' case, if the transaction also modified
>> the local data, we need to use 2pc even if it modified data on only
>> one foreign server that supports 2pc. But the current code doesn't
>> work fine in that case for now. Probably we also need the following
>> change:
>>
>> @@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void)
>>
>> /* Did we modify the local non-temporary data? */
>> if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
>> + {
>> nserverswritten++;
>> + nserverstwophase++;
>> + }
>>
>
> I agree with the part that if the transaction also modifies the local data
> then 2PC should be used.
> Though the change you suggested [+ nserverstwophase++;]
> would serve the purpose and deliver the same results, I think a
> better way would be to change the need_twophase_commit condition for
> prefer mode.
>
>
> * In 'prefer' case, we prepare transactions on only servers that
> * capable of two-phase commit.
> */
> - need_twophase_commit = (nserverstwophase >= 2);
> + need_twophase_commit = (nserverstwophase >= 1);
> }
>
>
> The reason I am saying that is that currently we do not use 2PC on the local server
> in the case of distributed transactions, so we should also not count the local server
> as one of the servers that would be performing 2PC.
> Also I feel the change need_twophase_commit = (nserverstwophase >= 1)
> looks more in line with the definition of our 'prefer' mode algorithm.
>
> Do you see an issue with this change?
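
The two proposed tweaks can be compared with a small model of the decision function. The names mirror the patch, but the logic here is a sketch reconstructed from the diffs quoted in this thread, not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model of checkForeignTwophaseCommitRequired() in 'prefer'
 * mode (reconstructed from this thread; an illustration only):
 *
 *   nserverswritten  - foreign servers the transaction wrote to
 *   nserverstwophase - written foreign servers that support 2PC
 *   local_write      - transaction modified local non-temporary data
 */
static bool
need_twophase_commit_prefer(int nserverswritten, int nserverstwophase,
                            bool local_write)
{
    if (local_write)
    {
        nserverswritten++;
        nserverstwophase++;     /* Sawada's fix: the local node can do 2PC */
    }

    /* A single written server commits atomically with one-phase commit */
    if (nserverswritten <= 1)
        return false;

    /*
     * Sawada's condition: require at least two 2PC-capable participants.
     * Usama's alternative would instead use (nserverstwophase >= 1),
     * triggering 2PC whenever any capable server is involved.
     */
    return nserverstwophase >= 2;
}
```

With this model, the disputed case (three written foreign servers, one of them 2PC-capable, no local write) uses 2PC under Usama's condition but not under Sawada's.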
I think that with my change we will use 2pc in the case where a
transaction modified data on the local node and one server that
supports 2pc. But with your change, we will use 2pc in more cases than
just the one where a transaction modifies the local node and one
2pc-capable server. This would fit the definition of 'prefer' you
described, but it's still unclear to me whether it's better to make
'prefer' mode behave that way, given that we have three values:
'required', 'prefer' and 'disabled'.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 16 May 2020 at 00:54, Muhammad Usama <m.usama@gmail.com> wrote: > > > > On Fri, May 15, 2020 at 7:52 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >> >> On Fri, 15 May 2020 at 19:06, Muhammad Usama <m.usama@gmail.com> wrote: >> > >> > >> > >> > On Fri, May 15, 2020 at 9:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >> >> >> >> On Fri, 15 May 2020 at 13:26, Muhammad Usama <m.usama@gmail.com> wrote: >> >> > >> >> > >> >> > >> >> > On Fri, May 15, 2020 at 7:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >> >> >> >> >> >> On Fri, 15 May 2020 at 03:08, Muhammad Usama <m.usama@gmail.com> wrote: >> >> >> > >> >> >> > >> >> >> > Hi Sawada, >> >> >> > >> >> >> > I have just done some review and testing of the patches and have >> >> >> > a couple of comments. >> >> >> >> >> >> Thank you for reviewing! >> >> >> >> >> >> > >> >> >> > 1- IMHO the PREPARE TRANSACTION should always use 2PC even >> >> >> > when the transaction has operated on a single foreign server regardless >> >> >> > of foreign_twophase_commit setting, and throw an error otherwise when >> >> >> > 2PC is not available on any of the data-modified servers. >> >> >> > >> >> >> > For example, consider the case >> >> >> > >> >> >> > BEGIN; >> >> >> > INSERT INTO ft_2pc_1 VALUES(1); >> >> >> > PREPARE TRANSACTION 'global_x1'; >> >> >> > >> >> >> > Here since we are preparing the local transaction so we should also prepare >> >> >> > the transaction on the foreign server even if the transaction has modified only >> >> >> > one foreign table. >> >> >> > >> >> >> > What do you think? >> >> >> >> >> >> Good catch and I agree with you. The transaction should fail if it >> >> >> opened a transaction on a 2pc-no-support server regardless of >> >> >> foreign_twophase_commit. And I think we should prepare a transaction >> >> >> on a foreign server even if it didn't modify any data on that. 
>> >> >> >> >> >> > >> >> >> > Also without this change, the above test case produces an assertion failure >> >> >> > with your patches. >> >> >> > >> >> >> > 2- when deciding if the two-phase commit is required or not in >> >> >> > FOREIGN_TWOPHASE_COMMIT_PREFER mode we should use >> >> >> > 2PC when we have at least one server capable of doing that. >> >> >> > >> >> >> > i.e >> >> >> > >> >> >> > For FOREIGN_TWOPHASE_COMMIT_PREFER case in >> >> >> > checkForeignTwophaseCommitRequired() function I think >> >> >> > the condition should be >> >> >> > >> >> >> > need_twophase_commit = (nserverstwophase >= 1); >> >> >> > instead of >> >> >> > need_twophase_commit = (nserverstwophase >= 2); >> >> >> > >> >> >> >> >> >> Hmm I might be missing your point but it seems to me that you want to >> >> >> use two-phase commit even in the case where a transaction modified >> >> >> data on only one server. Can't we commit distributed transaction >> >> >> atomically even using one-phase commit in that case? >> >> >> >> >> > >> >> > I think you are confusing between nserverstwophase and nserverswritten. >> >> > >> >> > need_twophase_commit = (nserverstwophase >= 1) would mean >> >> > use two-phase commit if at least one server exists in the list that is >> >> > capable of doing 2PC >> >> > >> >> > For the case when the transaction modified data on only one server we >> >> > already exits the function indicating no two-phase required >> >> > >> >> > if (nserverswritten <= 1) >> >> > return false; >> >> > >> >> >> >> Thank you for your explanation. If the transaction modified two >> >> servers that don't' support 2pc and one server that supports 2pc I >> >> think we don't want to use 2pc even in 'prefer' case. Because even if >> >> we use 2pc in that case, it's still possible to have the atomic commit >> >> problem. 
For example, if we failed to commit a transaction after >> >> committing other transactions on the server that doesn't support 2pc >> >> we cannot rollback the already-committed transaction. >> > >> > >> > Yes, that is true, And I think the 'prefer' mode will always have a corner case >> > no matter what. But the thing is we can reduce the probability of hitting >> > an atomic commit problem by ensuring to use 2PC whenever possible. >> > >> > For instance as in your example scenario where a transaction modified >> > two servers that don't support 2PC and one server that supports it. let us >> > analyze both scenarios. >> > >> > If we use 2PC on the server that supports it then the probability of hitting >> > a problem would be 1/3 = 0.33. because there is only one corner case >> > scenario in that case. which would be if we fail to commit the third server >> > As the first server (2PC supported one) would be using prepared >> > transactions so no problem there. The second server (NON-2PC support) >> > if failed to commit then, still no problem as we can rollback the prepared >> > transaction on the first server. The only issue would happen when we fail >> > to commit on the third server because we have already committed >> > on the second server and there is no way to undo that. >> > >> > >> > Now consider the other possibility if we do not use the 2PC in that >> > case (as you mentioned), then the probability of hitting the problem >> > would be 2/3 = 0.66. because now commit failure on either second or >> > third server will land us in an atomic-commit-problem. >> > >> > So, INMO using the 2PC whenever available with 'prefer' mode >> > should be the way to go. >> >> My understanding of 'prefer' mode is that even if a distributed >> transaction modified data on several types of server we can ensure to >> keep data consistent among only the local server and foreign servers >> that support 2pc. It doesn't ensure anything for other servers that >> don't support 2pc. 
Therefore we use 2pc if the transaction modifies >> data on two or more servers that either the local node or servers that >> support 2pc. >> >> I understand your argument that using 2pc in that case the possibility >> of hitting a problem can decrease but one point we need to consider is >> 2pc is very high cost. I think basically most users don’t want to use >> 2pc as much as possible. Please note that it might not work as the >> user expected because users cannot specify the commit order and >> particular servers might be unstable. I'm not sure that users want to >> pay high costs under such conditions. If we want to decrease that >> possibility by using 2pc as much as possible, I think it can be yet >> another mode so that the user can choose the trade-off. >> >> > >> >> >> >> On the other hand, in 'prefer' case, if the transaction also modified >> >> the local data, we need to use 2pc even if it modified data on only >> >> one foreign server that supports 2pc. But the current code doesn't >> >> work fine in that case for now. Probably we also need the following >> >> change: >> >> >> >> @@ -540,7 +540,10 @@ checkForeignTwophaseCommitRequired(void) >> >> >> >> /* Did we modify the local non-temporary data? */ >> >> if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0) >> >> + { >> >> nserverswritten++; >> >> + nserverstwophase++; >> >> + } >> >> >> > >> > I agree with the part that if the transaction also modifies the local data >> > then the 2PC should be used. >> > Though the change you suggested [+ nserverstwophase++;] >> > would server the purpose and deliver the same results but I think a >> > better way would be to change need_twophase_commit condition for >> > prefer mode. >> > >> > >> > * In 'prefer' case, we prepare transactions on only servers that >> > * capable of two-phase commit. >> > */ >> > - need_twophase_commit = (nserverstwophase >= 2); >> > + need_twophase_commit = (nserverstwophase >= 1); >> > } >> > >> > >> > The reason I am saying that is. 
Currently, we do not use 2PC on the local server >> > in case of distributed transactions, so we should also not count the local server >> > as one (servers that would be performing the 2PC). >> > Also I feel the change need_twophase_commit = (nserverstwophase >= 1) >> > looks more in line with the definition of our 'prefer' mode algorithm. >> > >> > Do you see an issue with this change? >> >> I think that with my change we will use 2pc in the case where a >> transaction modified data on the local node and one server that >> supports 2pc. But with your change, we will use 2pc in more cases, in >> addition to the case where a transaction modifies the local and one >> 2pc-support server. This would fit the definition of 'prefer' you >> described but it's still unclear to me that it's better to make >> 'prefer' mode behave so if we have three values: 'required', 'prefer' >> and 'disabled'. >> > > Thanks for the detailed explanation, now I have a better understanding of the > reasons why we were going for a different solution to the problem. > You are right my understanding of 'prefer' mode is we must use 2PC as much > as possible, and reason for that was the world prefer as per my understanding > means "it's more desirable/better to use than another or others" > So the way I understood the FOREIGN_TWOPHASE_COMMIT_PREFER > was that we would use 2PC in the maximum possible of cases, and the user > would already have the expectation that 2PC is more expensive than 1PC. > I think that the current three values are useful for users. The ‘required’ mode is used when users want to ensure all writes involved with the transaction are committed atomically. That being said, as some FDW plugin might not support the prepare API we cannot force users to use this mode all the time when using atomic commit. Therefore ‘prefer’ mode would be useful for this case. Both modes use 2pc only when it's required for atomic commit. 
So what do you think of my idea of adding the behavior you proposed as
another new mode? As it's better to keep the first version as simple as
possible, it might not be added to the first version, but this behavior
might be useful in some cases.

I've attached a new version patch that incorporates some bug fixes
reported by Muhammad. Please review them.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, May 19, 2020 at 12:33 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>
> I think that the current three values are useful for users. The
> ‘required’ mode is used when users want to ensure all writes involved
> with the transaction are committed atomically. That being said, as
> some FDW plugin might not support the prepare API we cannot force
> users to use this mode all the time when using atomic commit.
> Therefore ‘prefer’ mode would be useful for this case. Both modes use
> 2pc only when it's required for atomic commit.
>
> So what do you think my idea that adding the behavior you proposed as
> another new mode? As it’s better to keep the first version simple as
> much as possible
>

If the intention is to keep the first version simple, then why do we
want to support any mode other than 'required'? I think it will limit
its usage for the cases where 2PC can be used only when all FDWs
involved support Prepare API but if that helps to keep the design and
patch simpler then why not just do that for the first version and then
extend it later. OTOH, if you think it will be really useful to keep
other modes, then also we could try to keep those in separate patches
to facilitate the review and discussion of the core feature.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, May 19, 2020 at 12:33 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > I think that the current three values are useful for users. The
> > ‘required’ mode is used when users want to ensure all writes involved
> > with the transaction are committed atomically. That being said, as
> > some FDW plugin might not support the prepare API we cannot force
> > users to use this mode all the time when using atomic commit.
> > Therefore ‘prefer’ mode would be useful for this case. Both modes use
> > 2pc only when it's required for atomic commit.
> >
> > So what do you think my idea that adding the behavior you proposed as
> > another new mode? As it’s better to keep the first version simple as
> > much as possible
> >
> If the intention is to keep the first version simple, then why do we
> want to support any mode other than 'required'? I think it will limit
> its usage for the cases where 2PC can be used only when all FDWs
> involved support Prepare API but if that helps to keep the design and
> patch simpler then why not just do that for the first version and then
> extend it later. OTOH, if you think it will be really useful to keep
> other modes, then also we could try to keep those in separate patches
> to facilitate the review and discussion of the core feature.

‘disabled’ is the fundamental mode. We also need 'disabled' mode,
otherwise existing FDW won't work. I was concerned that many FDW
plugins don't implement FDW transaction APIs yet when users start
using this feature. But it seems to be a good idea to move 'prefer'
mode to a separate patch while leaving 'required'. I'll do that in the
next version patch.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > If the intention is to keep the first version simple, then why do we > > want to support any mode other than 'required'? I think it will limit > > its usage for the cases where 2PC can be used only when all FDWs > > involved support Prepare API but if that helps to keep the design and > > patch simpler then why not just do that for the first version and then > > extend it later. OTOH, if you think it will be really useful to keep > > other modes, then also we could try to keep those in separate patches > > to facilitate the review and discussion of the core feature. > > ‘disabled’ is the fundamental mode. We also need 'disabled' mode, > otherwise existing FDW won't work. > IIUC, if foreign_twophase_commit is 'disabled', we don't use a two-phase protocol to commit distributed transactions, right? So, do we check this at the time of Prepare or Commit whether we need to use a two-phase protocol? I think this should be checked at prepare time. + <para> + This parameter can be changed at any time; the behavior for any one + transaction is determined by the setting in effect when it commits. + </para> This is written w.r.t foreign_twophase_commit. If one changes this between prepare and commit, will it have any impact? > I was concerned that many FDW > plugins don't implement FDW transaction APIs yet when users start > using this feature. But it seems to be a good idea to move 'prefer' > mode to a separate patch while leaving 'required'. I'll do that in the > next version patch. > Okay, thanks. Please, see if you can separate out the documentation for that as well. Few other comments on v21-0003-Documentation-update: ---------------------------------------------------- 1. 
+ <entry></entry> + <entry> + Numeric transaction identifier with that this foreign transaction + associates + </entry> /with that this/with which this 2. + <entry> + The OID of the foreign server on that the foreign transaction is prepared + </entry> /on that the/on which the 3. + <entry><structfield>status</structfield></entry> + <entry><type>text</type></entry> + <entry></entry> + <entry> + Status of foreign transaction. Possible values are: + <itemizedlist> + <listitem> + <para> + <literal>initial</literal> : Initial status. + </para> What exactly "Initial status" means? 4. + <entry><structfield>in_doubt</structfield></entry> + <entry><type>boolean</type></entry> + <entry></entry> + <entry> + If <literal>true</literal> this foreign transaction is in-doubt status and + needs to be resolved by calling <function>pg_resolve_fdwxact</function> + function. + </entry> It would be better if you can add an additional sentence to say when and or how can foreign transactions reach in-doubt state. 5. If <literal>N</literal> local transactions each + across <literal>K</literal> foreign server this value need to be set This part of the sentence can be improved by saying something like: "If a user expects N local transactions and each of those involves K foreign servers, this value..". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Wed, 3 Jun 2020 at 14:50, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > If the intention is to keep the first version simple, then why do we > > > want to support any mode other than 'required'? I think it will limit > > > its usage for the cases where 2PC can be used only when all FDWs > > > involved support Prepare API but if that helps to keep the design and > > > patch simpler then why not just do that for the first version and then > > > extend it later. OTOH, if you think it will be really useful to keep > > > other modes, then also we could try to keep those in separate patches > > > to facilitate the review and discussion of the core feature. > > > > ‘disabled’ is the fundamental mode. Oops, I wanted to say 'required' is the fundamental mode. > > We also need 'disabled' mode, > > otherwise existing FDW won't work. > > > > IIUC, if foreign_twophase_commit is 'disabled', we don't use a > two-phase protocol to commit distributed transactions, right? So, do > we check this at the time of Prepare or Commit whether we need to use > a two-phase protocol? I think this should be checked at prepare time. When a client executes COMMIT to a distributed transaction, 2pc is automatically, transparently used. In ‘required’ case, all involved (and modified) foreign server needs to support 2pc. So if a distributed transaction modifies data on a foreign server connected via an existing FDW which doesn’t support 2pc, the transaction cannot proceed commit, fails at pre-commit phase. So there should be two modes: ‘disabled’ and ‘required’, and should be ‘disabled’ by default. > > + <para> > + This parameter can be changed at any time; the behavior for any one > + transaction is determined by the setting in effect when it commits. 
> + </para> > > This is written w.r.t foreign_twophase_commit. If one changes this > between prepare and commit, will it have any impact? Since the distributed transaction commit automatically uses 2pc when executing COMMIT, it's not possible to change foreign_twophase_commit between prepare and commit. So I'd like to explain the case where a user executes PREPARE and then COMMIT PREPARED while changing foreign_twophase_commit. PREPARE can run only when foreign_twophase_commit is 'required' (or 'prefer') and all foreign servers involved with the transaction support 2pc. We prepare all foreign transactions no matter what the number of servers and modified or not. If either foreign_twophase_commit is 'disabled' or the transaction modifies data on a foreign server that doesn't support 2pc, it raises an error. At COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs to be set to 'required'. It raises an error if the distributed transaction has a foreign transaction and foreign_twophase_commit is 'disabled'. > > > I was concerned that many FDW > > plugins don't implement FDW transaction APIs yet when users start > > using this feature. But it seems to be a good idea to move 'prefer' > > mode to a separate patch while leaving 'required'. I'll do that in the > > next version patch. > > > > Okay, thanks. Please, see if you can separate out the documentation > for that as well. > > Few other comments on v21-0003-Documentation-update: > ---------------------------------------------------- > 1. > + <entry></entry> > + <entry> > + Numeric transaction identifier with that this foreign transaction > + associates > + </entry> > > /with that this/with which this > > 2. > + <entry> > + The OID of the foreign server on that the foreign transaction > is prepared > + </entry> > > /on that the/on which the > > 3. > + <entry><structfield>status</structfield></entry> > + <entry><type>text</type></entry> > + <entry></entry> > + <entry> > + Status of foreign transaction. 
Possible values are: > + <itemizedlist> > + <listitem> > + <para> > + <literal>initial</literal> : Initial status. > + </para> > > What exactly "Initial status" means? This part is out-of-date. Fixed. > > 4. > + <entry><structfield>in_doubt</structfield></entry> > + <entry><type>boolean</type></entry> > + <entry></entry> > + <entry> > + If <literal>true</literal> this foreign transaction is > in-doubt status and > + needs to be resolved by calling <function>pg_resolve_fdwxact</function> > + function. > + </entry> > > It would be better if you can add an additional sentence to say when > and or how can foreign transactions reach in-doubt state. > > 5. > If <literal>N</literal> local transactions each > + across <literal>K</literal> foreign server this value need to be set > > This part of the sentence can be improved by saying something like: > "If a user expects N local transactions and each of those involves K > foreign servers, this value..". Thanks. I've incorporated all your comments. I've attached the new version patch set. 0006 is a separate patch which introduces 'prefer' mode to foreign_twophase_commit. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- v22-0006-Add-prefer-mode-to-foreign_twophase_commit.patch
- v22-0005-Add-regression-tests-for-foreign-twophase-commit.patch
- v22-0004-postgres_fdw-supports-atomic-commit-APIs.patch
- v22-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
- v22-0003-Documentation-update.patch
- v22-0001-Keep-track-of-writing-on-non-temporary-relation.patch
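
The PREPARE-time rule Sawada describes above (all foreign transactions are prepared regardless of how many servers were involved or modified, and an error is raised if the GUC disallows 2PC or any involved server lacks it) could be sketched like this. The enum values echo the thread's mode names; the function and parameter names are hypothetical, not the actual patch API.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    FOREIGN_TWOPHASE_COMMIT_DISABLED,
    FOREIGN_TWOPHASE_COMMIT_PREFER,
    FOREIGN_TWOPHASE_COMMIT_REQUIRED
} ForeignTwophaseCommitLevel;

/*
 * Sketch of the check performed when the user runs PREPARE TRANSACTION
 * on a distributed transaction.  Returns false where the server would
 * raise an error: either foreign_twophase_commit is 'disabled', or a
 * 2PC-incapable foreign server is involved with the transaction.
 */
static bool
can_prepare_foreign_transaction(ForeignTwophaseCommitLevel level,
                                const bool *server_supports_2pc,
                                int nservers)
{
    if (level == FOREIGN_TWOPHASE_COMMIT_DISABLED)
        return false;

    /* every involved server must support 2PC, modified or not */
    for (int i = 0; i < nservers; i++)
        if (!server_supports_2pc[i])
            return false;

    return true;
}
```

The same check would apply symmetrically at COMMIT (or ROLLBACK) PREPARED, which is why the thread concludes that foreign_twophase_commit must stay enabled across both steps.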
On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > + <para> > > + This parameter can be changed at any time; the behavior for any one > > + transaction is determined by the setting in effect when it commits. > > + </para> > > > > This is written w.r.t foreign_twophase_commit. If one changes this > > between prepare and commit, will it have any impact? > > Since the distributed transaction commit automatically uses 2pc when > executing COMMIT, it's not possible to change foreign_twophase_commit > between prepare and commit. So I'd like to explain the case where a > user executes PREPARE and then COMMIT PREPARED while changing > foreign_twophase_commit. > > PREPARE can run only when foreign_twophase_commit is 'required' (or > 'prefer') and all foreign servers involved with the transaction > support 2pc. We prepare all foreign transactions no matter what the > number of servers and modified or not. If either > foreign_twophase_commit is 'disabled' or the transaction modifies data > on a foreign server that doesn't support 2pc, it raises an error. At > COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs > to be set to 'required'. It raises an error if the distributed > transaction has a foreign transaction and foreign_twophase_commit is > 'disabled'. > So, IIUC, it will raise an error if foreign_twophase_commit is 'disabled' (or one of the foreign server involved doesn't support 2PC) and the error can be raised both when user issues PREPARE or COMMIT (or ROLLBACK) PREPARED. If so, isn't it strange that we raise such an error after PREPARE? What kind of use-case required this? > > > > > 4. 
> > + <entry><structfield>in_doubt</structfield></entry> > > + <entry><type>boolean</type></entry> > > + <entry></entry> > > + <entry> > > + If <literal>true</literal> this foreign transaction is > > in-doubt status and > > + needs to be resolved by calling <function>pg_resolve_fdwxact</function> > > + function. > > + </entry> > > > > It would be better if you can add an additional sentence to say when > > and or how can foreign transactions reach in-doubt state. > > + If <literal>true</literal> this foreign transaction is in-doubt status. + A foreign transaction becomes in-doubt status when user canceled the + query during transaction commit or the server crashed during transaction + commit. Can we reword the second sentence as: "A foreign transaction can have this status when the user has cancelled the statement or the server crashes during transaction commit."? I have another question about this field, why can't it be one of the status ('preparing', 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a separate field? Also, isn't it more suitable to name 'status' field as 'state' because these appear to be more like different states of transaction? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 5, 2020 at 3:16 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Thu, 4 Jun 2020 at 12:46, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > + <para> > > > + This parameter can be changed at any time; the behavior for any one > > > + transaction is determined by the setting in effect when it commits. > > > + </para> > > > > > > This is written w.r.t foreign_twophase_commit. If one changes this > > > between prepare and commit, will it have any impact? > > > > Since the distributed transaction commit automatically uses 2pc when > > executing COMMIT, it's not possible to change foreign_twophase_commit > > between prepare and commit. So I'd like to explain the case where a > > user executes PREPARE and then COMMIT PREPARED while changing > > foreign_twophase_commit. > > > > PREPARE can run only when foreign_twophase_commit is 'required' (or > > 'prefer') and all foreign servers involved with the transaction > > support 2pc. We prepare all foreign transactions no matter what the > > number of servers and modified or not. If either > > foreign_twophase_commit is 'disabled' or the transaction modifies data > > on a foreign server that doesn't support 2pc, it raises an error. At > > COMMIT (or ROLLBACK) PREPARED, similarly foreign_twophase_commit needs > > to be set to 'required'. It raises an error if the distributed > > transaction has a foreign transaction and foreign_twophase_commit is > > 'disabled'. > > > > So, IIUC, it will raise an error if foreign_twophase_commit is > 'disabled' (or one of the foreign server involved doesn't support 2PC) > and the error can be raised both when user issues PREPARE or COMMIT > (or ROLLBACK) PREPARED. If so, isn't it strange that we raise such an > error after PREPARE? What kind of use-case required this? 
> I don’t concrete use-case but the reason why it raises an error when a user setting foreign_twophase_commit to 'disabled' executes COMMIT (or ROLLBACK) PREPARED within the transaction involving at least one foreign server is that I wanted to make it behaves in a similar way of COMMIT case. I mean, if a user executes just COMMIT, the distributed transaction is committed in two phases but the value of foreign_twophase_commit is not changed during these two phases. So I wanted to require user to set foreign_twophase_commit to ‘required’ both when executing PREPARE and executing COMMIT (or ROLLBACK) PREPARED. Implementation also can become simple because we can assume that foreign_twophase_commit is always enabled when a transaction requires foreign transaction preparation and resolution. > > > > > > > > 4. > > > + <entry><structfield>in_doubt</structfield></entry> > > > + <entry><type>boolean</type></entry> > > > + <entry></entry> > > > + <entry> > > > + If <literal>true</literal> this foreign transaction is > > > in-doubt status and > > > + needs to be resolved by calling <function>pg_resolve_fdwxact</function> > > > + function. > > > + </entry> > > > > > > It would be better if you can add an additional sentence to say when > > > and or how can foreign transactions reach in-doubt state. > > > > > + If <literal>true</literal> this foreign transaction is in-doubt status. > + A foreign transaction becomes in-doubt status when user canceled the > + query during transaction commit or the server crashed during transaction > + commit. > > Can we reword the second sentence as: "A foreign transaction can have > this status when the user has cancelled the statement or the server > crashes during transaction commit."? Agreed. Updated in my local branch. > I have another question about > this field, why can't it be one of the status ('preparing', > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a > separate field? 
Because I'm using the in-doubt field also for checking whether the foreign transaction entry can be resolved manually, e.g. by pg_resolve_foreign_xact(). For instance, a foreign transaction with status = 'prepared' and in-doubt = 'true' can be resolved either by a foreign transaction resolver or by pg_resolve_foreign_xact(). When a user executes pg_resolve_foreign_xact() against the foreign transaction, it sets status = 'committing' (or 'rollbacking') by checking the transaction status in clog. The user might cancel pg_resolve_foreign_xact() during resolution. In this case, the foreign transaction still has status = 'committing' and in-doubt = 'true'. Then if a foreign transaction resolver process processes the foreign transaction, it can commit it without looking at clog. > Also, isn't it more suitable to name 'status' field > as 'state' because these appear to be more like different states of > transaction? Agreed. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
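The state handling described above can be sketched as follows. This is a hypothetical illustration (the names `FdwXact`, `CLOG`, and the state strings are stand-ins, not the patch's actual code) of why a separate in-doubt flag lets a cancelled manual resolution be finished later without a second clog lookup:

```python
# Toy model of a foreign transaction entry: 'in_doubt' is kept separate from
# 'state', so once a direction ('committing'/'rollbacking') has been recorded
# from a clog lookup, a later resolver pass needs no clog access.

CLOG = {}  # stand-in for the local commit log: xid -> 'committed' | 'aborted'

class FdwXact:
    def __init__(self, xid):
        self.xid = xid
        self.state = 'prepared'
        self.in_doubt = True  # e.g. the server crashed during commit

    def start_resolution(self):
        # pg_resolve_foreign_xact() path: consult the clog to pick a direction.
        if self.state == 'prepared':
            outcome = CLOG[self.xid]
            self.state = 'committing' if outcome == 'committed' else 'rollbacking'

    def finish_resolution(self):
        # Resolver path: the direction is already recorded, no clog lookup needed.
        assert self.state in ('committing', 'rollbacking')
        self.state = 'committed' if self.state == 'committing' else 'rolled back'
        self.in_doubt = False
        return self.state

CLOG[42] = 'committed'
fx = FdwXact(42)
fx.start_resolution()          # user calls pg_resolve_foreign_xact()
# ... suppose the user cancels here: the entry is left mid-resolution ...
assert (fx.state, fx.in_doubt) == ('committing', True)
fx.finish_resolution()         # a resolver finishes without looking at clog
assert (fx.state, fx.in_doubt) == ('committed', False)
```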
On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I have another question about > > this field, why can't it be one of the status ('preparing', > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a > > separate field? > > Because I'm using in-doubt field also for checking if the foreign > transaction entry can also be resolved manually, i.g. > pg_resolve_foreign_xact(). For instance, a foreign transaction which > status = 'prepared' and in-doubt = 'true' can be resolved either > foreign transaction resolver or pg_resolve_foreign_xact(). When a user > execute pg_resolve_foreign_xact() against the foreign transaction, it > sets status = 'committing' (or 'rollbacking') by checking transaction > status in clog. The user might cancel pg_resolve_foreign_xact() during > resolution. In this case, the foreign transaction is still status = > 'committing' and in-doubt = 'true'. Then if a foreign transaction > resolver process processes the foreign transaction, it can commit it > without clog looking. > I think this is a corner case and it is better to simplify the state recording of foreign transactions than to save a CLOG lookup. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > I have another question about > > > this field, why can't it be one of the status ('preparing', > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a > > > separate field? > > > > Because I'm using in-doubt field also for checking if the foreign > > transaction entry can also be resolved manually, i.g. > > pg_resolve_foreign_xact(). For instance, a foreign transaction which > > status = 'prepared' and in-doubt = 'true' can be resolved either > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user > > execute pg_resolve_foreign_xact() against the foreign transaction, it > > sets status = 'committing' (or 'rollbacking') by checking transaction > > status in clog. The user might cancel pg_resolve_foreign_xact() during > > resolution. In this case, the foreign transaction is still status = > > 'committing' and in-doubt = 'true'. Then if a foreign transaction > > resolver process processes the foreign transaction, it can commit it > > without clog looking. > > > > I think this is a corner case and it is better to simplify the state > recording of foreign transactions then to save a CLOG lookup. > The main usage of the in-doubt flag is to distinguish between in-doubt transactions and other transactions that have a waiting backend (I call these on-line transactions). If one foreign server stays down for a long time after a crash during a distributed transaction commit, the foreign transaction resolver tries to resolve the foreign transaction but fails because the foreign server doesn’t respond. We’d like to avoid the situation where a resolver process always picks up that foreign transaction while other on-line transactions waiting to be resolved cannot move forward.
Therefore, a resolver process prioritizes on-line transactions. Once the shmem queue holding on-line transactions becomes empty, a resolver process looks at the array of foreign transaction states to find in-doubt transactions to resolve. I think we should not process in-doubt transactions and on-line transactions in the same way. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
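The prioritization described above could be sketched roughly like this. The names (`pick_next`, the dict-based entries) are illustrative stand-ins for the shmem queue and the foreign-transaction array, not the patch's actual structures:

```python
from collections import deque

# A resolver first drains the queue of on-line transactions (those with a
# backend waiting to be woken), and only when that queue is empty does it
# scan the array for in-doubt entries left over from crashes/cancellations.

def pick_next(online_queue, fdwxact_array):
    if online_queue:
        return online_queue.popleft()      # on-line work always wins
    for entry in fdwxact_array:
        if entry.get('in_doubt'):
            return entry                   # fall back to in-doubt entries
    return None

online = deque([{'xid': 10, 'in_doubt': False}])
array = [{'xid': 7, 'in_doubt': True}, {'xid': 10, 'in_doubt': False}]
assert pick_next(online, array)['xid'] == 10   # the on-line transaction first
assert pick_next(online, array)['xid'] == 7    # then the in-doubt one
```

This way a foreign server that is down for a long time (whose in-doubt entries keep failing to resolve) cannot starve transactions with waiting backends.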
On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > I have another question about > > > > this field, why can't it be one of the status ('preparing', > > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a > > > > separate field? > > > > > > Because I'm using in-doubt field also for checking if the foreign > > > transaction entry can also be resolved manually, i.g. > > > pg_resolve_foreign_xact(). For instance, a foreign transaction which > > > status = 'prepared' and in-doubt = 'true' can be resolved either > > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user > > > execute pg_resolve_foreign_xact() against the foreign transaction, it > > > sets status = 'committing' (or 'rollbacking') by checking transaction > > > status in clog. The user might cancel pg_resolve_foreign_xact() during > > > resolution. In this case, the foreign transaction is still status = > > > 'committing' and in-doubt = 'true'. Then if a foreign transaction > > > resolver process processes the foreign transaction, it can commit it > > > without clog looking. > > > > > > > I think this is a corner case and it is better to simplify the state > > recording of foreign transactions then to save a CLOG lookup. > > > > The main usage of in-doubt flag is to distinguish between in-doubt > transactions and other transactions that have their waiter (I call > on-line transactions). > Which are these other online transactions? 
I had assumed that the foreign transaction resolver process is there to resolve in-doubt transactions, but it seems it is also used for some other purpose, which anyway was the next question I had while reviewing other sections of the docs; let's clarify it as it came up now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Fri, 12 Jun 2020 at 12:40, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > I have another question about > > > > > this field, why can't it be one of the status ('preparing', > > > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a > > > > > separate field? > > > > > > > > Because I'm using in-doubt field also for checking if the foreign > > > > transaction entry can also be resolved manually, i.g. > > > > pg_resolve_foreign_xact(). For instance, a foreign transaction which > > > > status = 'prepared' and in-doubt = 'true' can be resolved either > > > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user > > > > execute pg_resolve_foreign_xact() against the foreign transaction, it > > > > sets status = 'committing' (or 'rollbacking') by checking transaction > > > > status in clog. The user might cancel pg_resolve_foreign_xact() during > > > > resolution. In this case, the foreign transaction is still status = > > > > 'committing' and in-doubt = 'true'. Then if a foreign transaction > > > > resolver process processes the foreign transaction, it can commit it > > > > without clog looking. > > > > > > > > > > I think this is a corner case and it is better to simplify the state > > > recording of foreign transactions then to save a CLOG lookup. > > > > > > > The main usage of in-doubt flag is to distinguish between in-doubt > > transactions and other transactions that have their waiter (I call > > on-line transactions). > > > > Which are these other online transactions? 
> I had assumed that foreign > transaction resolver process is to resolve in-doubt transactions but > it seems it is also used for some other purpose which anyway was the > next question I had while reviewing other sections of docs but let's > clarify as it came up now. When a distributed transaction is committed by the COMMIT command, the postgres backend process prepares all foreign transactions and commits the local transaction. Then the backend enqueues itself to the shmem queue, asks a resolver process to commit the prepared foreign transactions, and waits. That is, these prepared foreign transactions are committed by the resolver process, not the backend process. Once the resolver process has committed all prepared foreign transactions, it wakes the waiting backend process. This kind of transaction is what I meant by on-line transactions. This procedure is similar to what synchronous replication does. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
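That commit sequence can be sketched as follows. The names (`backend_commit`, `resolver_step`) and data structures are illustrative stand-ins, not the patch's actual functions:

```python
# Toy model of the described COMMIT flow: the backend prepares every foreign
# transaction, commits locally, then hands the prepared transactions to a
# resolver and waits, much like a synchronous-replication wait.

def backend_commit(foreign_servers, resolver_queue):
    for srv in foreign_servers:
        srv['state'] = 'prepared'           # PREPARE TRANSACTION on each server
    local_committed = True                  # commit the local transaction
    resolver_queue.append(foreign_servers)  # enqueue self; wait for the resolver
    return local_committed

def resolver_step(resolver_queue):
    foreign_servers = resolver_queue.pop(0)
    for srv in foreign_servers:
        srv['state'] = 'committed'          # COMMIT PREPARED, one server at a time
    return 'backend woken'                  # wake the waiting backend

queue, servers = [], [{'name': 's1'}, {'name': 's2'}]
assert backend_commit(servers, queue) is True
assert resolver_step(queue) == 'backend woken'
assert all(s['state'] == 'committed' for s in servers)
```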
On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > I think this is a corner case and it is better to simplify the state > > > > recording of foreign transactions then to save a CLOG lookup. > > > > > > > > > > The main usage of in-doubt flag is to distinguish between in-doubt > > > transactions and other transactions that have their waiter (I call > > > on-line transactions). > > > > > > > Which are these other online transactions? I had assumed that foreign > > transaction resolver process is to resolve in-doubt transactions but > > it seems it is also used for some other purpose which anyway was the > > next question I had while reviewing other sections of docs but let's > > clarify as it came up now. > > When a distributed transaction is committed by COMMIT command, the > postgres backend process prepare all foreign transaction and commit > the local transaction. > Does this mean that we will mark the xid as committed in CLOG of the local server? If so, why is this okay till we commit transactions in all the foreign servers, what if we fail to commit on one of the servers? Few more comments on v22-0003-Documentation-update -------------------------------------------------------------------------------------- 1. + When <literal>disabled</literal> there can be risk of database + consistency among all servers that involved in the distributed + transaction when some foreign server crashes during committing the + distributed transaction. Will it read better if rephrase above to something like: "When <literal>disabled</literal> there can be a risk of database consistency if one or more foreign servers crashes while committing the distributed transaction."? 2. 
+ <varlistentry id="guc-foreign-transaction-resolution-rety-interval" xreflabel="foreign_transaction_resolution_retry_interval"> + <term><varname>foreign_transaction_resolution_retry_interval</varname> (<type>integer</type>) + <indexterm> + <primary><varname>foreign_transaction_resolution_interval</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specify how long the foreign transaction resolver should wait when the last resolution + fails before retrying to resolve foreign transaction. This parameter can only be set in the + <filename>postgresql.conf</filename> file or on the server command line. + </para> + <para> + The default value is 10 seconds. + </para> + </listitem> + </varlistentry> Typo. <varlistentry id="guc-foreign-transaction-resolution-rety-interval", spelling of retry is wrong. Do we really need such a guc parameter? I think we can come up with some simple algorithm to retry after a few seconds and then increase that interval of retry if we fail again or something like that. I don't know how users can come up with some non-default value for this variable. 3 + <varlistentry id="guc-foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout"> + <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>) + <indexterm> + <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Terminate foreign transaction resolver processes that don't have any foreign + transactions to resolve longer than the specified number of milliseconds. + A value of zero disables the timeout mechanism, meaning it connects to one + database until stopping manually. Can we mention the function name using which one can stop the resolver process? 4. 
+ Using the <productname>PostgreSQL</productname>'s atomic commit ensures that + all changes on foreign servers end in either commit or rollback using the + transaction callback routines Can we slightly rephrase this "Using the PostgreSQL's atomic commit ensures that all the changes on foreign servers are either committed or rolled back using the transaction callback routines"? 5. + Prepare all transactions on foreign servers. + <productname>PostgreSQL</productname> distributed transaction manager + prepares all transaction on the foreign servers if two-phase commit is + required. Two-phase commit is required when the transaction modifies + data on two or more servers including the local server itself and + <xref linkend="guc-foreign-twophase-commit"/> is + <literal>required</literal>. /PostgreSQL/PostgreSQL's. If all preparations on foreign servers got + successful go to the next step. How about "If the prepare on all foreign servers is successful then go to the next step"? Any failure happens in this step, + the server changes to rollback, then rollback all transactions on both + local and foreign servers. Can we rephrase this line to something like: "If there is any failure in the prepare phase, the server will rollback all the transactions on both local and foreign servers."? What if the issued Rollback also failed, say due to network breakdown between local and one of foreign servers? Shouldn't such a transaction be in the 'in-doubt' state? 6. + <para> + Commit locally. The server commits transaction locally. Any failure happens + in this step the server changes to rollback, then rollback all transactions + on both local and foreign servers. + </para> + </listitem> + <listitem> + <para> + Resolve all prepared transaction on foreign servers. Pprepared transactions + are committed or rolled back according to the result of the local transaction. + This step is normally performed by a foreign transaction resolver process.
+ </para> When (in which step) do we commit on foreign servers? Do Resolver processes commit on foreign servers, if so, how can we commit locally without committing on foreign servers, what if the commit on one of the servers fails? It is not very clear to me from the steps mentioned here? Typo, /Pprepared/Prepared 7. However, foreign transactions + become <firstterm>in-doubt</firstterm> in three cases: where the foreign + server crashed or lost the connectibility to it during preparing foreign + transaction, where the local node crashed during either preparing or + resolving foreign transaction and where user canceled the query. Here the three cases are not very clear. You might want to use (a) ..., (b) .. ,(c).. Also, I think the state will be in-doubt even when we lost connection to server during commit or rollback. 8. + One foreign transaction resolver is responsible for transaction resolutions + on which one database connecting. Can we rephrase it to: "One foreign transaction resolver is responsible for transaction resolutions on the database to which it is connected."? 9. + Note that other <productname>PostgreSQL</productname> feature such as parallel + queries, logical replication, etc., also take worker slots from + <varname>max_worker_processes</varname>. /feature/features 10. + <para> + Atomic commit requires several configuration options to be set. + On the local node, <xref linkend="guc-max-prepared-foreign-transactions"/> and + <xref linkend="guc-max-foreign-transaction-resolvers"/> must be non-zero value. + Additionally the <varname>max_worker_processes</varname> may need to be adjusted to + accommodate for foreign transaction resolver workers, at least + (<varname>max_foreign_transaction_resolvers</varname> + <literal>1</literal>). + Note that other <productname>PostgreSQL</productname> feature such as parallel + queries, logical replication, etc., also take worker slots from + <varname>max_worker_processes</varname>. 
+ </para> Don't we need to mention foreign_twophase_commit GUC here? 11. + <sect2 id="fdw-callbacks-transaction-managements"> + <title>FDW Routines For Transaction Managements</title> Managements/Management? 12. + Transaction management callbacks are used for doing commit, rollback and + prepare the foreign transaction. Let's write the above sentence as: "Transaction management callbacks are used to commit, rollback and prepare the foreign transaction." 13. + <para> + Transaction management callbacks are used for doing commit, rollback and + prepare the foreign transaction. If an FDW wishes that its foreign + transaction is managed by <productname>PostgreSQL</productname>'s global + transaction manager it must provide both + <function>CommitForeignTransaction</function> and + <function>RollbackForeignTransaction</function>. In addition, if an FDW + wishes to support <firstterm>atomic commit</firstterm> (as described in + <xref linkend="fdw-transaction-managements"/>), it must provide + <function>PrepareForeignTransaction</function> as well and can provide + <function>GetPrepareId</function> callback optionally. + </para> What exact functionality can an FDW accomplish if it just supports CommitForeignTransaction and RollbackForeignTransaction? It seems it doesn't care for 2PC, if so, is there any special functionality we can achieve with this which we can't do without these APIs? 14. +PrepareForeignTransaction(FdwXactRslvState *frstate); +</programlisting> + Prepare the transaction on the foreign server. This function is called at the + pre-commit phase of the local transactions if foreign twophase commit is + required. This function is used only for distribute transaction management + (see <xref linkend="distributed-transaction"/>). + </para> /distribute/distributed 15.
+ <sect2 id="fdw-transaction-commit-rollback"> + <title>Commit And Rollback Single Foreign Transaction</title> + <para> + The FDW callback function <literal>CommitForeignTransaction</literal> + and <literal>RollbackForeignTransaction</literal> can be used to commit + and rollback the foreign transaction. During transaction commit, the core + transaction manager calls <literal>CommitForeignTransaction</literal> function + in the pre-commit phase and calls + <literal>RollbackForeignTransaction</literal> function in the post-rollback + phase. + </para> There is no reasoning mentioned as to why CommitForeignTransaction has to be called in pre-commit phase and RollbackForeignTransaction in post-rollback phase? Basically why one in pre phase and other in post phase? 16. + <entry> + <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal> + </entry> + <entry><type>void</type></entry> + <entry> + This function works the same as <function>pg_resolve_foreign_xact</function> + except that this removes the foreign transcation entry without resolution. + </entry> Can we write why and when such a function can be used? Typo, /trasnaction/transaction 17. + <row> + <entry><literal>FdwXactResolutionLock</literal></entry> + <entry>Waiting to read or update information of foreign trasnaction + resolution.</entry> + </row> /trasnaction/transaction -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Fri, 12 Jun 2020 at 15:37, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > I think this is a corner case and it is better to simplify the state > > > > > recording of foreign transactions then to save a CLOG lookup. > > > > > > > > > > > > > The main usage of in-doubt flag is to distinguish between in-doubt > > > > transactions and other transactions that have their waiter (I call > > > > on-line transactions). > > > > > > > > > > Which are these other online transactions? I had assumed that foreign > > > transaction resolver process is to resolve in-doubt transactions but > > > it seems it is also used for some other purpose which anyway was the > > > next question I had while reviewing other sections of docs but let's > > > clarify as it came up now. > > > > When a distributed transaction is committed by COMMIT command, the > > postgres backend process prepare all foreign transaction and commit > > the local transaction. > > Thank you for your review comments! Let me answer your question first; I'll then go through the review comments. > > Does this mean that we will mark the xid as committed in CLOG of the > local server? Well, what I meant is that when the client executes the COMMIT command, the backend executes a PREPARE TRANSACTION command on all involved foreign servers and then marks the xid as committed in the clog on the local server. > If so, why is this okay till we commit transactions in > all the foreign servers, what if we fail to commit on one of the > servers? Once the local transaction is committed, the involved foreign transactions are never rolled back. Since the backend has already prepared all foreign transactions before the local commit, committing a prepared foreign transaction basically doesn't fail.
But even if it fails for whatever reason, we never roll back the prepared foreign transactions. A resolver retries committing the foreign transactions at certain intervals. Does that answer your question? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 12, 2020 at 6:24 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Which are these other online transactions? I had assumed that foreign > > > > transaction resolver process is to resolve in-doubt transactions but > > > > it seems it is also used for some other purpose which anyway was the > > > > next question I had while reviewing other sections of docs but let's > > > > clarify as it came up now. > > > > > > When a distributed transaction is committed by COMMIT command, the > > > postgres backend process prepare all foreign transaction and commit > > > the local transaction. > > > > > Thank you for your review comments! Let me answer your question first. > I'll see the review comments. > > > > > Does this mean that we will mark the xid as committed in CLOG of the > > local server? > > Well what I meant is that when the client executes COMMIT command, the > backend executes PREPARE TRANSACTION command on all involved foreign > servers and then marks the xid as committed in clog in the local > server. > Won't it create an inconsistency in viewing the data from the different servers? Say, such a transaction inserts one row into a local server and another into the foreign server. Now, if we follow the above protocol, the user will be able to see the row from the local server but not from the foreign server. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, 13 Jun 2020 at 14:02, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 6:24 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Which are these other online transactions? I had assumed that foreign > > > > > transaction resolver process is to resolve in-doubt transactions but > > > > > it seems it is also used for some other purpose which anyway was the > > > > > next question I had while reviewing other sections of docs but let's > > > > > clarify as it came up now. > > > > > > > > When a distributed transaction is committed by COMMIT command, the > > > > postgres backend process prepare all foreign transaction and commit > > > > the local transaction. > > > > > > > > Thank you for your review comments! Let me answer your question first. > > I'll see the review comments. > > > > > > > > Does this mean that we will mark the xid as committed in CLOG of the > > > local server? > > > > Well what I meant is that when the client executes COMMIT command, the > > backend executes PREPARE TRANSACTION command on all involved foreign > > servers and then marks the xid as committed in clog in the local > > server. > > > > Won't it create an inconsistency in viewing the data from the > different servers? Say, such a transaction inserts one row into a > local server and another into the foreign server. Now, if we follow > the above protocol, the user will be able to see the row from the > local server but not from the foreign server. Yes, you're right. This atomic commit feature doesn't guarantee such consistent visibility, so-called atomic visibility. Even if the local server is not modified, since a resolver process commits prepared foreign transactions one by one, another user could see an inconsistent result. Providing globally consistent snapshots to transactions involving foreign servers is one of the solutions.
Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> Won't it create an inconsistency in viewing the data from the >> different servers? Say, such a transaction inserts one row into a >> local server and another into the foreign server. Now, if we follow >> the above protocol, the user will be able to see the row from the >> local server but not from the foreign server. > > Yes, you're right. This atomic commit feature doesn't guarantee such > consistent visibility so-called atomic visibility. Even the local > server is not modified, since a resolver process commits prepared > foreign transactions one by one another user could see an inconsistent > result. Providing globally consistent snapshots to transactions > involving foreign servers is one of the solutions. Another approach to the atomic visibility problem is to control snapshot acquisition timing and commit timing (plus using REPEATABLE READ). In the REPEATABLE READ transaction isolation level, PostgreSQL assigns a snapshot at the time when the first command is executed in a transaction. If we could prevent any commit while any transaction is acquiring a snapshot, and prevent any snapshot acquisition while committing, the visibility inconsistency which Amit explained can be avoided. This approach was proposed in an academic paper [1]. A good point of the approach is that we don't need to modify PostgreSQL at all. A downside of the approach is that we need someone who controls the timings (in [1], a middleware called "Pangea" was proposed). Also, we need to limit the transaction isolation level to REPEATABLE READ. [1] http://www.vldb.org/pvldb/vol2/vldb09-694.pdf Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
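The mutual-exclusion rule can be modeled minimally as follows. This is a toy coordinator built only to show the Pangea-style idea (a single lock object, invented names); it is not Pangea's actual design:

```python
import threading

# Snapshot acquisition and global commit exclude each other: a REPEATABLE READ
# transaction takes its snapshot atomically with respect to the per-server
# commits of any other transaction, so it can never observe a distributed
# transaction as committed on one node but not yet on another.

class Coordinator:
    def __init__(self):
        self._lock = threading.Lock()  # serializes snapshots against commits

    def acquire_snapshot(self, servers):
        with self._lock:               # no commit can interleave here
            return {s['name']: list(s['committed_rows']) for s in servers}

    def commit_everywhere(self, servers, row):
        with self._lock:               # no snapshot can interleave here
            for s in servers:
                s['committed_rows'].append(row)

servers = [{'name': 'n1', 'committed_rows': []},
           {'name': 'n2', 'committed_rows': []}]
c = Coordinator()
c.commit_everywhere(servers, 1)
snap = c.acquire_snapshot(servers)
# The snapshot sees the row on both nodes or on neither, never on just one.
assert snap['n1'] == snap['n2'] == [1]
```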
On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > >> Won't it create an inconsistency in viewing the data from the > >> different servers? Say, such a transaction inserts one row into a > >> local server and another into the foreign server. Now, if we follow > >> the above protocol, the user will be able to see the row from the > >> local server but not from the foreign server. > > > > Yes, you're right. This atomic commit feature doesn't guarantee such > > consistent visibility so-called atomic visibility. Okay, I understand that the purpose of this feature is to provide atomic commit which means the transaction on all servers involved will either commit or rollback. However, I think we should at least see at a high level how the visibility will work because it might influence the implementation of this feature. > > Even the local > > server is not modified, since a resolver process commits prepared > > foreign transactions one by one another user could see an inconsistent > > result. Providing globally consistent snapshots to transactions > > involving foreign servers is one of the solutions. How would it be able to do that? Say, when it decides to take a snapshot the transaction on the foreign server appears to be committed but the transaction on the local server won't appear to be committed, so the consistent data visibility problem as mentioned above could still arise. > > Another approach to the atomic visibility problem is to control > snapshot acquisition timing and commit timing (plus using REPEATABLE > READ). In the REPEATABLE READ transaction isolation level, PostgreSQL > assigns a snapshot at the time when the first command is executed in a > transaction. If we could prevent any commit while any transaction is > acquiring snapshot, and we could prevent any snapshot acquisition while > committing, visibility inconsistency which Amit explained can be > avoided. 
> I think the problem mentioned above can occur with this as well or if I am missing something then can you explain in further detail how it won't create problem in the scenario I have used above? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
>> Another approach to the atomic visibility problem is to control >> snapshot acquisition timing and commit timing (plus using REPEATABLE >> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL >> assigns a snapshot at the time when the first command is executed in a >> transaction. If we could prevent any commit while any transaction is >> acquiring snapshot, and we could prevent any snapshot acquisition while >> committing, visibility inconsistency which Amit explained can be >> avoided. >> > > I think the problem mentioned above can occur with this as well or if > I am missing something then can you explain in further detail how it > won't create problem in the scenario I have used above? So the problem you mentioned above is like this? (S1/S2 denote transactions (sessions); N1/N2 denote the PostgreSQL servers.) Since S1 has already committed on N1, S2 sees the row on N1. However, S2 does not see the row on N2 since S1 has not committed on N2 yet.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N2: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S2/N1: BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: SELECT * FROM t1; -- sees the row
 i
---
 1
(1 row)
S2/N2: SELECT * FROM t1; -- doesn't see the row
 i
---
(0 rows)
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED

Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Mon, Jun 15, 2020 at 12:30 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > >> Another approach to the atomic visibility problem is to control > >> snapshot acquisition timing and commit timing (plus using REPEATABLE > >> READ). In the REPEATABLE READ transaction isolation level, PostgreSQL > >> assigns a snapshot at the time when the first command is executed in a > >> transaction. If we could prevent any commit while any transaction is > >> acquiring snapshot, and we could prevent any snapshot acquisition while > >> committing, visibility inconsistency which Amit explained can be > >> avoided. > >> > > > > I think the problem mentioned above can occur with this as well or if > > I am missing something then can you explain in further detail how it > > won't create problem in the scenario I have used above? > > So the problem you mentioned above is like this? (S1/S2 denotes > transactions (sessions), N1/N2 is the postgreSQL servers). Since S1 > already committed on N1, S2 sees the row on N1. However S2 does not > see the row on N2 since S1 has not committed on N2 yet. > Yeah, something on these lines but S2 can execute the query on N1 directly which should fetch the data from both N1 and N2. Even if there is a solution using REPEATABLE READ isolation level we might not prefer to use that as the only level for distributed transactions, it might be too costly but let us first see how does it solve the problem? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Jun 14, 2020 at 2:21 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > > > >> Won't it create an inconsistency in viewing the data from the > > >> different servers? Say, such a transaction inserts one row into a > > >> local server and another into the foreign server. Now, if we follow > > >> the above protocol, the user will be able to see the row from the > > >> local server but not from the foreign server. > > > > > > Yes, you're right. This atomic commit feature doesn't guarantee such > > > consistent visibility so-called atomic visibility. > > Okay, I understand that the purpose of this feature is to provide > atomic commit which means the transaction on all servers involved will > either commit or rollback. However, I think we should at least see at > a high level how the visibility will work because it might influence > the implementation of this feature. > > > > Even the local > > > server is not modified, since a resolver process commits prepared > > > foreign transactions one by one another user could see an inconsistent > > > result. Providing globally consistent snapshots to transactions > > > involving foreign servers is one of the solutions. > > How would it be able to do that? Say, when it decides to take a > snapshot the transaction on the foreign server appears to be committed > but the transaction on the local server won't appear to be committed, > so the consistent data visibility problem as mentioned above could > still arise. There are many solutions. For instance, in Postgres-XC/X2 (and maybe XL), there is a GTM node that is responsible for providing global transaction IDs (GXID) and globally consistent snapshots. All transactions need to access GTM when checking the distributed transaction status as well as starting transactions and ending transactions. 
IIUC if a global transaction accesses a tuple whose GXID is included in its global snapshot it waits for that transaction to be committed or rolled back. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
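The GTM scheme described above could be sketched roughly as follows; this is a toy model with invented names, not the actual Postgres-XC/X2 API, and it shows only the visibility rule, not the wait-for-commit behavior:

```python
# Toy GTM: hands out global transaction IDs (GXIDs) and snapshots that
# record the GXIDs still in progress cluster-wide. A tuple written by
# GXID x is visible to a snapshot iff x has committed and x was not in
# progress when the snapshot was taken.

class GTM:
    def __init__(self):
        self.next_gxid = 1
        self.in_progress = set()
        self.committed = set()

    def begin(self):
        gxid = self.next_gxid
        self.next_gxid += 1
        self.in_progress.add(gxid)
        return gxid

    def snapshot(self):
        return set(self.in_progress)   # GXIDs invisible to this snapshot

    def commit(self, gxid):
        self.in_progress.discard(gxid)
        self.committed.add(gxid)

def visible(gtm, snap, tuple_gxid):
    return tuple_gxid in gtm.committed and tuple_gxid not in snap

gtm = GTM()
writer = gtm.begin()          # distributed write in progress
snap = gtm.snapshot()         # a reader takes a global snapshot
gtm.commit(writer)            # the writer commits on all nodes

# The reader's snapshot predates the commit, so on *every* node the
# row stays invisible: no per-node inconsistency.
print(visible(gtm, snap, writer))   # False
```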
On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Even the local > > > > server is not modified, since a resolver process commits prepared > > > > foreign transactions one by one another user could see an inconsistent > > > > result. Providing globally consistent snapshots to transactions > > > > involving foreign servers is one of the solutions. > > > > How would it be able to do that? Say, when it decides to take a > > snapshot the transaction on the foreign server appears to be committed > > but the transaction on the local server won't appear to be committed, > > so the consistent data visibility problem as mentioned above could > > still arise. > > There are many solutions. For instance, in Postgres-XC/X2 (and maybe > XL), there is a GTM node that is responsible for providing global > transaction IDs (GXID) and globally consistent snapshots. All > transactions need to access GTM when checking the distributed > transaction status as well as starting transactions and ending > transactions. IIUC if a global transaction accesses a tuple whose GXID > is included in its global snapshot it waits for that transaction to be > committed or rolled back. > Is there some mapping between GXID and XIDs allocated for each node or will each node use the GXID as XID to modify the data? Are we fine with parking the work for global snapshots and atomic visibility to a separate patch and just proceed with the design proposed by this patch? I am asking because I thought there might be some impact on the design of this patch based on what we decide for that work. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 7:06 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Mon, 15 Jun 2020 at 15:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > Even the local > > > > > server is not modified, since a resolver process commits prepared > > > > > foreign transactions one by one another user could see an inconsistent > > > > > result. Providing globally consistent snapshots to transactions > > > > > involving foreign servers is one of the solutions. > > > > > > How would it be able to do that? Say, when it decides to take a > > > snapshot the transaction on the foreign server appears to be committed > > > but the transaction on the local server won't appear to be committed, > > > so the consistent data visibility problem as mentioned above could > > > still arise. > > > > There are many solutions. For instance, in Postgres-XC/X2 (and maybe > > XL), there is a GTM node that is responsible for providing global > > transaction IDs (GXID) and globally consistent snapshots. All > > transactions need to access GTM when checking the distributed > > transaction status as well as starting transactions and ending > > transactions. IIUC if a global transaction accesses a tuple whose GXID > > is included in its global snapshot it waits for that transaction to be > > committed or rolled back. > > > > Is there some mapping between GXID and XIDs allocated for each node or > will each node use the GXID as XID to modify the data? Are we fine > with parking the work for global snapshots and atomic visibility to a > separate patch and just proceed with the design proposed by this > patch? Distributed transaction involves, atomic commit, atomic visibility and global consistency. 2PC is the only practical solution for atomic commit. There are some improvements over 2PC but those are add ons to the basic 2PC, which is what this patch provides. 
Atomic visibility and global consistency, however, have alternative solutions, but all of those solutions require 2PC to be supported. Each of those is a large piece of work, and trying to get everything in at once may not work. Once we have basic 2PC in place, there will be ground to experiment with solutions for global consistency and atomic visibility. If we manage to do it right, we could make them pluggable as well. So I think we should concentrate on supporting basic 2PC now.

> I am asking because I thought there might be some impact on > the design of this patch based on what we decide for that work. >

Since 2PC is at the heart of any distributed transaction system, the impact will be low. Figuring all of that out without having basic 2PC will be very hard. -- Best Wishes, Ashutosh Bapat
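The basic 2PC sequence the patch provides (prepare on every foreign server, commit locally, then resolve the prepared foreign transactions) can be sketched as below. This is a minimal model with invented names, not the patch's code, and it ignores crashes between steps, which are the hard part the resolver process exists for:

```python
# Sketch of basic atomic commit via 2PC. Servers are modeled as dicts
# holding sets of prepared and committed transaction IDs.

def atomic_commit(local, foreign_servers, gid):
    prepared = []
    try:
        for srv in foreign_servers:           # step 1: prepare phase
            srv["prepared"].add(gid)
            prepared.append(srv)
    except Exception:
        for srv in prepared:                  # any failure -> roll back all
            srv["prepared"].discard(gid)
        raise
    local["committed"].add(gid)               # step 2: commit locally
    for srv in foreign_servers:               # step 3: resolve (commit)
        srv["prepared"].discard(gid)          #         the prepared xacts
        srv["committed"].add(gid)

local = {"committed": set()}
f1 = {"prepared": set(), "committed": set()}
f2 = {"prepared": set(), "committed": set()}
atomic_commit(local, [f1, f2], "gx1")
print(local["committed"], f1["committed"], f2["committed"])
# {'gx1'} {'gx1'} {'gx1'}
```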
On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote: > Thank you for your reviews on 0003 patch. I've incorporated your comments. I'll submit the latest version patch later as the design or scope might change as a result of the discussion. > > Few more comments on v22-0003-Documentation-update > -------------------------------------------------------------------------------------- > 1. > + When <literal>disabled</literal> there can be risk of database > + consistency among all servers that involved in the distributed > + transaction when some foreign server crashes during committing the > + distributed transaction. > > Will it read better if rephrase above to something like: "When > <literal>disabled</literal> there can be a risk of database > consistency if one or more foreign servers crashes while committing > the distributed transaction."? Fixed. > > 2. > + <varlistentry > id="guc-foreign-transaction-resolution-rety-interval" > xreflabel="foreign_transaction_resolution_retry_interval"> > + <term><varname>foreign_transaction_resolution_retry_interval</varname> > (<type>integer</type>) > + <indexterm> > + <primary><varname>foreign_transaction_resolution_interval</varname> > configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Specify how long the foreign transaction resolver should > wait when the last resolution > + fails before retrying to resolve foreign transaction. This > parameter can only be set in the > + <filename>postgresql.conf</filename> file or on the server > command line. > + </para> > + <para> > + The default value is 10 seconds. > + </para> > + </listitem> > + </varlistentry> > > Typo. <varlistentry > id="guc-foreign-transaction-resolution-rety-interval", spelling of > retry is wrong. Do we really need such a guc parameter? 
I think we > can come up with some simple algorithm to retry after a few seconds > and then increase that interval of retry if we fail again or something > like that. I don't know how users can come up with some non-default > value for this variable.

For example, in a low-reliability network environment, setting a lower value would help minimize the backend wait time in case of a lost connection. But I also agree with your point. In terms of implementation, having backends wait for a fixed time is simpler, but we could do such an incremental interval by remembering the retry count for each foreign transaction.

An open question regarding retrying foreign transaction resolution is how we handle the case where an involved foreign server is down for a very long time. If an online transaction is waiting to be resolved, there is no way to exit the wait loop other than the user sending a cancel request or the crashed server being restored. But if the foreign server will be down for a long time, I think it’s not practical to rely on the user sending a cancel request, because the client would need something like a timeout mechanism. So it might be better to provide a way to cancel the waiting without the user sending a cancel, for example a timeout or a limit on the retry count.

If an in-doubt transaction is waiting to be resolved, we keep trying to resolve the foreign transaction at an interval. But I wonder if the user might want to disable automatic in-doubt foreign transaction resolution in some cases, for example where the user knows the crashed server will not be restored for a long time. I’m thinking that we can provide a way to disable automatic foreign transaction resolution globally or for a particular foreign transaction.
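The incremental retry interval suggested above could look like the following sketch. The helper name is hypothetical, and the base interval, growth factor, and cap are made-up numbers, not values from the patch:

```python
import itertools

def resolution_retry_delays(base=10.0, factor=2.0, cap=300.0):
    """Yield successive wait times (seconds) before each retry of
    foreign transaction resolution: exponential growth with a cap,
    instead of a fixed foreign_transaction_resolution_retry_interval."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)

print(list(itertools.islice(resolution_retry_delays(), 6)))
# [10.0, 20.0, 40.0, 80.0, 160.0, 300.0]
```

A resolver loop would pull the next delay from the generator after each failed resolution attempt, and reset the generator once resolution succeeds.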
> > 3 > + <varlistentry id="guc-foreign-transaction-resolver-timeout" > xreflabel="foreign_transaction_resolver_timeout"> > + <term><varname>foreign_transaction_resolver_timeout</varname> > (<type>integer</type>) > + <indexterm> > + <primary><varname>foreign_transaction_resolver_timeout</varname> > configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Terminate foreign transaction resolver processes that don't > have any foreign > + transactions to resolve longer than the specified number of > milliseconds. > + A value of zero disables the timeout mechanism, meaning it > connects to one > + database until stopping manually. > > Can we mention the function name using which one can stop the resolver process? Fixed. > > 4. > + Using the <productname>PostgreSQL</productname>'s atomic commit ensures that > + all changes on foreign servers end in either commit or rollback using the > + transaction callback routines > > Can we slightly rephase this "Using the PostgreSQL's atomic commit > ensures that all the changes on foreign servers are either committed > or rolled back using the transaction callback routines"? Fixed. > > 5. > + Prepare all transactions on foreign servers. > + <productname>PostgreSQL</productname> distributed transaction manager > + prepares all transaction on the foreign servers if two-phase commit is > + required. Two-phase commit is required when the transaction modifies > + data on two or more servers including the local server itself and > + <xref linkend="guc-foreign-twophase-commit"/> is > + <literal>required</literal>. > > /PostgreSQL/PostgreSQL's. Fixed. > > If all preparations on foreign servers got > + successful go to the next step. > > How about "If the prepare on all foreign servers is successful then go > to the next step"? Fixed. > > Any failure happens in this step, > + the server changes to rollback, then rollback all transactions on both > + local and foreign servers. 
> > Can we rephrase this line to something like: "If there is any failure > in the prepare phase, the server will rollback all the transactions on > both local and foreign servers."? Fixed. > > What if the issued Rollback also failed, say due to network breakdown > between local and one of foreign servers? Shouldn't such a > transaction be 'in-doubt' state?

The rollback API that rolls back a transaction in one phase can be called recursively, so FDWs have to tolerate recursive calls. In the current patch, all transaction operations are performed synchronously. That is, a foreign transaction never becomes in-doubt without an explicit cancel by the user or a local node crash. That way, subsequent transactions can assume that preceding distributed transactions are already resolved unless the user canceled. Let me explain the details:

If the transaction turns to rollback due to a failure before the local commit, we attempt both ROLLBACK and ROLLBACK PREPARED against foreign transactions whose status is PREPARING. That is, we end the foreign transactions by doing ROLLBACK, and since we're not sure the preparation has completed on the foreign server, the backend also asks the resolver process to do ROLLBACK PREPARED on the foreign servers. Therefore FDWs have to tolerate an OBJECT_NOT_FOUND error in the abort case. Since the backend process returns an acknowledgment to the client only after rolling back all foreign transactions, these foreign transactions don't remain in an in-doubt state.

If rolling back fails after the local commit (i.e., the client does ROLLBACK and the resolver fails to do ROLLBACK PREPARED), a resolver process will relaunch and retry the ROLLBACK PREPARED. The backend process waits until ROLLBACK PREPARED is successfully done or the user cancels. So the foreign transactions don't become in-doubt transactions. Synchronousness is also an open question.
If we want to support atomic commit in an asynchronous manner, it might be better to implement it first in terms of complexity: the backend returns an acknowledgment to the client immediately after asking the resolver process. This is known as the early acknowledgment technique. The downside is that a user who wants to see the result of a preceding transaction needs to make sure the preceding transaction has committed on all foreign servers. We will also need to think about how to control it with a GUC parameter when we have synchronous distributed transaction commit. Perhaps it’s better to control it independently of synchronous replication.

> > 6. > + <para> > + Commit locally. The server commits transaction locally. Any > failure happens > + in this step the server changes to rollback, then rollback all > transactions > + on both local and foreign servers. > + </para> > + </listitem> > + <listitem> > + <para> > + Resolve all prepared transaction on foreign servers. Pprepared > transactions > + are committed or rolled back according to the result of the > local transaction. > + This step is normally performed by a foreign transaction > resolver process. > + </para> > > When (in which step) do we commit on foreign servers? Do Resolver > processes commit on foreign servers, if so, how can we commit locally > without committing on foreign servers, what if the commit on one of > the servers fails? It is not very clear to me from the steps mentioned > here?

In case 2PC is required, we commit transactions on foreign servers at the final step, via the resolver process. If committing a prepared transaction on one of the servers fails, a resolver process relaunches after an interval and retries the commit. In case 2PC is not required, we commit transactions on foreign servers at the pre-commit phase, via the backend. > Typo, /Pprepared/Prepared Fixed. > > 7.
> However, foreign transactions > + become <firstterm>in-doubt</firstterm> in three cases: where the foreign > + server crashed or lost the connectibility to it during preparing foreign > + transaction, where the local node crashed during either preparing or > + resolving foreign transaction and where user canceled the query. > > Here the three cases are not very clear. You might want to use (a) > ..., (b) .. ,(c).. Fixed. I change it to itemizedlist. > Also, I think the state will be in-doubt even when > we lost connection to server during commit or rollback. Let me correct the cases of the foreign transactions remain as in-doubt state. There are two cases: * The local node crashed * The user canceled the transaction commit or rollback. Even when we lost connection to the server during commit or rollback prepared transaction, a backend doesn’t return an acknowledgment to the client until either transaction is successfully resolved, the user cancels the transaction, or the local node crashes. > > 8. > + One foreign transaction resolver is responsible for transaction resolutions > + on which one database connecting. > > Can we rephrase it to: "One foreign transaction resolver is > responsible for transaction resolutions on the database to which it is > connected."? Fixed. > > 9. > + Note that other <productname>PostgreSQL</productname> feature > such as parallel > + queries, logical replication, etc., also take worker slots from > + <varname>max_worker_processes</varname>. > > /feature/features Fixed. > > 10. > + <para> > + Atomic commit requires several configuration options to be set. > + On the local node, <xref > linkend="guc-max-prepared-foreign-transactions"/> and > + <xref linkend="guc-max-foreign-transaction-resolvers"/> must be > non-zero value. 
> + Additionally the <varname>max_worker_processes</varname> may need > to be adjusted to > + accommodate for foreign transaction resolver workers, at least > + (<varname>max_foreign_transaction_resolvers</varname> + > <literal>1</literal>). > + Note that other <productname>PostgreSQL</productname> feature > such as parallel > + queries, logical replication, etc., also take worker slots from > + <varname>max_worker_processes</varname>. > + </para> > > Don't we need to mention foreign_twophase_commit GUC here? Fixed. > > 11. > + <sect2 id="fdw-callbacks-transaction-managements"> > + <title>FDW Routines For Transaction Managements</title> > > Managements/Management? Fixed. > > 12. > + Transaction management callbacks are used for doing commit, rollback and > + prepare the foreign transaction. > > Lets write the above sentence as: "Transaction management callbacks > are used to commit, rollback and prepare the foreign transaction." Fixed. > > 13. > + <para> > + Transaction management callbacks are used for doing commit, rollback and > + prepare the foreign transaction. If an FDW wishes that its foreign > + transaction is managed by <productname>PostgreSQL</productname>'s global > + transaction manager it must provide both > + <function>CommitForeignTransaction</function> and > + <function>RollbackForeignTransaction</function>. In addition, if an FDW > + wishes to support <firstterm>atomic commit</firstterm> (as described in > + <xref linkend="fdw-transaction-managements"/>), it must provide > + <function>PrepareForeignTransaction</function> as well and can provide > + <function>GetPrepareId</function> callback optionally. > + </para> > > What exact functionality a FDW can accomplish if it just supports > CommitForeignTransaction and RollbackForeignTransaction? It seems it > doesn't care for 2PC, if so, is there any special functionality we can > achieve with this which we can't do without these APIs? 
There is no special functionality even if an FDW implements CommitForeignTransaction and RollbackForeignTransaction. Currently, since there is no transaction API among the FDW APIs, FDW developers have to use XactCallback to control transactions, but there is no documentation for that. The idea of allowing an FDW to support only CommitForeignTransaction and RollbackForeignTransaction is that FDW developers can implement transaction management easily. But in the first patch, we could also disallow it to keep the implementation simple.

> > 14. > +PrepareForeignTransaction(FdwXactRslvState *frstate); > +</programlisting> > + Prepare the transaction on the foreign server. This function is > called at the > + pre-commit phase of the local transactions if foreign twophase commit is > + required. This function is used only for distribute transaction management > + (see <xref linkend="distributed-transaction"/>). > + </para> > > /distribute/distributed Fixed. > > 15. > + <sect2 id="fdw-transaction-commit-rollback"> > + <title>Commit And Rollback Single Foreign Transaction</title> > + <para> > + The FDW callback function <literal>CommitForeignTransaction</literal> > + and <literal>RollbackForeignTransaction</literal> can be used to commit > + and rollback the foreign transaction. During transaction commit, the core > + transaction manager calls > <literal>CommitForeignTransaction</literal> function > + in the pre-commit phase and calls > + <literal>RollbackForeignTransaction</literal> function in the > post-rollback > + phase. > + </para> > > There is no reasoning mentioned as to why CommitForeignTransaction has > to be called in pre-commit phase and RollbackForeignTransaction in > post-rollback phase? Basically why one in pre phase and other in post > phase?

Good point. This behavior just follows what postgres_fdw does.
I'm not sure of the exact reason why postgres_fdw commits the transaction in the pre-commit phase, but I guess that since committing a foreign transaction is more likely to fail than the local commit, it might be better to do it first.

> > 16. > + <entry> > + <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal> > + </entry> > + <entry><type>void</type></entry> > + <entry> > + This function works the same as > <function>pg_resolve_foreign_xact</function> > + except that this removes the foreign transcation entry > without resolution. > + </entry> > > Can we write why and when such a function can be used? Typo, > /trasnaction/transaction Fixed. > > 17. > + <row> > + <entry><literal>FdwXactResolutionLock</literal></entry> > + <entry>Waiting to read or update information of foreign trasnaction > + resolution.</entry> > + </row> > > /trasnaction/transaction Fixed.

Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
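The abort-path rule discussed earlier in this message (FDWs must tolerate recursive rollback calls and an OBJECT_NOT_FOUND error, because we may not know whether PREPARE completed on the foreign server) amounts to making rollback idempotent. A rough sketch with invented names, not the patch's FDW API:

```python
# "ObjectNotFound" stands in for the FDW error the text says must be
# tolerated; the server dict is a stand-in for a foreign node's
# prepared-transaction table.

class ObjectNotFound(Exception):
    pass

def rollback_prepared(server, gid):
    if gid not in server:
        raise ObjectNotFound(gid)
    del server[gid]

def resolve_abort(server, gid):
    """Roll back a possibly-prepared foreign transaction; treat a
    missing prepared transaction as already rolled back."""
    try:
        rollback_prepared(server, gid)
    except ObjectNotFound:
        pass    # PREPARE never completed there: nothing to do

server = {"s1n2": "prepared"}
resolve_abort(server, "s1n2")   # rolls back the prepared transaction
resolve_abort(server, "s1n2")   # safe to call again (recursive/retried)
print(server)                   # {}
```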
>> > I think the problem mentioned above can occur with this as well or if >> > I am missing something then can you explain in further detail how it >> > won't create problem in the scenario I have used above? >> >> So the problem you mentioned above is like this? (S1/S2 denotes >> transactions (sessions), N1/N2 is the postgreSQL servers). Since S1 >> already committed on N1, S2 sees the row on N1. However S2 does not >> see the row on N2 since S1 has not committed on N2 yet. >> > > Yeah, something on these lines but S2 can execute the query on N1 > directly which should fetch the data from both N1 and N2.

The algorithm assumes that any client accesses the database through middleware. Such direct access is prohibited.

> Even if > there is a solution using REPEATABLE READ isolation level we might not > prefer to use that as the only level for distributed transactions, it > might be too costly but let us first see how does it solve the > problem?

The paper extends Snapshot Isolation (SI, which is the same as our REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI). I think GSI will solve the problem (atomic visibility) we are discussing. Unlike READ COMMITTED, REPEATABLE READ acquires its snapshot at the time the first command is executed in a transaction (READ COMMITTED acquires a snapshot at each command in a transaction). Pangea controls the timing of snapshot acquisition on each pair of transactions (S1/N1,N2 or S2/N1,N2) so that each pair acquires the same snapshot. To achieve this, while some transactions are trying to acquire a snapshot, any commit operation is postponed. Likewise, any snapshot acquisition waits until any in-progress commit operations are finished (see Algorithms I to III in the paper for more details). With this rule, the previous example now looks like this: you can see that SELECT on S2/N1 and S2/N2 gives the same result.
S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S2/N1: BEGIN;
BEGIN
S1/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N2: SET transaction_isolation TO 'repeatable read';
SET
S2/N1: SET transaction_isolation TO 'repeatable read';
SET
S1/N1: INSERT INTO t1 VALUES (1);
INSERT 0 1
S1/N2: INSERT INTO t1 VALUES (1);
INSERT 0 1
S2/N1: SELECT * FROM t1;
 i
---
(0 rows)
S2/N2: SELECT * FROM t1;
 i
---
(0 rows)
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED

Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
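Pangea's rule described above (commits wait for in-flight snapshot acquisitions, and snapshot acquisitions wait for in-flight commits) can be modeled minimally as follows. This is a single-threaded sketch with invented names; the real algorithm blocks rather than erroring, and the paper's Algorithms I to III are the authoritative description:

```python
# A readers-writer style pair of counters stands in for the real
# synchronization: while any pair of transactions is acquiring its
# snapshots (S2 on N1 and N2), no commit may interleave, and vice
# versa, so every pair sees the same set of commits on all nodes.

class PangeaCoordinator:
    def __init__(self):
        self.snapshots_in_progress = 0
        self.commits_in_progress = 0

    def begin_snapshot(self):
        assert self.commits_in_progress == 0, "must wait for commits"
        self.snapshots_in_progress += 1

    def end_snapshot(self):
        self.snapshots_in_progress -= 1

    def begin_commit(self):
        assert self.snapshots_in_progress == 0, "must wait for snapshots"
        self.commits_in_progress += 1

    def end_commit(self):
        self.commits_in_progress -= 1

coord = PangeaCoordinator()
coord.begin_snapshot()   # S2 acquires its snapshot on N1 and N2 ...
coord.end_snapshot()     # ... atomically with respect to commits
coord.begin_commit()     # only now may S1 commit on N1 and N2
coord.end_commit()
print("ok")              # the bad interleaving (a commit landing between
                         # S2's two per-node snapshot acquisitions) is
                         # ruled out
```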
On Tue, Jun 16, 2020 at 06:42:52PM +0530, Ashutosh Bapat wrote: > > Is there some mapping between GXID and XIDs allocated for each node or > > will each node use the GXID as XID to modify the data? Are we fine > > with parking the work for global snapshots and atomic visibility to a > > separate patch and just proceed with the design proposed by this > > patch? > > Distributed transaction involves, atomic commit, atomic visibility > and global consistency. 2PC is the only practical solution for atomic > commit. There are some improvements over 2PC but those are add ons to > the basic 2PC, which is what this patch provides. Atomic visibility > and global consistency however have alternative solutions but all of > those solutions require 2PC to be supported. Each of those are large > pieces of work and trying to get everything in may not work. Once we > have basic 2PC in place, there will be a ground to experiment with > solutions for global consistency and atomic visibility. If we manage > to do it right, we could make it pluggable as well. So, I think we > should concentrate on supporting basic 2PC work now. Very good summary, thank you. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
> I've attached the new version patch set. 0006 is a separate patch > which introduces 'prefer' mode to foreign_twophase_commit.

I hope we can use this feature. Thank you for the patches and discussions. I'm currently working through the logic and found some minor points to be fixed. I'm sorry if my understanding is wrong.

* The v22 patches need a rebase, as they don't apply to the current master.

* FdwXactAtomicCommitParticipants, mentioned in src/backend/access/fdwxact/README, is not implemented. Is FdwXactParticipants the right name?

* The following comment says that this code is for "One-phase", but the second argument of FdwXactParticipantEndTransaction() says this call is not "onephase".

AtEOXact_FdwXact() in fdwxact.c
    /* One-phase rollback foreign transaction */
    FdwXactParticipantEndTransaction(fdw_part, false, false);

static void
FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool onephase,
                                 bool for_commit)

* The "two_phase_commit" option is mentioned in postgres-fdw.sgml, but I can't find the related code.

* A comment in resolver.c has a sentence containing two consecutive blanks. (Emergency Termination)

* There are some inconsistencies with the PostgreSQL wiki. https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions I understand it's difficult to keep them consistent; I think it's OK to fix these later, when the patches are almost ready to be committed.

  - I can't find the "two_phase_commit" option in the source code. But 2PC works if the remote server's "max_prepared_transactions" is set to a non-zero value. Is that the intended behavior?

  - Some parameters are renamed or added in the latest patches: max_prepared_foreign_transaction, max_prepared_transactions and so on.

  - typo: froeign_transaction_resolver_timeout

Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Tue, Jun 16, 2020 at 8:06 PM Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>
> >> > I think the problem mentioned above can occur with this as well or if
> >> > I am missing something then can you explain in further detail how it
> >> > won't create problem in the scenario I have used above?
> >>
> >> So the problem you mentioned above is like this? (S1/S2 denotes
> >> transactions (sessions), N1/N2 is the postgreSQL servers). Since S1
> >> already committed on N1, S2 sees the row on N1. However S2 does not
> >> see the row on N2 since S1 has not committed on N2 yet.
> >>
> >
> > Yeah, something on these lines but S2 can execute the query on N1
> > directly which should fetch the data from both N1 and N2.
>
> The algorithm assumes that any client should access the database through a
> middleware. Such direct access is prohibited.
>

okay, so it seems we need a few things which the middleware (Pangea) expects
if we have to follow the design of the paper.

> > Even if
> > there is a solution using REPEATABLE READ isolation level we might not
> > prefer to use that as the only level for distributed transactions, it
> > might be too costly but let us first see how does it solve the
> > problem?
>
> The paper extends Snapshot Isolation (SI, which is the same as our
> REPEATABLE READ isolation level) to "Global Snapshot Isolation" (GSI).
> I think GSI will solve the problem (atomic visibility) we are
> discussing.
>
> Unlike READ COMMITTED, REPEATABLE READ acquires its snapshot at the time
> when the first command is executed in a transaction (READ COMMITTED
> acquires a snapshot at each command in a transaction). Pangea controls
> the timing of the snapshot acquisition on pairs of transactions
> (S1/N1,N2 or S2/N1,N2) so that each pair acquires the same
> snapshot. To achieve this, while some transactions are trying to
> acquire a snapshot, any commit operation should be postponed. Likewise,
> any snapshot acquisition should wait until any in-progress commit
> operations are finished (see Algorithm I to III in the paper for more
> details).
>

I haven't read the paper completely but it sounds quite restrictive
(like both commits and snapshots need to wait). Another point is:
do we want some middleware involved in the solution? The main thing
I was looking into at this stage is: do we think that the current
implementation proposed by the patch for 2PC is generic enough that we
would later be able to integrate the solution for atomic visibility?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
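The Pangea constraint quoted above (snapshot acquisition and commits mutually exclude each other, so every transaction sees the same committed state on every node) can be sketched with a toy model. This is my own illustration, not the paper's actual Algorithms I to III, and all class and method names are invented:

```python
import threading

class PangeaSnapshotCoordinator:
    """Toy model: begin-snapshot and commit are mutually exclusive
    critical sections, so a snapshot can never observe a transaction
    that is committed on one node but not yet committed on another."""

    def __init__(self, num_nodes):
        self._lock = threading.Lock()          # serializes snapshots vs. commits
        self._last_commit = [0] * num_nodes    # last committed txid per node

    def acquire_snapshot(self):
        # Waits for any in-progress commit, then returns one snapshot
        # that is identical on all nodes.
        with self._lock:
            return tuple(self._last_commit)

    def commit(self, txid):
        # Waits for any in-progress snapshot acquisition, then applies
        # the commit to every node before releasing the lock.
        with self._lock:
            for n in range(len(self._last_commit)):
                self._last_commit[n] = txid
```

Even with a reader and a writer running concurrently, every snapshot comes out uniform across nodes, which is the atomic-visibility property under discussion, at the cost of commits and snapshot acquisitions waiting on each other.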
On Wed, 17 Jun 2020 at 09:01, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > I've attached the new version patch set. 0006 is a separate patch > > which introduces 'prefer' mode to foreign_twophase_commit. > > I hope we can use this feature. Thank you for making patches and > discussions. > I'm currently understanding the logic and found some minor points to be > fixed. > > I'm sorry if my understanding is wrong. > > * The v22 patches need rebase as they can't apply to the current master. > > * FdwXactAtomicCommitParticipants said in > src/backend/access/fdwxact/README > is not implemented. Is FdwXactParticipants right? Right. > > * A following comment says that this code is for "One-phase", > but second argument of FdwXactParticipantEndTransaction() describes > this code is not "onephase". > > AtEOXact_FdwXact() in fdwxact.c > /* One-phase rollback foreign transaction */ > FdwXactParticipantEndTransaction(fdw_part, false, false); > > static void > FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool > onephase, > bool for_commit) > > * "two_phase_commit" option is mentioned in postgres-fdw.sgml, > but I can't find related code. > > * resolver.c comments have the sentence > containing two blanks.(Emergency Termination) > > * There are some inconsistency with PostgreSQL wiki. > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions > > I understand it's difficult to keep consistency, I think it's ok to > fix later > when these patches almost be able to be committed. > > - I can't find "two_phase_commit" option in the source code. > But 2PC is work if the remote server's "max_prepared_transactions" > is set > to non zero value. It is correct work, isn't it? Yes. I had removed two_phase_commit option from postgres_fdw. Currently, postgres_fdw uses 2pc when 2pc is required. Therefore, max_prepared_transactions needs to be set to more than one, as you mentioned. > > - some parameters are renamed or added in latest patches. 
> max_prepared_foreign_transaction, max_prepared_transactions and so > on. > > - typo: froeign_transaction_resolver_timeout > Thank you for your review! I've incorporated your comments on the local branch. I'll share the latest version patch. Also, I've updated the wiki page. I'll try to keep the wiki page up-to-date. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 16, 2020 at 6:43 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Tue, Jun 16, 2020 at 3:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Is there some mapping between GXID and XIDs allocated for each node or > > will each node use the GXID as XID to modify the data? Are we fine > > with parking the work for global snapshots and atomic visibility to a > > separate patch and just proceed with the design proposed by this > > patch? > > Distributed transaction involves, atomic commit, atomic visibility > and global consistency. 2PC is the only practical solution for atomic > commit. There are some improvements over 2PC but those are add ons to > the basic 2PC, which is what this patch provides. Atomic visibility > and global consistency however have alternative solutions but all of > those solutions require 2PC to be supported. Each of those are large > pieces of work and trying to get everything in may not work. Once we > have basic 2PC in place, there will be a ground to experiment with > solutions for global consistency and atomic visibility. If we manage > to do it right, we could make it pluggable as well. > I think it is easier said than done. If you want to make it pluggable or want alternative solutions to adapt the 2PC support provided by us we should have some idea how those alternative solutions look like. I am not telling we have to figure out each and every detail of those solutions but without paying any attention to the high-level picture we might end up doing something for 2PC here which either needs a lot of modifications or might need a design change which would be bad. Basically, if we later decide to use something like Global Xid to achieve other features then what we are doing here might not work. I think it is a good idea to complete the work in pieces where each piece is useful on its own but without having clarity on the overall solution that could be a recipe for disaster. 
It is possible that you have some idea in your mind where you can see clearly how this piece of work can fit in the bigger picture but it is not very apparent to others or doesn't seem to be documented anywhere. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> okay, so it seems we need few things which middleware (Pangea) expects > if we have to follow the design of paper. Yes. > I haven't read the paper completely but it sounds quite restrictive > (like both commits and snapshots need to wait). Maybe. There is a performance evaluation in the paper. You might want to take a look at it. > Another point is that > do we want some middleware involved in the solution? The main thing > I was looking into at this stage is do we think that the current > implementation proposed by the patch for 2PC is generic enough that we > would be later able to integrate the solution for atomic visibility? My concern is, FDW+2PC without atomic visibility could lead to data inconsistency among servers in some cases. If my understanding is correct, FDW+2PC (without atomic visibility) cannot prevent data inconsistency in the case below. Initially table t1 has only one row with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1 now has different value of i, 0 and 1. 
S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1;	-- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1;	-- i = 1
UPDATE 1
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i
---
 1
(1 row)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
On Thu, 18 Jun 2020 at 08:31, Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > > okay, so it seems we need few things which middleware (Pangea) expects > > if we have to follow the design of paper. > > Yes. > > > I haven't read the paper completely but it sounds quite restrictive > > (like both commits and snapshots need to wait). > > Maybe. There is a performance evaluation in the paper. You might want > to take a look at it. > > > Another point is that > > do we want some middleware involved in the solution? The main thing > > I was looking into at this stage is do we think that the current > > implementation proposed by the patch for 2PC is generic enough that we > > would be later able to integrate the solution for atomic visibility? > > My concern is, FDW+2PC without atomic visibility could lead to data > inconsistency among servers in some cases. If my understanding is > correct, FDW+2PC (without atomic visibility) cannot prevent data > inconsistency in the case below. Initially table t1 has only one row > with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1 > now has different value of i, 0 and 1. IIUC the following sequence won't happen because COMMIT PREPARED 's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'. But as you mentioned, we cannot prevent data inconsistency even with FDW+2PC e.g., when S2 starts a transaction between COMMIT PREPARED on N1 and COMMIT PREPARED on N2 by S1. The point is this data inconsistency is lead by an inconsistent read but not by an inconsistent commit results. I think there are kinds of possibilities causing data inconsistency but atomic commit and atomic visibility eliminate different possibilities. We can eliminate all possibilities of data inconsistency only after we support 2PC and globally MVCC. 
>
> S1/N1: DROP TABLE t1;
> DROP TABLE
> S1/N1: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N1: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N2: DROP TABLE t1;
> DROP TABLE
> S1/N2: CREATE TABLE t1(i int);
> CREATE TABLE
> S1/N2: INSERT INTO t1 VALUES(0);
> INSERT 0 1
> S1/N1: BEGIN;
> BEGIN
> S1/N2: BEGIN;
> BEGIN
> S1/N1: UPDATE t1 SET i = i + 1;	-- i = 1
> UPDATE 1
> S1/N2: UPDATE t1 SET i = i + 1;	-- i = 1
> UPDATE 1
> S1/N1: PREPARE TRANSACTION 's1n1';
> PREPARE TRANSACTION
> S1/N1: COMMIT PREPARED 's1n1';
> COMMIT PREPARED
> S2/N1: BEGIN;
> BEGIN
> S2/N2: BEGIN;
> BEGIN
> S2/N2: DELETE FROM t1 WHERE i = 1;
> DELETE 0
> S2/N1: DELETE FROM t1 WHERE i = 1;
> DELETE 1
> S1/N2: PREPARE TRANSACTION 's1n2';
> PREPARE TRANSACTION
> S2/N1: PREPARE TRANSACTION 's2n1';
> PREPARE TRANSACTION
> S2/N2: PREPARE TRANSACTION 's2n2';
> PREPARE TRANSACTION
> S1/N2: COMMIT PREPARED 's1n2';
> COMMIT PREPARED
> S2/N1: COMMIT PREPARED 's2n1';
> COMMIT PREPARED
> S2/N2: COMMIT PREPARED 's2n2';
> COMMIT PREPARED
> S2/N1: SELECT * FROM t1;
>  i
> ---
> (0 rows)
>
> S2/N2: SELECT * FROM t1;
>  i
> ---
>  1
> (1 row)
>

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> My concern is, FDW+2PC without atomic visibility could lead to data
>> inconsistency among servers in some cases. If my understanding is
>> correct, FDW+2PC (without atomic visibility) cannot prevent data
>> inconsistency in the case below. Initially table t1 has only one row
>> with i = 0 on both N1 and N2. By executing S1 and S2 concurrently, t1
>> now has different value of i, 0 and 1.
>
> IIUC the following sequence won't happen because COMMIT PREPARED
> 's1n1' cannot be executed before PREPARE TRANSACTION 's1n2'.

You are right.

> But as
> you mentioned, we cannot prevent data inconsistency even with FDW+2PC
> e.g., when S2 starts a transaction between COMMIT PREPARED on N1 and
> COMMIT PREPARED on N2 by S1.

Ok, example updated.

S1/N1: DROP TABLE t1;
DROP TABLE
S1/N1: CREATE TABLE t1(i int);
CREATE TABLE
S1/N1: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N2: DROP TABLE t1;
DROP TABLE
S1/N2: CREATE TABLE t1(i int);
CREATE TABLE
S1/N2: INSERT INTO t1 VALUES(0);
INSERT 0 1
S1/N1: BEGIN;
BEGIN
S1/N2: BEGIN;
BEGIN
S1/N1: UPDATE t1 SET i = i + 1;	-- i = 1
UPDATE 1
S1/N2: UPDATE t1 SET i = i + 1;	-- i = 1
UPDATE 1
S2/N1: BEGIN;
BEGIN
S2/N2: BEGIN;
BEGIN
S1/N1: PREPARE TRANSACTION 's1n1';
PREPARE TRANSACTION
S1/N2: PREPARE TRANSACTION 's1n2';
PREPARE TRANSACTION
S2/N1: PREPARE TRANSACTION 's2n1';
PREPARE TRANSACTION
S2/N2: PREPARE TRANSACTION 's2n2';
PREPARE TRANSACTION
S1/N1: COMMIT PREPARED 's1n1';
COMMIT PREPARED
S2/N1: DELETE FROM t1 WHERE i = 1;
DELETE 1
S2/N2: DELETE FROM t1 WHERE i = 1;
DELETE 0
S1/N2: COMMIT PREPARED 's1n2';
COMMIT PREPARED
S2/N1: COMMIT PREPARED 's2n1';
COMMIT PREPARED
S2/N2: COMMIT PREPARED 's2n2';
COMMIT PREPARED
S2/N1: SELECT * FROM t1;
 i
---
(0 rows)

S2/N2: SELECT * FROM t1;
 i
---
 1
(1 row)

> The point is this data inconsistency is
> lead by an inconsistent read but not by an inconsistent commit
> results. I think there are kinds of possibilities causing data
> inconsistency but atomic commit and atomic visibility eliminate
> different possibilities. We can eliminate all possibilities of data
> inconsistency only after we support 2PC and globally MVCC.

IMO any permanent data inconsistency is a serious problem for users no
matter what the technical reasons are.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
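The anomaly in the updated example (atomic commit holds, yet S2 reads in the window between S1's two COMMIT PREPAREDs) can be replayed with a small model. This is my own simplified sketch, not anything from the patch; the `Node` class and the add/remove representation of writes are invented:

```python
class Node:
    """One database node: a set of committed (visible) rows, plus
    prepared-but-uncommitted write sets keyed by global transaction id."""

    def __init__(self, rows):
        self.rows = set(rows)
        self.prepared = {}   # gid -> (rows_to_add, rows_to_remove)

    def prepare(self, gid, add=(), remove=()):
        self.prepared[gid] = (set(add), set(remove))   # durable, not yet visible

    def commit_prepared(self, gid):
        add, remove = self.prepared.pop(gid)
        self.rows = (self.rows - remove) | add

n1, n2 = Node({0}), Node({0})      # t1 holds one row, i = 0, on both nodes

# S1: UPDATE t1 SET i = i + 1 on both nodes; both PREPAREs succeed,
# so atomic commit is guaranteed...
n1.prepare("s1n1", add={1}, remove={0})
n2.prepare("s1n2", add={1}, remove={0})

n1.commit_prepared("s1n1")         # ...but the COMMIT PREPAREDs are not simultaneous.

# S2: DELETE FROM t1 WHERE i = 1 runs between S1's two commits.  Each
# node evaluates the predicate against what is locally visible, so the
# two nodes delete different rows.
n1.prepare("s2n1", remove={r for r in n1.rows if r == 1})   # sees i = 1: DELETE 1
n2.prepare("s2n2", remove={r for r in n2.rows if r == 1})   # sees i = 0: DELETE 0

n2.commit_prepared("s1n2")
n1.commit_prepared("s2n1")
n2.commit_prepared("s2n2")
# Final state: N1 has no rows, N2 has the row i = 1 -- permanently divergent,
# even though every transaction committed atomically on all nodes.
```

This mirrors the transcript's outcome: the inconsistency comes from an inconsistent read, not from an inconsistent commit result.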
On Thu, Jun 18, 2020 at 5:01 AM Tatsuo Ishii <ishii@sraoss.co.jp> wrote: > > > Another point is that > > do we want some middleware involved in the solution? The main thing > > I was looking into at this stage is do we think that the current > > implementation proposed by the patch for 2PC is generic enough that we > > would be later able to integrate the solution for atomic visibility? > > My concern is, FDW+2PC without atomic visibility could lead to data > inconsistency among servers in some cases. If my understanding is > correct, FDW+2PC (without atomic visibility) cannot prevent data > inconsistency in the case below. > You are right and we are not going to claim that after this feature is committed. This feature has independent use cases like it can allow parallel copy when foreign tables are involved once we have parallel copy and surely there will be more. I think it is clear that we need atomic visibility (some way to ensure global consistency) to avoid the data inconsistency problems you and I are worried about and we can do that as a separate patch but at this stage, it would be good if we can have some high-level design of that as well so that if we need some adjustments in the design/implementation of this patch then we can do it now. I think there is some discussion on the other threads (like [1]) about the kind of stuff we are worried about which I need to follow up on to study the impact. Having said that, I don't think that is a reason to stop reviewing or working on this patch. [1] - https://www.postgresql.org/message-id/flat/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 18, 2020 at 04:09:56PM +0530, Amit Kapila wrote: > You are right and we are not going to claim that after this feature is > committed. This feature has independent use cases like it can allow > parallel copy when foreign tables are involved once we have parallel > copy and surely there will be more. I think it is clear that we need > atomic visibility (some way to ensure global consistency) to avoid the > data inconsistency problems you and I are worried about and we can do > that as a separate patch but at this stage, it would be good if we can > have some high-level design of that as well so that if we need some > adjustments in the design/implementation of this patch then we can do > it now. I think there is some discussion on the other threads (like > [1]) about the kind of stuff we are worried about which I need to > follow up on to study the impact. > > Having said that, I don't think that is a reason to stop reviewing or > working on this patch. I think our first step is to allow sharding to work on read-only databases, e.g. data warehousing. Read/write will require global snapshots. It is true that 2PC is limited usefulness without global snapshots, because, by definition, systems using 2PC are read-write systems. However, I can see cases where you are loading data into a data warehouse but want 2PC so the systems remain consistent even if there is a crash during loading. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
On Thu, Jun 18, 2020 at 6:49 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Thu, Jun 18, 2020 at 04:09:56PM +0530, Amit Kapila wrote: > > You are right and we are not going to claim that after this feature is > > committed. This feature has independent use cases like it can allow > > parallel copy when foreign tables are involved once we have parallel > > copy and surely there will be more. I think it is clear that we need > > atomic visibility (some way to ensure global consistency) to avoid the > > data inconsistency problems you and I are worried about and we can do > > that as a separate patch but at this stage, it would be good if we can > > have some high-level design of that as well so that if we need some > > adjustments in the design/implementation of this patch then we can do > > it now. I think there is some discussion on the other threads (like > > [1]) about the kind of stuff we are worried about which I need to > > follow up on to study the impact. > > > > Having said that, I don't think that is a reason to stop reviewing or > > working on this patch. > > I think our first step is to allow sharding to work on read-only > databases, e.g. data warehousing. Read/write will require global > snapshots. It is true that 2PC is limited usefulness without global > snapshots, because, by definition, systems using 2PC are read-write > systems. However, I can see cases where you are loading data into a > data warehouse but want 2PC so the systems remain consistent even if > there is a crash during loading. > For sharding, just implementing 2PC without global consistency provides limited functionality. But for general purpose federated databases 2PC serves an important functionality - atomic visibility. When PostgreSQL is used as one of the coordinators in a heterogeneous federated database system, it's not expected to have global consistency or even atomic visibility. But it needs a guarantee that once a transaction commit, all its legs are committed. 
2PC provides that guarantee as long as the other databases keep their promise that prepared transactions will always get committed when requested so. Subtle to this is HA requirement from these databases as well. So the functionality provided by this patch is important outside the sharding case as well. As you said, even for a data warehousing application, there is some write in the form of loading/merging data. If that write happens across multiple servers, we need atomic commit to be guaranteed. Some of these applications can work even if global consistency and atomic visibility is guaranteed eventually. -- Best Wishes, Ashutosh Bapat
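The atomic-commit guarantee described above (once all legs are prepared, every leg eventually commits; a failure during prepare aborts every leg) is the classic two-phase-commit shape. Below is a minimal sketch with in-memory stand-ins for foreign servers, not the patch's actual FdwXact API; all names are mine:

```python
class Participant:
    """In-memory stand-in for one foreign server.  A real FDW would send
    PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED on the wire."""

    def __init__(self, name, fail_on_prepare=False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare
        self.state = "active"

    def prepare(self, gid):
        if self.fail_on_prepare:
            self.state = "aborted"
            raise RuntimeError(f"{self.name}: PREPARE '{gid}' failed")
        self.state = "prepared"   # durable: survives a crash until resolved

    def commit_prepared(self, gid):
        assert self.state == "prepared"
        self.state = "committed"

    def rollback_prepared(self, gid):
        self.state = "aborted"


def atomic_commit(participants, gid):
    """Phase 1: PREPARE on every participant; only if all succeed does
    phase 2 (COMMIT PREPARED) run.  A phase-1 failure rolls back the
    already-prepared legs.  Once phase 1 fully succeeds, commit is the
    only allowed outcome, so in a real system phase-2 failures must be
    retried (e.g., by a resolver process), never turned into aborts."""
    prepared = []
    for p in participants:
        try:
            p.prepare(gid)
            prepared.append(p)
        except RuntimeError:
            for q in prepared:
                q.rollback_prepared(gid)
            return False
    for p in participants:
        p.commit_prepared(gid)
    return True
```

Either every prepared leg ends "committed" or every prepared leg ends "aborted"; the mixed outcome a plain COMMIT-per-server loop permits cannot occur.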
On Wed, 17 Jun 2020 at 14:07, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 17 Jun 2020 at 09:01, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > > > I've attached the new version patch set. 0006 is a separate patch > > > which introduces 'prefer' mode to foreign_twophase_commit. > > > > I hope we can use this feature. Thank you for making patches and > > discussions. > > I'm currently understanding the logic and found some minor points to be > > fixed. > > > > I'm sorry if my understanding is wrong. > > > > * The v22 patches need rebase as they can't apply to the current master. > > > > * FdwXactAtomicCommitParticipants said in > > src/backend/access/fdwxact/README > > is not implemented. Is FdwXactParticipants right? > > Right. > > > > > * A following comment says that this code is for "One-phase", > > but second argument of FdwXactParticipantEndTransaction() describes > > this code is not "onephase". > > > > AtEOXact_FdwXact() in fdwxact.c > > /* One-phase rollback foreign transaction */ > > FdwXactParticipantEndTransaction(fdw_part, false, false); > > > > static void > > FdwXactParticipantEndTransaction(FdwXactParticipant *fdw_part, bool > > onephase, > > bool for_commit) > > > > * "two_phase_commit" option is mentioned in postgres-fdw.sgml, > > but I can't find related code. > > > > * resolver.c comments have the sentence > > containing two blanks.(Emergency Termination) > > > > * There are some inconsistency with PostgreSQL wiki. > > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions > > > > I understand it's difficult to keep consistency, I think it's ok to > > fix later > > when these patches almost be able to be committed. > > > > - I can't find "two_phase_commit" option in the source code. > > But 2PC is work if the remote server's "max_prepared_transactions" > > is set > > to non zero value. It is correct work, isn't it? > > Yes. I had removed two_phase_commit option from postgres_fdw. 
> Currently, postgres_fdw uses 2pc when 2pc is required. Therefore, > max_prepared_transactions needs to be set to more than one, as you > mentioned. > > > > > - some parameters are renamed or added in latest patches. > > max_prepared_foreign_transaction, max_prepared_transactions and so > > on. > > > > - typo: froeign_transaction_resolver_timeout > > > > Thank you for your review! I've incorporated your comments on the > local branch. I'll share the latest version patch. > > Also, I've updated the wiki page. I'll try to keep the wiki page up-to-date. > I've attached the latest version patches. I've incorporated the review comments I got so far and improved locking strategy. Please review it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- v23-0007-Add-prefer-mode-to-foreign_twophase_commit.patch
- v23-0006-Add-regression-tests-for-foreign-twophase-commit.patch
- v23-0005-postgres_fdw-supports-atomic-commit-APIs.patch
- v23-0004-Documentation-update.patch
- v23-0003-Support-atomic-commit-among-multiple-foreign-ser.patch
- v23-0002-Recreate-RemoveForeignServerById.patch
- v23-0001-Keep-track-of-writing-on-non-temporary-relation.patch
On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > > I've attached the latest version patches. I've incorporated the review > comments I got so far and improved locking strategy. > Thanks for updating the patch. > Please review it. > I think at this stage it is important that we do some study of various approaches to achieve this work and come up with a comparison of the pros and cons of each approach (a) what this patch provides, (b) what is implemented in Global Snapshots patch [1], (c) if possible, what is implemented in Postgres-XL. I fear that if go too far in spending effort on this and later discovered that it can be better done via some other available patch/work (maybe due to a reasons like that approach can easily extended to provide atomic visibility or the design is more robust, etc.) then it can lead to a lot of rework. [1] - https://www.postgresql.org/message-id/20200622150636.GB28999%40momjian.us -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, 23 Jun 2020 at 13:26, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > I've attached the latest version patches. I've incorporated the review > > comments I got so far and improved locking strategy. > > > > Thanks for updating the patch. > > > Please review it. > > > > I think at this stage it is important that we do some study of various > approaches to achieve this work and come up with a comparison of the > pros and cons of each approach (a) what this patch provides, (b) what > is implemented in Global Snapshots patch [1], (c) if possible, what is > implemented in Postgres-XL. I fear that if go too far in spending > effort on this and later discovered that it can be better done via > some other available patch/work (maybe due to a reasons like that > approach can easily extended to provide atomic visibility or the > design is more robust, etc.) then it can lead to a lot of rework. Yeah, I have no objection to that plan but I think we also need to keep in mind that (b), (c), and whatever we are thinking about global consistency are talking about only PostgreSQL (and postgres_fdw). On the other hand, this patch needs to implement the feature that can resolve the atomic commit problem more generically, because the foreign server might be using oracle_fdw, mysql_fdw, or other FDWs connecting database systems supporting 2PC. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 26, 2020 at 10:50 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Tue, 23 Jun 2020 at 13:26, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think at this stage it is important that we do some study of various > > approaches to achieve this work and come up with a comparison of the > > pros and cons of each approach (a) what this patch provides, (b) what > > is implemented in Global Snapshots patch [1], (c) if possible, what is > > implemented in Postgres-XL. I fear that if go too far in spending > > effort on this and later discovered that it can be better done via > > some other available patch/work (maybe due to a reasons like that > > approach can easily extended to provide atomic visibility or the > > design is more robust, etc.) then it can lead to a lot of rework. > > Yeah, I have no objection to that plan but I think we also need to > keep in mind that (b), (c), and whatever we are thinking about global > consistency are talking about only PostgreSQL (and postgres_fdw). > I think we should explore if those approaches could be extended for FDWs and if not then that could be considered as a disadvantage of that approach. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
>> The point is this data inconsistency is
>> lead by an inconsistent read but not by an inconsistent commit
>> results. I think there are kinds of possibilities causing data
>> inconsistency but atomic commit and atomic visibility eliminate
>> different possibilities. We can eliminate all possibilities of data
>> inconsistency only after we support 2PC and globally MVCC.
>
> IMO any permanent data inconsistency is a serious problem for users no
> matter what the technical reasons are.

I have incorporated the "Pangea" algorithm into Pgpool-II to implement
atomic visibility. In the test below I have two PostgreSQL servers (stock
v12), server0 (port 11002) and server1 (port 11003).
default_transaction_isolation was set to 'repeatable read' on both
PostgreSQL servers; this is required by Pangea. Pgpool-II replicates
write queries and sends them to both server0 and server1.

There are two tables, "t1" (having only 1 integer column "i") and "log"
(having only 1 integer column "i").

I have run the following script (inconsistency1.sql) via pgbench:

BEGIN;
UPDATE t1 SET i = i + 1;
END;

like:

pgbench -n -c 1 -T 30 -f inconsistency1.sql

At the same time I have run another session from pgbench concurrently:

BEGIN;
INSERT INTO log SELECT * FROM t1;
END;

pgbench -n -c 1 -T 30 -f inconsistency2.sql

After finishing those two pgbench runs, I ran the following COPY to see
if the contents of table "log" are identical on server0 and server1:

psql -p 11002 -c "\copy log to '11002.txt'"
psql -p 11003 -c "\copy log to '11003.txt'"
cmp 11002.txt 11003.txt

The new Pgpool-II incorporating Pangea showed that 11002.txt and
11003.txt are identical, as expected. This indicates that atomic
visibility is kept. On the other hand, Pgpool-II without Pangea showed
differences in those files.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
> I've attached the latest version patches. I've incorporated the review
> comments I got so far and improved locking strategy.

Thanks for updating the patch!
I have three questions about the v23 patches.


1. messages related to user canceling

In my understanding, there are two messages
which can be output when a user cancels the COMMIT command.

A. When prepare fails, the output shows that the transaction
committed locally but some error occurred.

```
postgres=*# COMMIT;
^CCancel request sent
WARNING:  canceling wait for resolving foreign transaction due to user request
DETAIL:  The transaction has already committed locally, but might not have been committed on the foreign server.
ERROR:  server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
CONTEXT:  remote SQL command: PREPARE TRANSACTION 'fx_1020791818_519_16399_10'
```

B. When prepare succeeds, the output shows that the transaction
committed locally.

```
postgres=*# COMMIT;
^CCancel request sent
WARNING:  canceling wait for resolving foreign transaction due to user request
DETAIL:  The transaction has already committed locally, but might not have been committed on the foreign server.
COMMIT
```

In case A, I think the "committed locally" message can confuse users,
because although the message says "committed", the transaction is
"ABORTED". I think the "committed" message means that the "ABORT" was
committed locally, but is there a possibility of misunderstanding?
In case A, it's better to change the message to be more user friendly, isn't it?


2. typo

Is "trasnactions" in fdwxact.c a typo?


3. FdwXactGetWaiter in fdwxact.c returns an unused value

FdwXactGetWaiter is called in the FXRslvLoop function.
It returns *waitXid_p, but FXRslvLoop doesn't seem to
use *waitXid_p. Do we need to return it?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
On 2020/07/14 9:08, Masahiro Ikeda wrote:
>> I've attached the latest version patches. I've incorporated the review
>> comments I got so far and improved locking strategy.
>
> Thanks for updating the patch!

+1

I'm interested in these patches and now studying them. While checking
the behaviors of the patched PostgreSQL, I got three comments.

1. We can access the foreign table even during recovery in HEAD.
But in the patched version, when I did that, I got the following error.
Is this intentional?

    ERROR:  cannot assign TransactionIds during recovery

2. With the patch, when INSERT/UPDATE/DELETE are executed both in local
and remote servers, 2PC is executed at the commit phase. But when a write
SQL command other than INSERT/UPDATE/DELETE (e.g., TRUNCATE) is executed
locally and INSERT/UPDATE/DELETE are executed in the remote server,
2PC is NOT executed. Is this safe?

3. XACT_FLAGS_WROTENONTEMPREL is set when INSERT/UPDATE/DELETE are
executed. But it's not reset even when those queries are canceled by
ROLLBACK TO SAVEPOINT. This may cause unnecessary 2PC at the commit phase.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
> I've attached the latest version patches. I've incorporated the review
> comments I got so far and improved locking strategy.

I want to ask a question about streaming replication with 2PC.
Are you going to support 2PC with streaming replication?

I tried streaming replication using the v23 patches, and confirmed that
2PC works with streaming replication where there are primary/standby
coordinators. But, in my understanding, the WAL for "PREPARE" and
"COMMIT/ABORT PREPARED" can't be replicated to the standby server in
sync. If this is right, an unresolved transaction can occur.

For example,

1. PREPARE is done
2. the primary crashes before the WAL related to PREPARE is replicated to the standby server
3. promote the standby server  // but it can't execute "ABORT PREPARED"

In the above case, the remote server has an unresolved transaction.
Can we solve this problem to support in-sync replication?

But, I think some users use async replication for performance.
Do we need to document the limitation or make another solution?

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION
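The failover scenario in this question can be modeled abstractly. This is my own sketch, not the patch's actual WAL layout or record names: a coordinator logs a record for each foreign transaction it prepares, a standby receives only a prefix of that WAL under async replication, and a promoted standby cannot resolve prepared foreign transactions it never heard about:

```python
class RemoteServer:
    """Foreign server holding prepared transactions awaiting resolution."""
    def __init__(self):
        self.prepared = set()

class Coordinator:
    """Coordinator with a WAL modeled as a list of records, shipped
    asynchronously to a standby (which may lag arbitrarily)."""
    def __init__(self):
        self.wal = []

    def prepare_foreign(self, remote, gid):
        remote.prepared.add(gid)           # PREPARE TRANSACTION on the remote...
        self.wal.append(("fdwxact", gid))  # ...and a local record of it

def promote_standby(primary, replicated_upto):
    """Promote a standby that received only the first `replicated_upto`
    WAL records before the primary crashed."""
    standby = Coordinator()
    standby.wal = list(primary.wal[:replicated_upto])
    return standby

remote = RemoteServer()
primary = Coordinator()
primary.prepare_foreign(remote, "fx_1")
# Primary crashes here.  With async replication, the record for the
# prepared foreign transaction may not have reached the standby yet.
standby = promote_standby(primary, replicated_upto=0)
known = {gid for _, gid in standby.wal}
orphaned = remote.prepared - known   # prepared on the remote, unknown to the new primary
```

With `replicated_upto=0`, `fx_1` is orphaned on the remote server and no one will ever issue COMMIT/ABORT PREPARED for it; had the record been replicated before the crash, the promoted standby's resolver could resolve it.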
On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > I've attached the latest version patches. I've incorporated the review > > comments I got so far and improved locking strategy. > > Thanks for updating the patch! > I have three questions about the v23 patches. > > > 1. messages related to user canceling > > In my understanding, there are two messages > which can be output when a user cancels the COMMIT command. > > A. When prepare is failed, the output shows that > committed locally but some error is occurred. > > ``` > postgres=*# COMMIT; > ^CCancel request sent > WARNING: canceling wait for resolving foreign transaction due to user > request > DETAIL: The transaction has already committed locally, but might not > have been committed on the foreign server. > ERROR: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > CONTEXT: remote SQL command: PREPARE TRANSACTION > 'fx_1020791818_519_16399_10' > ``` > > B. When prepare is succeeded, > the output show that committed locally. > > ``` > postgres=*# COMMIT; > ^CCancel request sent > WARNING: canceling wait for resolving foreign transaction due to user > request > DETAIL: The transaction has already committed locally, but might not > have been committed on the foreign server. > COMMIT > ``` > > In case of A, I think that "committed locally" message can confuse user. > Because although messages show committed but the transaction is > "ABORTED". > > I think "committed" message means that "ABORT" is committed locally. > But is there a possibility of misunderstanding? No, you're right. I'll fix it in the next version patch. I think synchronous replication also has the same problem. It says "the transaction has already committed" but it's not true when executing ROLLBACK PREPARED. BTW how did you test the case (A)? 
It says canceling wait for foreign transaction resolution but the remote SQL command is PREPARE TRANSACTION. > > In case of A, it's better to change message for user friendly, isn't it? > > > 2. typo > > Is trasnactions in fdwxact.c typo? > Fixed. > > 3. FdwXactGetWaiter in fdwxact.c return unused value > > FdwXactGetWaiter is called in FXRslvLoop function. > It returns *waitXid_p, but FXRslvloop doesn't seem to > use *waitXid_p. Do we need to return it? Removed. I've incorporated your comments above in the local branch. I'll post the latest version patch after incorporating other comments soon. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/07/15 15:06, Masahiko Sawada wrote: > On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >> >>> I've attached the latest version patches. I've incorporated the review >>> comments I got so far and improved locking strategy. >> >> Thanks for updating the patch! >> I have three questions about the v23 patches. >> >> >> 1. messages related to user canceling >> >> In my understanding, there are two messages >> which can be output when a user cancels the COMMIT command. >> >> A. When prepare is failed, the output shows that >> committed locally but some error is occurred. >> >> ``` >> postgres=*# COMMIT; >> ^CCancel request sent >> WARNING: canceling wait for resolving foreign transaction due to user >> request >> DETAIL: The transaction has already committed locally, but might not >> have been committed on the foreign server. >> ERROR: server closed the connection unexpectedly >> This probably means the server terminated abnormally >> before or while processing the request. >> CONTEXT: remote SQL command: PREPARE TRANSACTION >> 'fx_1020791818_519_16399_10' >> ``` >> >> B. When prepare is succeeded, >> the output show that committed locally. >> >> ``` >> postgres=*# COMMIT; >> ^CCancel request sent >> WARNING: canceling wait for resolving foreign transaction due to user >> request >> DETAIL: The transaction has already committed locally, but might not >> have been committed on the foreign server. >> COMMIT >> ``` >> >> In case of A, I think that "committed locally" message can confuse user. >> Because although messages show committed but the transaction is >> "ABORTED". >> >> I think "committed" message means that "ABORT" is committed locally. >> But is there a possibility of misunderstanding? > > No, you're right. I'll fix it in the next version patch. > > I think synchronous replication also has the same problem. It says > "the transaction has already committed" but it's not true when > executing ROLLBACK PREPARED. Yes. 
Also the same message is logged when executing PREPARE TRANSACTION. Maybe it should be changed to "the transaction has already prepared". Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-07-15 15:06, Masahiko Sawada wrote: > On Tue, 14 Jul 2020 at 09:08, Masahiro Ikeda <ikedamsh@oss.nttdata.com> > wrote: >> >> > I've attached the latest version patches. I've incorporated the review >> > comments I got so far and improved locking strategy. >> >> Thanks for updating the patch! >> I have three questions about the v23 patches. >> >> >> 1. messages related to user canceling >> >> In my understanding, there are two messages >> which can be output when a user cancels the COMMIT command. >> >> A. When prepare is failed, the output shows that >> committed locally but some error is occurred. >> >> ``` >> postgres=*# COMMIT; >> ^CCancel request sent >> WARNING: canceling wait for resolving foreign transaction due to user >> request >> DETAIL: The transaction has already committed locally, but might not >> have been committed on the foreign server. >> ERROR: server closed the connection unexpectedly >> This probably means the server terminated abnormally >> before or while processing the request. >> CONTEXT: remote SQL command: PREPARE TRANSACTION >> 'fx_1020791818_519_16399_10' >> ``` >> >> B. When prepare is succeeded, >> the output show that committed locally. >> >> ``` >> postgres=*# COMMIT; >> ^CCancel request sent >> WARNING: canceling wait for resolving foreign transaction due to user >> request >> DETAIL: The transaction has already committed locally, but might not >> have been committed on the foreign server. >> COMMIT >> ``` >> >> In case of A, I think that "committed locally" message can confuse >> user. >> Because although messages show committed but the transaction is >> "ABORTED". >> >> I think "committed" message means that "ABORT" is committed locally. >> But is there a possibility of misunderstanding? > > No, you're right. I'll fix it in the next version patch. > > I think synchronous replication also has the same problem. It says > "the transaction has already committed" but it's not true when > executing ROLLBACK PREPARED. 
Thanks for replying and sharing the synchronous replication problem. > BTW how did you test the case (A)? It says canceling wait for foreign > transaction resolution but the remote SQL command is PREPARE > TRANSACTION. I think the timing of failure is important for 2PC testing. Since I don't have any good solution to simulate those flexibly, I used the GDB debugger. The message of the case (A) is sent after performing the following operations. 1. Attach the debugger to a backend process. 2. Set a breakpoint at PreCommit_FdwXact() in CommitTransaction(). // Before PREPARE. 3. Execute "BEGIN" and insert data into two remote foreign tables. 4. Issue a "COMMIT" command. 5. The backend process stops at the breakpoint. 6. Stop a remote foreign server. 7. Detach the debugger. // The backend continues and the prepare fails. The transaction resolver tries to abort all remote txs. // It's unnecessary to resolve remote txs whose prepare failed, isn't it? 8. Send a cancel request. BTW, I'm concerned about how to test the 2PC patches. There are many failure patterns, such as failure timing, server/network failures (and unexpected recovery), and those combinations... Though it's best to test those failure patterns automatically, I have no idea for now, so I manually check some patterns. > I've incorporated the above your comments in the local branch. I'll > post the latest version patch after incorporating other comments soon. OK, Thanks. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > I've attached the latest version patches. I've incorporated the review > > comments I got so far and improved locking strategy. > > I want to ask a question about streaming replication with 2PC. > Are you going to support 2PC with streaming replication? > > I tried streaming replication using v23 patches. > I confirm that 2PC works with streaming replication, > which there are primary/standby coordinator. > > But, in my understanding, the WAL of "PREPARE" and > "COMMIT/ABORT PREPARED" can't be replicated to the standby server in > sync. > > If this is right, the unresolved transaction can be occurred. > > For example, > > 1. PREPARE is done > 2. crash primary before the WAL related to PREPARE is > replicated to the standby server > 3. promote standby server // but can't execute "ABORT PREPARED" > > In above case, the remote server has the unresolved transaction. > Can we solve this problem to support in-sync replication? > > But, I think some users use async replication for performance. > Do we need to document the limitation or make another solution? > IIUC with synchronous replication, we can guarantee that WAL records are written on both primary and replicas when the client got an acknowledgment of commit. We don't replicate each WAL records generated during transaction one by one in sync. In the case you described, the client will get an error due to the server crash. Therefore I think the user cannot expect WAL records generated so far has been replicated. The same issue could happen also when the user executes PREPARE TRANSACTION and the server crashes. To prevent this issue, I think we would need to send each WAL records in sync but I'm not sure it's reasonable behavior, and as long as we write WAL in the local and then send it to replicas we would need a smart mechanism to prevent this situation. 
Related to the point raised by Ikeda-san, I realized that with the current patch the backend waits for synchronous replication and then waits for foreign transaction resolution. But it should be reversed. Otherwise, it could lead to data loss even when the client got an acknowledgment of commit. Also, when the user is using both atomic commit and synchronous replication and wants to cancel waiting, he/she will need to press ctrl-c twice with the current patch, which also should be fixed. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
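The failure window discussed above (the primary crashing before the PREPARE record reaches the standby) can be sketched as a toy model. All names here are illustrative; this is not code from the patch.

```python
# Toy model of the orphaned-prepared-transaction window described
# above. All names are illustrative; this is not code from the patch.

class Remote:
    def __init__(self):
        self.prepared = set()   # foreign prepared transactions by id

class Coordinator:
    def __init__(self):
        self.wal = []           # locally flushed WAL records
        self.replicated = []    # records that reached the standby

    def prepare_remote(self, remote, fx_id):
        remote.prepared.add(fx_id)           # PREPARE TRANSACTION on the remote
        self.wal.append(("PREPARE", fx_id))  # local WAL, not yet replicated

    def replicate(self):
        self.replicated = list(self.wal)

remote = Remote()
primary = Coordinator()
primary.prepare_remote(remote, "fx_1")
# Crash happens here, before primary.replicate() runs: the standby
# never learns about fx_1 and, once promoted, cannot issue
# ABORT PREPARED 'fx_1' -- the remote is left with an orphan.
known_to_standby = {fx for _, fx in primary.replicated}
orphaned = remote.prepared - known_to_standby
print(orphaned)  # {'fx_1'}
```

Replicating each WAL record synchronously would close this window, at the performance cost both participants note above.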
RE: Transactions involving multiple postgres foreign servers, take 2
Hi Sawada san, I'm reviewing this patch series, and let me give some initial comments and questions. I'm looking at this with a hope that this will be useful purely as a FDW enhancement for our new use cases, regardless of whether the FDW will be used for Postgres scale-out. I don't think it's necessarily required to combine 2PC with the global visibility. X/Open XA specification only handles the atomic commit. The only part in the XA specification that refers to global visibility is the following: [Quote from XA specification] -------------------------------------------------- 2.3.2 Protocol Optimisations ・ Read-only An RM can respond to the TM’s prepare request by asserting that the RM was not asked to update shared resources in this transaction branch. This response concludes the RM’s involvement in the transaction; the Phase 2 dialogue between the TM and this RM does not occur. The TM need not stably record, in its list of participating RMs, an RM that asserts a read-only role in the global transaction. However, if the RM returns the read-only optimisation before all work on the global transaction is prepared, global serialisability1 cannot be guaranteed. This is because the RM may release transaction context, such as read locks, before all application activity for that global transaction is finished. 1. Serialisability is a property of a set of concurrent transactions. For a serialisable set of transactions, at least one serial sequence of the transactions exists that produces identical results, with respect to shared resources, as does concurrent execution of the transaction. -------------------------------------------------- (1) Do other popular DBMSs (Oracle, MySQL, etc.) provide concrete functions that can be used for the new FDW commit/rollback/prepare API? I'm asking this to confirm that we really need to provide these functions, not as the transaction callbacks for postgres_fdw. (2) How are data modifications tracked in local and remote transactions?
0001 seems to handle local INSERT/DELETE/UPDATE. Especially: * COPY FROM to local/remote tables/views. * User-defined function calls that modify data, e.g. SELECT func1() WHERE col = func2() (3) Does the 2PC processing always go through the background worker? Is the group commit effective on the remote server? That is, PREPARE and COMMIT PREPARED issued from multiple remote sessions are written to WAL in batch? Regards Takayuki Tsunakawa
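The read-only optimisation quoted from the XA specification above can be illustrated with a minimal sketch (names are illustrative, not from any real transaction manager):

```python
# Minimal sketch of the XA read-only optimisation quoted above: an RM
# that votes read-only at prepare time is excluded from phase 2.
# Names are illustrative, not from any real TM.

def commit_participants(votes):
    """votes maps RM name -> response to the TM's prepare request."""
    # Read-only RMs concluded their involvement at prepare; the TM
    # need not stably record them nor send them a phase-2 COMMIT.
    return [rm for rm, vote in votes.items() if vote == "ok"]

votes = {"rm_a": "ok", "rm_b": "read-only", "rm_c": "ok"}
print(commit_participants(votes))  # ['rm_a', 'rm_c']
```

As the quoted footnote warns, dropping a read-only participant early is only safe for serialisability once all work on the global transaction has been prepared.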
On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/07/14 9:08, Masahiro Ikeda wrote: > >> I've attached the latest version patches. I've incorporated the review > >> comments I got so far and improved locking strategy. > > > > Thanks for updating the patch! > > +1 > I'm interested in these patches and now studying them. While checking > the behaviors of the patched PostgreSQL, I got three comments. Thank you for testing this patch! > > 1. We can access to the foreign table even during recovery in the HEAD. > But in the patched version, when I did that, I got the following error. > Is this intentional? > > ERROR: cannot assign TransactionIds during recovery No, it should be fixed. I'm going to fix this by not collecting participants for atomic commit during recovery. > > 2. With the patch, when INSERT/UPDATE/DELETE are executed both in > local and remote servers, 2PC is executed at the commit phase. But > when write SQL (e.g., TRUNCATE) except INSERT/UPDATE/DELETE are > executed in local and INSERT/UPDATE/DELETE are executed in remote, > 2PC is NOT executed. Is this safe? Hmm, you're right. I think atomic commit must be used also when the user executes other write SQLs such as TRUNCATE, COPY, CLUSTER, and CREATE TABLE on the local node. > > 3. XACT_FLAGS_WROTENONTEMPREL is set when INSERT/UPDATE/DELETE > are executed. But it's not reset even when those queries are canceled by > ROLLBACK TO SAVEPOINT. This may cause unnecessary 2PC at the commit phase. Will fix. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-07-16 13:16, Masahiko Sawada wrote: > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> > wrote: >> >> > I've attached the latest version patches. I've incorporated the review >> > comments I got so far and improved locking strategy. >> >> I want to ask a question about streaming replication with 2PC. >> Are you going to support 2PC with streaming replication? >> >> I tried streaming replication using v23 patches. >> I confirm that 2PC works with streaming replication, >> which there are primary/standby coordinator. >> >> But, in my understanding, the WAL of "PREPARE" and >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in >> sync. >> >> If this is right, the unresolved transaction can be occurred. >> >> For example, >> >> 1. PREPARE is done >> 2. crash primary before the WAL related to PREPARE is >> replicated to the standby server >> 3. promote standby server // but can't execute "ABORT PREPARED" >> >> In above case, the remote server has the unresolved transaction. >> Can we solve this problem to support in-sync replication? >> >> But, I think some users use async replication for performance. >> Do we need to document the limitation or make another solution? >> > > IIUC with synchronous replication, we can guarantee that WAL records > are written on both primary and replicas when the client got an > acknowledgment of commit. We don't replicate each WAL records > generated during transaction one by one in sync. In the case you > described, the client will get an error due to the server crash. > Therefore I think the user cannot expect WAL records generated so far > has been replicated. The same issue could happen also when the user > executes PREPARE TRANSACTION and the server crashes. Thanks! I didn't noticed the behavior when a user executes PREPARE TRANSACTION is same. IIUC with 2PC, there is a different point between (1)PREPARE TRANSACTION and (2)2PC. 
The point is whether the client can know when the server crashed and its global tx id. If (1) PREPARE TRANSACTION fails, it's OK for the client to execute the same command again, because if the remote server has already prepared, the command will be ignored. But if (2) 2PC fails because of a coordinator crash, the client can't know what operations should be done. If the old coordinator already executed PREPARED, there are some transactions which should be ABORT PREPARED. But if the PREPARED WAL is not sent to the standby, the new coordinator can't execute ABORT PREPARED. And the client can't know which remote servers have PREPARED transactions which should be ABORTED either. Even if the client could know that, only the old coordinator knows its global transaction id. Only the database administrator can analyze the old coordinator's log and then execute the appropriate commands manually, right?
In my understanding, if the COMMIT WAL is replicated to the standby in sync, the standby server can resolve the transaction after crash recovery once promoted. If reversed, there are some situations which can't guarantee atomic commit. In the case that some foreign transaction resolutions succeed but others fail (and the COMMIT WAL is not replicated), the standby must ABORT PREPARED because the COMMIT WAL is not replicated. This means that some foreign transactions are COMMIT PREPARED executed by the primary coordinator, while other foreign transactions can be ABORT PREPARED executed by the secondary coordinator. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Thu, 16 Jul 2020 at 13:53, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > Hi Sawada san, > > > I'm reviewing this patch series, and let me give some initial comments and questions. I'm looking at this with a hope that this will be useful purely as a FDW enhancement for our new use cases, regardless of whether the FDW will be used for Postgres scale-out. Thank you for reviewing this patch! Yes, this patch is trying to resolve the generic atomic commit problem w.r.t. FDW, and will be useful also for Postgres scale-out. > > I don't think it's necessarily required to combine 2PC with the global visibility. X/Open XA specification only handles the atomic commit. The only part in the XA specification that refers to global visibility is the following: > > > [Quote from XA specification] > -------------------------------------------------- > 2.3.2 Protocol Optimisations > ・ Read-only > An RM can respond to the TM’s prepare request by asserting that the RM was not > asked to update shared resources in this transaction branch. This response > concludes the RM’s involvement in the transaction; the Phase 2 dialogue between > the TM and this RM does not occur. The TM need not stably record, in its list of > participating RMs, an RM that asserts a read-only role in the global transaction. > > However, if the RM returns the read-only optimisation before all work on the global > transaction is prepared, global serialisability1 cannot be guaranteed. This is because > the RM may release transaction context, such as read locks, before all application > activity for that global transaction is finished. > > 1. > Serialisability is a property of a set of concurrent transactions. For a serialisable set of transactions, at least one > serial sequence of the transactions exists that produces identical results, with respect to shared resources, as does > concurrent execution of the transaction. > -------------------------------------------------- > Agreed.
> > (1) > Do other popular DBMSs (Oracle, MySQL, etc.) provide concrete functions that can be used for the new FDW commit/rollback/prepare API? I'm asking this to confirm that we really need to provide these functions, not as the transaction callbacks for postgres_fdw. > I have briefly checked only oracle_fdw, but in general I think that if an existing FDW supports transaction begin, commit, and rollback, these can be ported to the new FDW transaction APIs easily. Regarding the comparison between FDW transaction APIs and transaction callbacks, I think one of the benefits of providing FDW transaction APIs is that the core is able to manage the status of foreign transactions. We need to track the status of individual foreign transactions to support atomic commit. If we use transaction callbacks (XactCallback) that many FDWs are using, I think we will end up calling the transaction callback and leaving the transaction work to FDWs, meaning that the core is not able to know the return values of PREPARE TRANSACTION, for example. We can add more arguments passed to transaction callbacks to get the return value from FDWs but I don’t think it’s a good idea as transaction callbacks are used not only by FDW but also other external modules. > > (2) > How are data modifications tracked in local and remote transactions? 0001 seems to handle local INSERT/DELETE/UPDATE. Especially: > > * COPY FROM to local/remote tables/views. > > * User-defined function calls that modify data, e.g. SELECT func1() WHERE col = func2() > With the current version patch (v23), it supports only INSERT/DELETE/UPDATE. But I'm going to change the patch so that it supports other write SQLs as Fujii-san also pointed out. > > (3) > Does the 2PC processing always go through the background worker? > Is the group commit effective on the remote server? That is, PREPARE and COMMIT PREPARED issued from multiple remote sessions are written to WAL in batch?
No, in the current design, the backend who received a query from the client does PREPARE, and then the transaction resolver process, a background worker, does COMMIT PREPARED. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
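The division of labour described here — the backend issues PREPARE, a background resolver later issues COMMIT PREPARED — can be sketched as a toy model (hypothetical structure, not the patch's actual code):

```python
from collections import deque

# Toy model of the division of labour: the backend prepares foreign
# transactions and queues them; a background resolver commits them.
# Structure and names are illustrative, not taken from the patch.

resolver_queue = deque()
remote_state = {}  # fx id -> state on the remote server

def backend_commit(fx_ids):
    for fx in fx_ids:
        remote_state[fx] = "prepared"   # backend: PREPARE TRANSACTION <fx>
        resolver_queue.append(fx)       # hand off to the resolver

def resolver_loop():
    while resolver_queue:
        fx = resolver_queue.popleft()
        remote_state[fx] = "committed"  # resolver: COMMIT PREPARED <fx>

backend_commit(["fx_1", "fx_2"])
resolver_loop()
print(remote_state)  # {'fx_1': 'committed', 'fx_2': 'committed'}
```

The split matters for the group-commit question: whether the resolver drains its queue one transaction at a time or in batches determines how many WAL syncs the foreign server performs.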
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> I have briefly checked the only oracle_fdw but in general I think that > if an existing FDW supports transaction begin, commit, and rollback, > these can be ported to new FDW transaction APIs easily. Does oracle_fdw support begin, commit and rollback? And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction? In other words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC processing? I think we need to confirm that there are concrete examples. What I'm worried about is that if only postgres_fdw can implement the prepare function, it's a sign that the FDW interface will be riddled with functions only for Postgres. That is, the FDW interface is getting away from its original purpose "access external data as a relation" and becoming complex. Tomas Vondra showed this concern as follows: Horizontal scalability/sharding https://www.postgresql.org/message-id/flat/CANP8%2BjK%3D%2B3zVYDFY0oMAQKQVJ%2BqReDHr1UPdyFEELO82yVfb9A%40mail.gmail.com#2c45f0ee97855449f1f7fedcef1d5e11 [Tomas Vondra's remarks] -------------------------------------------------- > This strikes me as a bit of a conflict of interest with FDW which > seems to want to hide the fact that it's foreign; the FDW > implementation makes it's own optimization decisions which might > make sense for single table queries but breaks down in the face of > joins. +1 to these concerns In my mind, FDW is a wonderful tool to integrate PostgreSQL with external data sources, and it's nicely shaped for this purpose, which implies the abstractions and assumptions in the code. The truth however is that many current uses of the FDW API are actually using it for different purposes because there's no other way to do that, not because FDWs are the "right way". And this includes the attempts to build sharding on FDW, I think.
Situations like this result in "improvements" of the API that seem to improve the API for the second group, but make the life harder for the original FDW API audience by making the API needlessly complex. And I say "seem to improve" because the second group eventually runs into the fundamental abstractions and assumptions the API is based on anyway. And based on the discussions at pgcon, I think this is the main reason why people cringe when they hear "FDW" and "sharding" in the same sentence. ... My other worry is that we'll eventually mess the FDW infrastructure, making it harder to use for the original purpose. Granted, most of the improvements proposed so far look sane and useful for FDWs in general, but sooner or later that ceases to be the case - there sill be changes needed merely for the sharding. Those will be tough decisions. -------------------------------------------------- > Regarding the comparison between FDW transaction APIs and transaction > callbacks, I think one of the benefits of providing FDW transaction > APIs is that the core is able to manage the status of foreign > transactions. We need to track the status of individual foreign > transactions to support atomic commit. If we use transaction callbacks > (XactCallback) that many FDWs are using, I think we will end up > calling the transaction callback and leave the transaction work to > FDWs, leading that the core is not able to know the return values of > PREPARE TRANSACTION for example. We can add more arguments passed to > transaction callbacks to get the return value from FDWs but I don’t > think it’s a good idea as transaction callbacks are used not only by > FDW but also other external modules. To track the foreign transaction status, we can add GetTransactionStatus() to the FDW interface as an alternative, can'twe? > With the current version patch (v23), it supports only > INSERT/DELETE/UPDATE. 
But I'm going to change the patch so that it > supports other writes SQLs as Fujii-san also pointed out. OK. I've just read that Fujii san already pointed out a similar thing. But I wonder if we can know that the UDF executed on the foreign server has updated data. Maybe we can know or guess it by calling txid_current_if_any() or checking the transaction status in FE/BE protocol, but can we deal with FDWs other than postgres_fdw? > No, in the current design, the backend who received a query from the > client does PREPARE, and then the transaction resolver process, a > background worker, does COMMIT PREPARED. This "No" means the current implementation cannot group commits from multiple transactions? Does the transaction resolver send COMMIT PREPARED and wait for its response for each transaction one by one? For example, [local server] Transaction T1 and T2 performs 2PC at the same time. Transaction resolver sends COMMIT PREPARED for T1 and then waits for the response. T1 writes COMMIT PREPARED record locally and syncs the WAL. Transaction resolver sends COMMIT PREPARED for T2 and then waits for the response. T2 writes COMMIT PREPARED record locally and syncs the WAL. [foreign server] T1 writes COMMIT PREPARED record locally and syncs the WAL. T2 writes COMMIT PREPARED record locally and syncs the WAL. If the WAL records of multiple concurrent transactions are written and synced separately, i.e. group commit doesn't take effect, then the OLTP transaction performance will be unacceptable. Regards Takayuki Tsunakawa
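The performance concern above can be made concrete with a toy flush-count model (purely illustrative; the batch size and function names are assumptions, not the patch's behaviour):

```python
import math

# Toy flush-count model for the group-commit concern above: resolving
# one transaction at a time forces one WAL sync per COMMIT PREPARED on
# the foreign server, while batching would share one sync per group.
# The batch size of 16 is an assumption, purely for illustration.

def serial_flushes(n_txns):
    return n_txns  # each COMMIT PREPARED written and synced alone

def grouped_flushes(n_txns, batch=16):
    return math.ceil(n_txns / batch)  # one sync covers a whole batch

print(serial_flushes(64), grouped_flushes(64))  # 64 4
```

Under this model, serial resolution costs 16x more WAL syncs than batching at the assumed group size, which is the gap behind the "unacceptable OLTP performance" worry.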
On Fri, 2020-07-17 at 05:21 +0000, tsunakawa.takay@fujitsu.com wrote: > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > I have briefly checked the only oracle_fdw but in general I think that > > if an existing FDW supports transaction begin, commit, and rollback, > > these can be ported to new FDW transaction APIs easily. > > Does oracle_fdw support begin, commit and rollback? Yes. > And most importantly, do other major DBMSs, including Oracle, provide the API for > preparing a transaction? In other words, will the FDWs other than postgres_fdw > really be able to take advantage of the new FDW functions to join the 2PC processing? > I think we need to confirm that there are concrete examples. I bet they do. There is even a standard for that. I am not looking forward to adapting oracle_fdw, and I didn't read the patch. But using distributed transactions is certainly a good thing if it is done right. The trade off is the need for a transaction manager, and implementing that correctly is a high price to pay. Yours, Laurenz Albe
On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > On 2020-07-16 13:16, Masahiko Sawada wrote: > > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> > > wrote: > >> > >> > I've attached the latest version patches. I've incorporated the review > >> > comments I got so far and improved locking strategy. > >> > >> I want to ask a question about streaming replication with 2PC. > >> Are you going to support 2PC with streaming replication? > >> > >> I tried streaming replication using v23 patches. > >> I confirm that 2PC works with streaming replication, > >> which there are primary/standby coordinator. > >> > >> But, in my understanding, the WAL of "PREPARE" and > >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in > >> sync. > >> > >> If this is right, the unresolved transaction can be occurred. > >> > >> For example, > >> > >> 1. PREPARE is done > >> 2. crash primary before the WAL related to PREPARE is > >> replicated to the standby server > >> 3. promote standby server // but can't execute "ABORT PREPARED" > >> > >> In above case, the remote server has the unresolved transaction. > >> Can we solve this problem to support in-sync replication? > >> > >> But, I think some users use async replication for performance. > >> Do we need to document the limitation or make another solution? > >> > > > > IIUC with synchronous replication, we can guarantee that WAL records > > are written on both primary and replicas when the client got an > > acknowledgment of commit. We don't replicate each WAL records > > generated during transaction one by one in sync. In the case you > > described, the client will get an error due to the server crash. > > Therefore I think the user cannot expect WAL records generated so far > > has been replicated. The same issue could happen also when the user > > executes PREPARE TRANSACTION and the server crashes. > > Thanks! 
I didn't noticed the behavior when a user executes PREPARE > TRANSACTION is same. > > IIUC with 2PC, there is a different point between (1)PREPARE TRANSACTION > and (2)2PC. > The point is that whether the client can know when the server crashed > and it's global tx id. > > If (1)PREPARE TRANSACTION is failed, it's ok the client execute same > command > because if the remote server is already prepared the command will be > ignored. > > But, if (2)2PC is failed with coordinator crash, the client can't know > what operations should be done. > > If the old coordinator already executed PREPARED, there are some > transaction which should be ABORT PREPARED. > But if the PREPARED WAL is not sent to the standby, the new coordinator > can't execute ABORT PREPARED. > And the client can't know which remote servers have PREPARED > transactions which should be ABORTED either. > > Even if the client can know that, only the old coordinator knows its > global transaction id. > Only the database administrator can analyze the old coordinator's log > and then execute the appropriate commands manually, right? I think that's right. In the case of the coordinator crash, the user can look orphaned foreign prepared transactions by checking the 'identifier' column of pg_foreign_xacts on the new standby server and the prepared transactions on the remote servers. > > > > To prevent this > > issue, I think we would need to send each WAL records in sync but I'm > > not sure it's reasonable behavior, and as long as we write WAL in the > > local and then send it to replicas we would need a smart mechanism to > > prevent this situation. > > I agree. To send each 2PC WAL records in sync must be with a large > performance impact. > At least, we need to document the limitation and how to handle this > situation. Ok. I'll add it. 
> > > > Related to the pointing out by Ikeda-san, I realized that with the > > current patch the backend waits for synchronous replication and then > > waits for foreign transaction resolution. But it should be reversed. > > Otherwise, it could lead to data loss even when the client got an > > acknowledgment of commit. Also, when the user is using both atomic > > commit and synchronous replication and wants to cancel waiting, he/she > > will need to press ctl-c twice with the current patch, which also > > should be fixed. > > I'm sorry that I can't understood. > > In my understanding, if COMMIT WAL is replicated to the standby in sync, > the standby server can resolve the transaction after crash recovery in > promoted phase. > > If reversed, there are some situation which can't guarantee atomic > commit. > In case that some foreign transaction resolutions are succeed but others > are failed(and COMMIT WAL is not replicated), > the standby must ABORT PREPARED because the COMMIT WAL is not > replicated. > This means that some foreign transactions are COMMITE PREPARED executed > by primary coordinator, > other foreign transactions can be ABORT PREPARED executed by secondary > coordinator. You're right. Thank you for pointing out! If the coordinator crashes after the client gets acknowledgment of the successful commit of the transaction but before sending XLOG_FDWXACT_REMOVE record to the replicas, the FdwXact entries are left on the replicas even after failover. But since we require FDW to tolerate the error of undefined prepared transactions in COMMIT/ROLLBACK PREPARED it won’t be a critical problem. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
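The point about tolerating undefined prepared transactions can be illustrated with a small sketch. This is hypothetical code, not from the patch: a resolver that treats "prepared transaction does not exist" as already-resolved, which is what makes leftover FdwXact entries on a promoted standby harmless.

```python
# Hypothetical sketch (not the patch's code): idempotent resolution of a
# prepared foreign transaction. Names are illustrative only.

class RemoteServer:
    """Toy stand-in for a foreign server holding prepared transactions."""
    def __init__(self, prepared):
        self.prepared = set(prepared)

    def commit_prepared(self, gid):
        if gid not in self.prepared:
            # Corresponds to the "prepared transaction does not exist"
            # error that COMMIT PREPARED raises in PostgreSQL.
            raise LookupError(f'prepared transaction "{gid}" does not exist')
        self.prepared.remove(gid)

def resolve(server, gid):
    """Resolve one foreign transaction, tolerating an already-resolved one."""
    try:
        server.commit_prepared(gid)
        return "committed"
    except LookupError:
        # Another coordinator (e.g. the crashed primary) already resolved
        # it before the failover; treat this as success.
        return "already resolved"
```

Resolving the same identifier twice is then harmless: the second attempt simply reports that the transaction was already resolved, so a stale FdwXact entry on the new primary does not cause an error.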
RE: Transactions involving multiple postgres foreign servers, take 2
From: Laurenz Albe <laurenz.albe@cybertec.at> > On Fri, 2020-07-17 at 05:21 +0000, tsunakawa.takay@fujitsu.com wrote: > > And most importantly, do other major DBMSs, including Oracle, provide the > API for > > preparing a transaction? In other words, will the FDWs other than > postgres_fdw > > really be able to take advantage of the new FDW functions to join the 2PC > processing? > > I think we need to confirm that there are concrete examples. > > I bet they do. There is even a standard for that. If you're thinking of xa_prepare() defined in the X/Open XA specification, we need to be sure that other FDWs can really utilize this new 2PC mechanism. What I'm especially wondering is when the FDW can call xa_start(). Regards Takayuki Tsunakawa
On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > I have briefly checked the only oracle_fdw but in general I think that > > if an existing FDW supports transaction begin, commit, and rollback, > > these can be ported to new FDW transaction APIs easily. > > Does oracle_fdw support begin, commit and rollback? > > And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction? In other words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC processing? I think we need to confirm that there are concrete examples. I also believe they do. But I'm concerned that some FDW needs to start a transaction differently when using 2PC. For instance, IIUC MySQL also supports 2PC but the transaction needs to be started with "XA START id” when the transaction needs to be prepared. The transaction started with XA START can be closed by XA END followed by XA PREPARE or XA COMMIT ONE PHASE. It means that when starting a new transaction, the FDW needs to prepare the transaction identifier and to know that 2PC might be used. It’s quite different from PostgreSQL. In PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is required when PREPARE TRANSACTION. With MySQL, I guess FDW needs a way to tell that the (next) transaction needs to be started with XA START so it can be prepared. It could be a custom GUC or an SQL function. Then when starting a new transaction on the MySQL server, the FDW can generate and store a transaction identifier somewhere alongside the connection. At the prepare phase, it passes the transaction identifier via GetPrepareId() API to the core. I haven’t tested the above yet and it’s just a desk plan. 
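The difference in command sequences described here can be sketched roughly as follows. This is only an illustration of the two protocols; the identifier format and function names are assumptions, not the patch's API:

```python
# Hypothetical sketch contrasting PostgreSQL's and MySQL's 2PC command
# sequences as discussed above. Names and id format are illustrative.

def pg_sequence(xact_id, two_phase):
    """PostgreSQL can decide at end of transaction: the identifier is
    only needed at PREPARE TRANSACTION time."""
    cmds = ["BEGIN"]
    cmds.append(f"PREPARE TRANSACTION '{xact_id}'" if two_phase else "COMMIT")
    return cmds

def mysql_sequence(xact_id, two_phase):
    """MySQL must know up front: a transaction that may later be prepared
    has to start with XA START, so the id is needed before any work."""
    cmds = [f"XA START '{xact_id}'", f"XA END '{xact_id}'"]
    cmds.append(f"XA PREPARE '{xact_id}'" if two_phase
                else f"XA COMMIT '{xact_id}' ONE PHASE")
    return cmds
```

The asymmetry is visible in where `xact_id` first appears: only in the last command for PostgreSQL, but in the very first command for MySQL, which is why an FDW for MySQL would need to know before BEGIN-time whether 2PC might be used.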
It's definitely a good idea to try integrating this 2PC feature into FDWs other than postgres_fdw to see if design and interfaces are implemented sophisticatedly. > > What I'm worried is that if only postgres_fdw can implement the prepare function, it's a sign that FDW interface will be riddled with functions only for Postgres. That is, the FDW interface is getting away from its original purpose "access external data as a relation" and complex. Tomas Vondra showed this concern as follows: > > Horizontal scalability/sharding > https://www.postgresql.org/message-id/flat/CANP8%2BjK%3D%2B3zVYDFY0oMAQKQVJ%2BqReDHr1UPdyFEELO82yVfb9A%40mail.gmail.com#2c45f0ee97855449f1f7fedcef1d5e11 > > > [Tomas Vondra's remarks] > -------------------------------------------------- > > This strikes me as a bit of a conflict of interest with FDW which > > seems to want to hide the fact that it's foreign; the FDW > > implementation makes its own optimization decisions which might > > make sense for single table queries but breaks down in the face of > > joins. > > +1 to these concerns > > In my mind, FDW is a wonderful tool to integrate PostgreSQL with > external data sources, and it's nicely shaped for this purpose, which > implies the abstractions and assumptions in the code. > > The truth however is that many current uses of the FDW API are actually > using it for different purposes because there's no other way to do that, > not because FDWs are the "right way". And this includes the attempts to > build sharding on FDW, I think. > > Situations like this result in "improvements" of the API that seem to > improve the API for the second group, but make the life harder for the > original FDW API audience by making the API needlessly complex. And I > say "seem to improve" because the second group eventually runs into the > fundamental abstractions and assumptions the API is based on anyway. 
> > And based on the discussions at pgcon, I think this is the main reason > why people cringe when they hear "FDW" and "sharding" in the same sentence. > > ... > My other worry is that we'll eventually mess the FDW infrastructure, > making it harder to use for the original purpose. Granted, most of the > improvements proposed so far look sane and useful for FDWs in general, > but sooner or later that ceases to be the case - there will be changes > needed merely for the sharding. Those will be tough decisions. > -------------------------------------------------- > > > > Regarding the comparison between FDW transaction APIs and transaction > > callbacks, I think one of the benefits of providing FDW transaction > > APIs is that the core is able to manage the status of foreign > > transactions. We need to track the status of individual foreign > > transactions to support atomic commit. If we use transaction callbacks > > (XactCallback) that many FDWs are using, I think we will end up > > calling the transaction callback and leave the transaction work to > > FDWs, leading that the core is not able to know the return values of > > PREPARE TRANSACTION for example. We can add more arguments passed to > > transaction callbacks to get the return value from FDWs but I don’t > > think it’s a good idea as transaction callbacks are used not only by > > FDW but also other external modules. > > To track the foreign transaction status, we can add GetTransactionStatus() to the FDW interface as an alternative, can't we? I haven't thought of such an interface but it sounds like the transaction status is managed on both the core and FDWs. Could you elaborate on that? > > > > With the current version patch (v23), it supports only > > INSERT/DELETE/UPDATE. But I'm going to change the patch so that it > > supports other writes SQLs as Fujii-san also pointed out. > > OK. I've just read that Fujii san already pointed out a similar thing. 
But I wonder if we can know that the UDF executed on the foreign server has updated data. Maybe we can know or guess it by calling txid_current_if_any() or checking the transaction status in FE/BE protocol, but can we deal with other FDWs other than postgres_fdw? Ah, my answer was not enough. It was only about tracking local writes. Regarding tracking of writes on the foreign server, I think there are restrictions. Currently, the executor registers a foreign server as a participant of 2PC before calling BeginForeignInsert(), BeginForeignModify(), and BeginForeignScan() etc with a flag indicating whether writes are going to happen on the foreign server. So even if a UDF in a SELECT statement that could update data were to be pushed down to the foreign server, the foreign server would be marked as *not* modified. I’ve not tested yet but I guess that since FDW is also allowed to register the foreign server along with that flag anytime before commit, FDW is able to forcibly change that flag if it knows the SELECT query is going to modify the data on the remote server. > > > > No, in the current design, the backend who received a query from the > > client does PREPARE, and then the transaction resolver process, a > > background worker, does COMMIT PREPARED. > > This "No" means the current implementation cannot group commits from multiple transactions? Yes. > Does the transaction resolver send COMMIT PREPARED and waits for its response for each transaction one by one? For example, > > [local server] > Transaction T1 and T2 performs 2PC at the same time. > Transaction resolver sends COMMIT PREPARED for T1 and then waits for the response. > T1 writes COMMIT PREPARED record locally and sync the WAL. > Transaction resolver sends COMMIT PREPARED for T2 and then waits for the response. > T2 writes COMMIT PREPARED record locally and sync the WAL. > > [foreign server] > T1 writes COMMIT PREPARED record locally and sync the WAL. 
> T2 writes COMMIT PREPARED record locally and sync the WAL. Just to be clear, the transaction resolver writes FDWXACT_REMOVE records instead of COMMIT PREPARED record to remove the foreign transaction entry. But, yes, the transaction resolver works as you explained above. > If the WAL records of multiple concurrent transactions are written and synced separately, i.e. group commit doesn't take effect, then the OLTP transaction performance will be unacceptable. I agree that it'll be a large performance penalty. I'd like to have it but I’m not sure we should have it in the first version from the perspective of complexity. Since the procedure of 2PC is originally high cost, in my opinion, the user should avoid using it as much as possible in terms of performance. Especially in OLTP, its cost will directly affect the latency. I’d suggest designing the database schema so a transaction touches only one foreign server, but do you have a concrete OLTP use case that normally requires 2PC, and how many servers are involved within a distributed transaction? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
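To make the group-commit concern above concrete, here is a toy model (not the patch's code; all names are illustrative) comparing a resolver that flushes WAL once per resolved transaction with one that amortizes the flush over a batch:

```python
# Hypothetical sketch of the resolver behavior discussed above.
# `flush` stands in for one WAL write + fsync of the removal records.

def resolve_serially(xacts, flush):
    """One flush per resolved foreign transaction (the behavior described
    in the thread: COMMIT PREPARED handled one by one)."""
    for fx in xacts:
        flush([fx])

def resolve_in_batches(xacts, flush, batch_size=16):
    """A group-commit-style resolver: one flush covers a whole batch of
    resolved transactions, amortizing the sync cost."""
    for i in range(0, len(xacts), batch_size):
        flush(xacts[i:i + batch_size])
```

With 100 concurrent transactions and a batch size of 16, the serial resolver performs 100 flushes while the batching one performs 7, which is the kind of difference that determines whether OLTP latency is acceptable.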
On 2020/07/17 20:04, Masahiko Sawada wrote: > On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: >> >> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> >> I have briefly checked the only oracle_fdw but in general I think that >>> if an existing FDW supports transaction begin, commit, and rollback, >>> these can be ported to new FDW transaction APIs easily. >> >> Does oracle_fdw support begin, commit and rollback? >> >> And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction? In other words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC processing? I think we need to confirm that there are concrete examples. > > I also believe they do. But I'm concerned that some FDW needs to start > a transaction differently when using 2PC. For instance, IIUC MySQL > also supports 2PC but the transaction needs to be started with "XA > START id” when the transaction needs to be prepared. The transaction > started with XA START can be closed by XA END followed by XA PREPARE > or XA COMMIT ONE PHASE. Does this mean that FDW should also provide the API for xa_end()? Maybe we need to consider again which API we should provide in FDW, based on the XA specification? > It means that when starts a new transaction > the transaction needs to prepare the transaction identifier and to > know that 2PC might be used. It’s quite different from PostgreSQL. In > PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE > TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is > required when PREPARE TRANSACTION. > > With MySQL, I guess FDW needs a way to tell the (next) transaction > needs to be started with XA START so it can be prepared. It could be a > custom GUC or an SQL function. 
Then when starts a new transaction on > MySQL server, FDW can generate and store a transaction identifier into > somewhere alongside the connection. At the prepare phase, it passes > the transaction identifier via GetPrepareId() API to the core. > > I haven’t tested the above yet and it’s just a desk plan. it's > definitely a good idea to try integrating this 2PC feature to FDWs > other than postgres_fdw to see if design and interfaces are > implemented sophisticatedly. With the current patch, we track whether write queries are executed in each server. Then, if the number of servers that execute write queries is less than two, 2PC is skipped. This "optimization" is not necessary (cannot be applied) when using mysql_fdw because the transaction starts with XA START. Right? If that's the "optimization" only for postgres_fdw, maybe it's better to get rid of that "optimization" from the first patch, to make the patch simpler. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
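The decision Fujii-san describes can be sketched like this. The names are illustrative only, not the patch's API; the real patch tracks participants and their modified flags in its FdwXact machinery:

```python
# Hypothetical sketch of the "skip 2PC when fewer than two servers wrote"
# optimization discussed above. Names are illustrative.

def choose_commit_protocol(participants):
    """participants: dict mapping server name -> did it execute writes?"""
    writers = [s for s, modified in participants.items() if modified]
    if len(writers) <= 1:
        # At most one server changed data: a plain COMMIT on each server
        # is already atomic enough, no PREPARE needed.
        return "one-phase"
    # Two or more writers: atomic commit requires prepare on all of them.
    return "two-phase"
```

As noted above, this late decision only works when the server lets you choose the protocol at commit time (as PostgreSQL does); with MySQL-style XA, where the transaction must already have been started with XA START, the choice cannot be deferred this way.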
On 2020/07/16 14:47, Masahiko Sawada wrote: > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> >> >> On 2020/07/14 9:08, Masahiro Ikeda wrote: >>>> I've attached the latest version patches. I've incorporated the review >>>> comments I got so far and improved locking strategy. >>> >>> Thanks for updating the patch! >> >> +1 >> I'm interested in these patches and now studying them. While checking >> the behaviors of the patched PostgreSQL, I got three comments. > > Thank you for testing this patch! > >> >> 1. We can access to the foreign table even during recovery in the HEAD. >> But in the patched version, when I did that, I got the following error. >> Is this intentional? >> >> ERROR: cannot assign TransactionIds during recovery > > No, it should be fixed. I'm going to fix this by not collecting > participants for atomic commit during recovery. Thanks for trying to fix the issues! I'd like to report one more issue. When I started a new transaction on the local server, executed an INSERT on the remote server via postgres_fdw and then quit psql, I got the following assertion failure.

TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
0   postgres        0x000000010d52f3c0 ExceptionalCondition + 160
1   postgres        0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
2   postgres        0x000000010cefff14 AtProcExit_FdwXact + 20
3   postgres        0x000000010d313fe3 shmem_exit + 179
4   postgres        0x000000010d313e7a proc_exit_prepare + 122
5   postgres        0x000000010d313da3 proc_exit + 19
6   postgres        0x000000010d35112f PostgresMain + 3711
7   postgres        0x000000010d27bb3a BackendRun + 570
8   postgres        0x000000010d27af6b BackendStartup + 475
9   postgres        0x000000010d279ed1 ServerLoop + 593
10  postgres        0x000000010d277940 PostmasterMain + 6016
11  postgres        0x000000010d1597b9 main + 761
12  libdyld.dylib   0x00007fff7161e3d5 start + 1
13  ???             0x0000000000000003 0x0 + 3

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On 2020-07-17 15:55, Masahiko Sawada wrote: > On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda <ikedamsh@oss.nttdata.com> > wrote: >> >> On 2020-07-16 13:16, Masahiko Sawada wrote: >> > On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda <ikedamsh@oss.nttdata.com> >> > wrote: >> >> >> >> > I've attached the latest version patches. I've incorporated the review >> >> > comments I got so far and improved locking strategy. >> >> >> >> I want to ask a question about streaming replication with 2PC. >> >> Are you going to support 2PC with streaming replication? >> >> >> >> I tried streaming replication using v23 patches. >> >> I confirm that 2PC works with streaming replication, >> >> which there are primary/standby coordinator. >> >> >> >> But, in my understanding, the WAL of "PREPARE" and >> >> "COMMIT/ABORT PREPARED" can't be replicated to the standby server in >> >> sync. >> >> >> >> If this is right, the unresolved transaction can be occurred. >> >> >> >> For example, >> >> >> >> 1. PREPARE is done >> >> 2. crash primary before the WAL related to PREPARE is >> >> replicated to the standby server >> >> 3. promote standby server // but can't execute "ABORT PREPARED" >> >> >> >> In above case, the remote server has the unresolved transaction. >> >> Can we solve this problem to support in-sync replication? >> >> >> >> But, I think some users use async replication for performance. >> >> Do we need to document the limitation or make another solution? >> >> >> > >> > IIUC with synchronous replication, we can guarantee that WAL records >> > are written on both primary and replicas when the client got an >> > acknowledgment of commit. We don't replicate each WAL records >> > generated during transaction one by one in sync. In the case you >> > described, the client will get an error due to the server crash. >> > Therefore I think the user cannot expect WAL records generated so far >> > has been replicated. 
The same issue could happen also when the user >> > executes PREPARE TRANSACTION and the server crashes. >> >> Thanks! I didn't noticed the behavior when a user executes PREPARE >> TRANSACTION is same. >> >> IIUC with 2PC, there is a different point between (1)PREPARE >> TRANSACTION >> and (2)2PC. >> The point is that whether the client can know when the server crashed >> and it's global tx id. >> >> If (1)PREPARE TRANSACTION is failed, it's ok the client execute same >> command >> because if the remote server is already prepared the command will be >> ignored. >> >> But, if (2)2PC is failed with coordinator crash, the client can't know >> what operations should be done. >> >> If the old coordinator already executed PREPARED, there are some >> transaction which should be ABORT PREPARED. >> But if the PREPARED WAL is not sent to the standby, the new >> coordinator >> can't execute ABORT PREPARED. >> And the client can't know which remote servers have PREPARED >> transactions which should be ABORTED either. >> >> Even if the client can know that, only the old coordinator knows its >> global transaction id. >> Only the database administrator can analyze the old coordinator's log >> and then execute the appropriate commands manually, right? > > I think that's right. In the case of the coordinator crash, the user > can look orphaned foreign prepared transactions by checking the > 'identifier' column of pg_foreign_xacts on the new standby server and > the prepared transactions on the remote servers. I think there is a case we can't check orphaned foreign prepared transaction in pg_foreign_xacts view on the new standby server. It confuses users and database administrators. If the primary coordinator crashes after preparing foreign transaction, but before sending XLOG_FDWXACT_INSERT records to the standby server, the standby server can't restore their transaction status and pg_foreign_xacts view doesn't show the prepared foreign transactions. 
To send XLOG_FDWXACT_INSERT records asynchronously leads this problem. >> > To prevent this >> > issue, I think we would need to send each WAL records in sync but I'm >> > not sure it's reasonable behavior, and as long as we write WAL in the >> > local and then send it to replicas we would need a smart mechanism to >> > prevent this situation. >> >> I agree. To send each 2PC WAL records in sync must be with a large >> performance impact. >> At least, we need to document the limitation and how to handle this >> situation. > > Ok. I'll add it. Thanks a lot. >> > Related to the pointing out by Ikeda-san, I realized that with the >> > current patch the backend waits for synchronous replication and then >> > waits for foreign transaction resolution. But it should be reversed. >> > Otherwise, it could lead to data loss even when the client got an >> > acknowledgment of commit. Also, when the user is using both atomic >> > commit and synchronous replication and wants to cancel waiting, he/she >> > will need to press ctl-c twice with the current patch, which also >> > should be fixed. >> >> I'm sorry that I can't understood. >> >> In my understanding, if COMMIT WAL is replicated to the standby in >> sync, >> the standby server can resolve the transaction after crash recovery in >> promoted phase. >> >> If reversed, there are some situation which can't guarantee atomic >> commit. >> In case that some foreign transaction resolutions are succeed but >> others >> are failed(and COMMIT WAL is not replicated), >> the standby must ABORT PREPARED because the COMMIT WAL is not >> replicated. >> This means that some foreign transactions are COMMITE PREPARED >> executed >> by primary coordinator, >> other foreign transactions can be ABORT PREPARED executed by secondary >> coordinator. > > You're right. Thank you for pointing out! 
> > If the coordinator crashes after the client gets acknowledgment of the > successful commit of the transaction but before sending > XLOG_FDWXACT_REMOVE record to the replicas, the FdwXact entries are > left on the replicas even after failover. But since we require FDW to > tolerate the error of undefined prepared transactions in > COMMIT/ROLLBACK PREPARED it won’t be a critical problem. I agree. It's ok that the primary coordinator sends XLOG_FDWXACT_REMOVE records asynchronously. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Fri, Jul 17, 2020 at 8:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 16 Jul 2020 at 13:53, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > Hi Sawada san, > > > > > > I'm reviewing this patch series, and let me give some initial comments and questions. I'm looking at this with a hope that this will be useful purely as a FDW enhancement for our new use cases, regardless of whether the FDW will be used for Postgres scale-out. > > Thank you for reviewing this patch! > > Yes, this patch is trying to resolve the generic atomic commit problem > w.r.t. FDW, and will be useful also for Postgres scale-out. > I think it is important to get a consensus on this point. If I understand correctly, Tsunakawa-San doesn't seem to be convinced that FDW can be used for postgres scale-out and we are trying to paint this feature as a step forward in the scale-out direction. As per my understanding, we don't have a very clear vision whether we will be able to achieve the other important aspects of scale-out feature like global visibility if we go in this direction and that is the reason I have insisted in this and the other related thread [1] to at least have a high-level idea of the same before going too far with this patch. It is quite possible that after spending months of effort to straighten out this patch/feature, we come to the conclusion that this needs to be re-designed or requires a lot of re-work to ensure that it can be extended for global visibility. It is better to spend some effort up front to see if the proposed patch is a stepping stone for achieving what we want w.r.t postgres scale-out. [1] - https://www.postgresql.org/message-id/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > I also believe they do. But I'm concerned that some FDW needs to start > a transaction differently when using 2PC. For instance, IIUC MySQL > also supports 2PC but the transaction needs to be started with "XA > START id” when the transaction needs to be prepared. The transaction > started with XA START can be closed by XA END followed by XA PREPARE > or XA COMMIT ONE PHASE. It means that when starts a new transaction > the transaction needs to prepare the transaction identifier and to > know that 2PC might be used. It’s quite different from PostgreSQL. In > PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE > TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is > required when PREPARE TRANSACTION. I guess Postgres is rather a minority in this regard. All I know is XA and its Java counterpart (Java Transaction API: JTA). In XA, the connection needs to be associated with an XID before its transaction work is performed. If some transaction work is already done before associating with XID, xa_start() returns an error like this: [XA specification] -------------------------------------------------- [XAER_OUTSIDE] The resource manager is doing work outside any global transaction on behalf of the application. -------------------------------------------------- [Java Transaction API (JTA)] -------------------------------------------------- void start(Xid xid, int flags) throws XAException This method starts work on behalf of a transaction branch. ... 3.4.7 Local and Global Transactions The resource adapter is encouraged to support the usage of both local and global transactions within the same transactional connection. Local transactions are transactions that are started and coordinated by the resource manager internally. The XAResource interface is not used for local transactions. When using the same connection to perform both local and global transactions, the following rules apply: . 
The local transaction must be committed (or rolled back) before starting a global transaction in the connection. . The global transaction must be disassociated from the connection before any local transaction is started. -------------------------------------------------- (FWIW, jdbc_fdw would expect to use JTA for this FDW 2PC?) > I haven’t tested the above yet and it’s just a desk plan. it's > definitely a good idea to try integrating this 2PC feature to FDWs > other than postgres_fdw to see if design and interfaces are > implemented sophisticatedly. Yes, if we address this 2PC feature as an FDW enhancement, we need to make sure that at least some well-known DBMSs should be able to implement the new interface. The following part may help devise the interface: [References from XA specification] -------------------------------------------------- The primary use of xa_start() is to register a new transaction branch with the RM. This marks the start of the branch. Subsequently, the AP, using the same thread of control, uses the RM’s native interface to do useful work. All requests for service made by the same thread are part of the same branch until the thread dissociates from the branch (see below). 3.3.1 Registration of Resource Managers Normally, a TM involves all associated RMs in a transaction branch. (The TM’s set of RM switches, described in Section 4.3 on page 21 tells the TM which RMs are associated with it.) The TM calls all these RMs with xa_start(), xa_end(), and xa_prepare (), although an RM that is not active in a branch need not participate further (see Section 2.3.2 on page 8). A technique to reduce overhead for infrequently-used RMs is discussed below. Dynamic Registration Certain RMs, especially those involved in relatively few global transactions, may ask the TM to assume they are not involved in a transaction. These RMs must register with the TM before they do application work, to see whether the work is part of a global transaction. 
The TM never calls these RMs with any form of xa_start(). An RM declares dynamic registration in its switch (see Section 4.3 on page 21). An RM can make this declaration only on its own behalf, and doing so does not change the TM’s behaviour with respect to other RMs. When an AP requests work from such an RM, before doing any work, the RM contacts the TM by calling ax_reg(). The RM must call ax_reg() from the same thread of control that the AP would use if it called ax_reg() directly. The TM returns to the RM the appropriate XID if the AP is in a global transaction. The implications of dynamically registering are as follows: when a thread of control begins working on behalf of a transaction branch, the transaction manager calls xa_start() for all resource managers known to the thread except those having TMREGISTER set in their xa_switch_t structure. Thus, those resource managers with this flag set must explicitly join a branch with ax_reg(). Secondly, when a thread of control is working on behalf of a branch, a transaction manager calls xa_end() for all resource managers known to the thread that either do not have TMREGISTER set in their xa_switch_t structure or have dynamically registered with ax_reg(). int xa_start(XID *xid, int rmid, long flags) DESCRIPTION A transaction manager calls xa_start() to inform a resource manager that an application may do work on behalf of a transaction branch. ... A transaction manager calls xa_start() only for those resource managers that do not have TMREGISTER set in the flags element of their xa_switch_t structure. Resource managers with TMREGISTER set must use ax_reg() to join a transaction branch (see ax_reg() for details). -------------------------------------------------- > > To track the foreign transaction status, we can add GetTransactionStatus() to > the FDW interface as an alternative, can't we? > > I haven't thought such an interface but it sounds like the transaction > status is managed on both the core and FDWs. 
Could you elaborate on > that? I don't have such a deep analysis. I just thought that the core could keep track of the local transaction status, and ask each participant FDW about its transaction status to determine an action. > > If the WAL records of multiple concurrent transactions are written and > synced separately, i.e. group commit doesn't take effect, then the OLTP > transaction performance will be unacceptable. > > I agree that it'll be a large performance penalty. I'd like to have it > but I’m not sure we should have it in the first version from the > perspective of complexity. I think at least we should have a rough image of how we can reach the goal. Otherwise, the current design/implementation may have to be overhauled with great effort in the near future. Apart from that, I feel it's unnatural that the commit processing is serialized at the transaction resolver while the DML processing of multiple foreign transactions can be performed in parallel. > Since the procedure of 2PC is originally > high cost, in my opinion, the user should not use as much as possible > in terms of performance. Especially in OLTP, its cost will directly > affect the latency. I’d suggest designing database schema so > transaction touches only one foreign server but do you have concrete > OLTP use case where normally requires 2PC, and how many servers > involved within a distributed transaction? I can't share the details, but some of our customers show interest in Postgres scale-out or FDW 2PC for the following use cases: * Multitenant OLTP where the data specific to one tenant is stored on one database server. On the other hand, some data are shared among all tenants, and they are stored on a separate server. The shared data and the tenant-specific data is updated in the same transaction (I don't know the frequency of such transactions.) * An IoT use case where each edge database server monitors and tracks the movement of objects in one area. 
Those edge database servers store the records of objects they manage. When an object gets out of one area and moves to another, the record for the object is moved between the two edge database servers using an atomic distributed transaction. (I wonder if TPC-C or TPC-E needs distributed transactions...) Regards Takayuki Tsunakawa
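The atomicity requirement behind the IoT hand-off use case above can be illustrated with a toy sketch (Python; all names are invented for this example and bear no relation to the patch's real code): the record either moves completely between the two edge servers, or neither server changes.

```python
# Toy model of moving an object's record between two edge databases with an
# atomic (two-phase) protocol. The dicts stand in for the edge servers'
# tables; fail_before_commit simulates a coordinator crash after prepare.

def move_object(obj, src, dst, fail_before_commit=False):
    """Two-phase move: stage the change on both sides, then apply both or neither."""
    if obj not in src:
        return False
    # phase 1 (prepare): both participants agree they can apply the change;
    # a crash here means a resolver would later roll back both prepared
    # branches, so nothing is half-applied
    if fail_before_commit:
        return False
    # phase 2 (commit): apply the staged delete and insert on both servers
    del src[obj]
    dst[obj] = True
    return True

area_a, area_b = {"obj1": True}, {}
assert move_object("obj1", area_a, area_b) is True
assert "obj1" not in area_a and "obj1" in area_b   # fully moved
area_c, area_d = {"obj2": True}, {}
assert move_object("obj2", area_c, area_d, fail_before_commit=True) is False
assert "obj2" in area_c and "obj2" not in area_d   # fully untouched
```

Without 2PC, a crash between the delete on one server and the insert on the other would either lose the record or duplicate it, which is exactly what this use case cannot tolerate.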
On Sat, 18 Jul 2020 at 01:45, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/07/17 20:04, Masahiko Sawada wrote: > > On Fri, 17 Jul 2020 at 14:22, tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > >> > >> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > >> I have briefly checked only oracle_fdw, but in general I think that > >>> if an existing FDW supports transaction begin, commit, and rollback, > >>> these can be ported to new FDW transaction APIs easily. > >> > >> Does oracle_fdw support begin, commit and rollback? > >> > >> And most importantly, do other major DBMSs, including Oracle, provide the API for preparing a transaction? In other words, will the FDWs other than postgres_fdw really be able to take advantage of the new FDW functions to join the 2PC processing? I think we need to confirm that there are concrete examples. > > > > I also believe they do. But I'm concerned that some FDW needs to start > > a transaction differently when using 2PC. For instance, IIUC MySQL > > also supports 2PC but the transaction needs to be started with "XA > > START id" when the transaction needs to be prepared. The transaction > > started with XA START can be closed by XA END followed by XA PREPARE > > or XA COMMIT ONE PHASE. > > This means that FDW should also provide the API for xa_end()? > Maybe we need to consider again which API we should provide in FDW, > based on the XA specification? I'm not sure that we really need the API for xa_end(). It's not necessary at least in MySQL's case. mysql_fdw can execute either XA END and XA PREPARE when the FDW prepare API is called, or XA END and XA COMMIT ONE PHASE when the FDW commit API is called with FDWXACT_FLAG_ONEPHASE. > > > > It means that when starting a new transaction > > the transaction needs to prepare the transaction identifier and to > > know that 2PC might be used. It's quite different from PostgreSQL.
In > > PostgreSQL, we can start a transaction by BEGIN and end it by PREPARE > > TRANSACTION, COMMIT, or ROLLBACK. The transaction identifier is > > required only at PREPARE TRANSACTION. > > > > With MySQL, I guess the FDW needs a way to tell that the (next) transaction > > needs to be started with XA START so it can be prepared. It could be a > > custom GUC or an SQL function. Then, when starting a new transaction on the > > MySQL server, the FDW can generate and store a transaction identifier > > somewhere alongside the connection. At the prepare phase, it passes > > the transaction identifier via the GetPrepareId() API to the core. > > > > I haven't tested the above yet and it's just a desk plan. It's > > definitely a good idea to try integrating this 2PC feature into FDWs > > other than postgres_fdw to see if the design and interfaces are > > implemented sophisticatedly. > > With the current patch, we track whether write queries are executed > in each server. Then, if the number of servers that execute write queries > is less than two, 2PC is skipped. This "optimization" is not necessary > (cannot be applied) when using mysql_fdw because the transaction starts > with XA START. Right? I think we can use XA COMMIT ONE PHASE in MySQL, which both prepares and commits the transaction. If the number of servers that executed write queries is less than two, the core transaction manager calls the CommitForeignTransaction API with the flag FDWXACT_FLAG_ONEPHASE. That way, mysql_fdw can execute XA COMMIT ONE PHASE instead of XA PREPARE, following XA END. On the other hand, when the number of such servers is greater than or equal to two, the core transaction manager calls the PrepareForeignTransaction API and then the CommitForeignTransaction API without that flag. In this case, mysql_fdw can execute XA END and XA PREPARE in the PrepareForeignTransaction API call, and then XA COMMIT in the CommitForeignTransaction API call.
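The one-phase/two-phase choice described above can be sketched as follows (Python, purely illustrative; the function name, the `FDWXACT_FLAG_ONEPHASE` usage and the exact XA statement text are stand-ins, not the patch's or mysql_fdw's actual code):

```python
# Sketch of the XA statement sequence a MySQL-style participant would run,
# driven by how many servers executed writes in the distributed transaction.

FDWXACT_FLAG_ONEPHASE = 0x1  # assumed flag value, for illustration only

def mysql_fdw_commands(xid, n_write_servers):
    """Return the XA statements one MySQL participant would execute."""
    if n_write_servers < 2:
        # core skips PrepareForeignTransaction and calls the commit API with
        # the one-phase flag: prepare and commit are folded into one step
        return [f"XA END '{xid}'", f"XA COMMIT '{xid}' ONE PHASE"]
    # full two-phase: prepare API call, then commit API call without the flag
    return [f"XA END '{xid}'", f"XA PREPARE '{xid}'", f"XA COMMIT '{xid}'"]

assert mysql_fdw_commands("fx1", 1) == ["XA END 'fx1'", "XA COMMIT 'fx1' ONE PHASE"]
assert mysql_fdw_commands("fx1", 2) == ["XA END 'fx1'", "XA PREPARE 'fx1'", "XA COMMIT 'fx1'"]
```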
Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/07/16 14:47, Masahiko Sawada wrote: > > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> > >> On 2020/07/14 9:08, Masahiro Ikeda wrote: > >>>> I've attached the latest version patches. I've incorporated the review > >>>> comments I got so far and improved locking strategy. > >>> > >>> Thanks for updating the patch! > >> > >> +1 > >> I'm interested in these patches and now studying them. While checking > >> the behaviors of the patched PostgreSQL, I got three comments. > > > > Thank you for testing this patch! > > > >> > >> 1. We can access to the foreign table even during recovery in the HEAD. > >> But in the patched version, when I did that, I got the following error. > >> Is this intentional? > >> > >> ERROR: cannot assign TransactionIds during recovery > > > > No, it should be fixed. I'm going to fix this by not collecting > > participants for atomic commit during recovery. > > Thanks for trying to fix the issues! > > I'd like to report one more issue. When I started new transaction > in the local server, executed INSERT in the remote server via > postgres_fdw and then quit psql, I got the following assertion failure. 
> > TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) > 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 > 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 > 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 > 3 postgres 0x000000010d313fe3 shmem_exit + 179 > 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 > 5 postgres 0x000000010d313da3 proc_exit + 19 > 6 postgres 0x000000010d35112f PostgresMain + 3711 > 7 postgres 0x000000010d27bb3a BackendRun + 570 > 8 postgres 0x000000010d27af6b BackendStartup + 475 > 9 postgres 0x000000010d279ed1 ServerLoop + 593 > 10 postgres 0x000000010d277940 PostmasterMain + 6016 > 11 postgres 0x000000010d1597b9 main + 761 > 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 > 13 ??? 0x0000000000000003 0x0 + 3 > Thank you for reporting the issue! I've attached the latest version patch that incorporated all comments I got so far. I've removed the patch adding the 'prefer' mode of foreign_twophase_commit to keep the patch set simple. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2020/07/16 14:47, Masahiko Sawada wrote:
> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>>>> I've attached the latest version patches. I've incorporated the review
>>>> comments I got so far and improved locking strategy.
>>>
>>> Thanks for updating the patch!
>>
>> +1
>> I'm interested in these patches and now studying them. While checking
>> the behaviors of the patched PostgreSQL, I got three comments.
>
> Thank you for testing this patch!
>
>>
>> 1. We can access to the foreign table even during recovery in the HEAD.
>> But in the patched version, when I did that, I got the following error.
>> Is this intentional?
>>
>> ERROR: cannot assign TransactionIds during recovery
>
> No, it should be fixed. I'm going to fix this by not collecting
> participants for atomic commit during recovery.
Thanks for trying to fix the issues!
I'd like to report one more issue. When I started new transaction
in the local server, executed INSERT in the remote server via
postgres_fdw and then quit psql, I got the following assertion failure.
TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160
1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313
2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20
3 postgres 0x000000010d313fe3 shmem_exit + 179
4 postgres 0x000000010d313e7a proc_exit_prepare + 122
5 postgres 0x000000010d313da3 proc_exit + 19
6 postgres 0x000000010d35112f PostgresMain + 3711
7 postgres 0x000000010d27bb3a BackendRun + 570
8 postgres 0x000000010d27af6b BackendStartup + 475
9 postgres 0x000000010d279ed1 ServerLoop + 593
10 postgres 0x000000010d277940 PostmasterMain + 6016
11 postgres 0x000000010d1597b9 main + 761
12 libdyld.dylib 0x00007fff7161e3d5 start + 1
13 ??? 0x0000000000000003 0x0 + 3
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: > > > > On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >> >> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> > >> > >> > >> > On 2020/07/16 14:47, Masahiko Sawada wrote: >> > > On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> > >> >> > >> >> > >> >> > >> On 2020/07/14 9:08, Masahiro Ikeda wrote: >> > >>>> I've attached the latest version patches. I've incorporated the review >> > >>>> comments I got so far and improved locking strategy. >> > >>> >> > >>> Thanks for updating the patch! >> > >> >> > >> +1 >> > >> I'm interested in these patches and now studying them. While checking >> > >> the behaviors of the patched PostgreSQL, I got three comments. >> > > >> > > Thank you for testing this patch! >> > > >> > >> >> > >> 1. We can access to the foreign table even during recovery in the HEAD. >> > >> But in the patched version, when I did that, I got the following error. >> > >> Is this intentional? >> > >> >> > >> ERROR: cannot assign TransactionIds during recovery >> > > >> > > No, it should be fixed. I'm going to fix this by not collecting >> > > participants for atomic commit during recovery. >> > >> > Thanks for trying to fix the issues! >> > >> > I'd like to report one more issue. When I started new transaction >> > in the local server, executed INSERT in the remote server via >> > postgres_fdw and then quit psql, I got the following assertion failure. 
>> > >> > TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) >> > 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 >> > 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 >> > 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 >> > 3 postgres 0x000000010d313fe3 shmem_exit + 179 >> > 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 >> > 5 postgres 0x000000010d313da3 proc_exit + 19 >> > 6 postgres 0x000000010d35112f PostgresMain + 3711 >> > 7 postgres 0x000000010d27bb3a BackendRun + 570 >> > 8 postgres 0x000000010d27af6b BackendStartup + 475 >> > 9 postgres 0x000000010d279ed1 ServerLoop + 593 >> > 10 postgres 0x000000010d277940 PostmasterMain + 6016 >> > 11 postgres 0x000000010d1597b9 main + 761 >> > 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 >> > 13 ??? 0x0000000000000003 0x0 + 3 >> > >> >> Thank you for reporting the issue! >> >> I've attached the latest version patch that incorporated all comments >> I got so far. I've removed the patch adding the 'prefer' mode of >> foreign_twophase_commit to keep the patch set simple. > > > I have started to review the patchset. Just a quick comment. > > Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch > contains changes (adding fdwxact includes) for > src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c > and src/backend/executor/execPartition.c files that doesn't seem to be > required with the latest version. Thanks for your comment. Right. I've removed these changes on the local branch. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/07/27 15:59, Masahiko Sawada wrote: > On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: >> >> >> >> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >>> >>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>> >>>> >>>> >>>> On 2020/07/16 14:47, Masahiko Sawada wrote: >>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote: >>>>>>>> I've attached the latest version patches. I've incorporated the review >>>>>>>> comments I got so far and improved locking strategy. >>>>>>> >>>>>>> Thanks for updating the patch! >>>>>> >>>>>> +1 >>>>>> I'm interested in these patches and now studying them. While checking >>>>>> the behaviors of the patched PostgreSQL, I got three comments. >>>>> >>>>> Thank you for testing this patch! >>>>> >>>>>> >>>>>> 1. We can access to the foreign table even during recovery in the HEAD. >>>>>> But in the patched version, when I did that, I got the following error. >>>>>> Is this intentional? >>>>>> >>>>>> ERROR: cannot assign TransactionIds during recovery >>>>> >>>>> No, it should be fixed. I'm going to fix this by not collecting >>>>> participants for atomic commit during recovery. >>>> >>>> Thanks for trying to fix the issues! >>>> >>>> I'd like to report one more issue. When I started new transaction >>>> in the local server, executed INSERT in the remote server via >>>> postgres_fdw and then quit psql, I got the following assertion failure. 
>>>> >>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) >>>> 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 >>>> 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 >>>> 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 >>>> 3 postgres 0x000000010d313fe3 shmem_exit + 179 >>>> 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 >>>> 5 postgres 0x000000010d313da3 proc_exit + 19 >>>> 6 postgres 0x000000010d35112f PostgresMain + 3711 >>>> 7 postgres 0x000000010d27bb3a BackendRun + 570 >>>> 8 postgres 0x000000010d27af6b BackendStartup + 475 >>>> 9 postgres 0x000000010d279ed1 ServerLoop + 593 >>>> 10 postgres 0x000000010d277940 PostmasterMain + 6016 >>>> 11 postgres 0x000000010d1597b9 main + 761 >>>> 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 >>>> 13 ??? 0x0000000000000003 0x0 + 3 >>>> >>> >>> Thank you for reporting the issue! >>> >>> I've attached the latest version patch that incorporated all comments >>> I got so far. I've removed the patch adding the 'prefer' mode of >>> foreign_twophase_commit to keep the patch set simple. >> >> >> I have started to review the patchset. Just a quick comment. >> >> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch >> contains changes (adding fdwxact includes) for >> src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c >> and src/backend/executor/execPartition.c files that doesn't seem to be >> required with the latest version. > > Thanks for your comment. > > Right. I've removed these changes on the local branch. The latest patches failed to be applied to the master branch. Could you rebase the patches? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/07/27 15:59, Masahiko Sawada wrote: > > On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: > >> > >> > >> > >> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > >>> > >>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>> > >>>> > >>>> > >>>> On 2020/07/16 14:47, Masahiko Sawada wrote: > >>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote: > >>>>>>>> I've attached the latest version patches. I've incorporated the review > >>>>>>>> comments I got so far and improved locking strategy. > >>>>>>> > >>>>>>> Thanks for updating the patch! > >>>>>> > >>>>>> +1 > >>>>>> I'm interested in these patches and now studying them. While checking > >>>>>> the behaviors of the patched PostgreSQL, I got three comments. > >>>>> > >>>>> Thank you for testing this patch! > >>>>> > >>>>>> > >>>>>> 1. We can access to the foreign table even during recovery in the HEAD. > >>>>>> But in the patched version, when I did that, I got the following error. > >>>>>> Is this intentional? > >>>>>> > >>>>>> ERROR: cannot assign TransactionIds during recovery > >>>>> > >>>>> No, it should be fixed. I'm going to fix this by not collecting > >>>>> participants for atomic commit during recovery. > >>>> > >>>> Thanks for trying to fix the issues! > >>>> > >>>> I'd like to report one more issue. When I started new transaction > >>>> in the local server, executed INSERT in the remote server via > >>>> postgres_fdw and then quit psql, I got the following assertion failure. 
> >>>> > >>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) > >>>> 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 > >>>> 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 > >>>> 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 > >>>> 3 postgres 0x000000010d313fe3 shmem_exit + 179 > >>>> 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 > >>>> 5 postgres 0x000000010d313da3 proc_exit + 19 > >>>> 6 postgres 0x000000010d35112f PostgresMain + 3711 > >>>> 7 postgres 0x000000010d27bb3a BackendRun + 570 > >>>> 8 postgres 0x000000010d27af6b BackendStartup + 475 > >>>> 9 postgres 0x000000010d279ed1 ServerLoop + 593 > >>>> 10 postgres 0x000000010d277940 PostmasterMain + 6016 > >>>> 11 postgres 0x000000010d1597b9 main + 761 > >>>> 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 > >>>> 13 ??? 0x0000000000000003 0x0 + 3 > >>>> > >>> > >>> Thank you for reporting the issue! > >>> > >>> I've attached the latest version patch that incorporated all comments > >>> I got so far. I've removed the patch adding the 'prefer' mode of > >>> foreign_twophase_commit to keep the patch set simple. > >> > >> > >> I have started to review the patchset. Just a quick comment. > >> > >> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch > >> contains changes (adding fdwxact includes) for > >> src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c > >> and src/backend/executor/execPartition.c files that doesn't seem to be > >> required with the latest version. > > > > Thanks for your comment. > > > > Right. I've removed these changes on the local branch. > > The latest patches failed to be applied to the master branch. Could you rebase the patches? > Thank you for letting me know. I've attached the latest version patch set. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
> On 2020-07-17 15:55, Masahiko Sawada wrote: >> On Fri, 17 Jul 2020 at 11:06, Masahiro Ikeda >> <ikedamsh(at)oss(dot)nttdata(dot)com> >> wrote: >>> >>> On 2020-07-16 13:16, Masahiko Sawada wrote: >>>> On Tue, 14 Jul 2020 at 17:24, Masahiro Ikeda >>>> <ikedamsh(at)oss(dot)nttdata(dot)com> >>>> wrote: >>>>> >>>>>> I've attached the latest version patches. I've incorporated the >>>>>> review >>>>>> comments I got so far and improved locking strategy. >>>>> >>>>> I want to ask a question about streaming replication with 2PC. >>>>> Are you going to support 2PC with streaming replication? >>>>> >>>>> I tried streaming replication using v23 patches. >>>>> I confirm that 2PC works with streaming replication, >>>>> which there are primary/standby coordinator. >>>>> >>>>> But, in my understanding, the WAL of "PREPARE" and >>>>> "COMMIT/ABORT PREPARED" can't be replicated to the standby server >>>>> in >>>>> sync. >>>>> >>>>> If this is right, the unresolved transaction can be occurred. >>>>> >>>>> For example, >>>>> >>>>> 1. PREPARE is done >>>>> 2. crash primary before the WAL related to PREPARE is >>>>> replicated to the standby server >>>>> 3. promote standby server // but can't execute "ABORT PREPARED" >>>>> >>>>> In above case, the remote server has the unresolved transaction. >>>>> Can we solve this problem to support in-sync replication? >>>>> >>>>> But, I think some users use async replication for performance. >>>>> Do we need to document the limitation or make another solution? >>>>> >>>> >>>> IIUC with synchronous replication, we can guarantee that WAL records >>>> are written on both primary and replicas when the client got an >>>> acknowledgment of commit. We don't replicate each WAL records >>>> generated during transaction one by one in sync. In the case you >>>> described, the client will get an error due to the server crash. >>>> Therefore I think the user cannot expect WAL records generated so >>>> far >>>> has been replicated. 
The same issue could happen also when the user >>>> executes PREPARE TRANSACTION and the server crashes. >>> >>> Thanks! I hadn't noticed that the behavior when a user executes PREPARE >>> TRANSACTION is the same. >>> >>> IIUC with 2PC, there is a different point between (1)PREPARE >>> TRANSACTION >>> and (2)2PC. >>> The point is whether the client can know when the server crashed >>> and its global tx id. >>> >>> If (1)PREPARE TRANSACTION fails, it's OK for the client to execute the same >>> command >>> because if the remote server has already prepared, the command will be >>> ignored. >>> >>> But, if (2)2PC fails with a coordinator crash, the client can't >>> know >>> what operations should be done. >>> >>> If the old coordinator already executed PREPARED, there are some >>> transactions which should be ABORT PREPARED. >>> But if the PREPARED WAL is not sent to the standby, the new >>> coordinator >>> can't execute ABORT PREPARED. >>> And the client can't know which remote servers have PREPARED >>> transactions which should be ABORTED either. >>> >>> Even if the client can know that, only the old coordinator knows its >>> global transaction id. >>> Only the database administrator can analyze the old coordinator's log >>> and then execute the appropriate commands manually, right? >> >> I think that's right. In the case of a coordinator crash, the user >> can look for orphaned foreign prepared transactions by checking the >> 'identifier' column of pg_foreign_xacts on the new standby server and >> the prepared transactions on the remote servers. >> > I think there is a case where we can't check orphaned foreign > prepared transactions in the pg_foreign_xacts view on the new standby > server. > It confuses users and database administrators.
> > If the primary coordinator crashes after preparing foreign transaction, > but before sending XLOG_FDWXACT_INSERT records to the standby server, > the standby server can't restore their transaction status and > pg_foreign_xacts view doesn't show the prepared foreign transactions. > > To send XLOG_FDWXACT_INSERT records asynchronously leads this problem. If the primary replicates XLOG_FDWXACT_INSERT to the standby asynchronously, some prepared transactions may remain unresolved forever. Since resolving this inconsistency manually is a hard operation, we need to support synchronous XLOG_FDWXACT_INSERT replication. I understand that this has a large impact on performance, but users can control the consistency/durability vs. performance trade-off with the synchronous_commit parameter. What do you think? > Thank you for letting me know. I've attached the latest version patch > set. Thanks for updating. But the latest patches failed to be applied to the master branch. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
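The window Ikeda-san describes above can be shown with a toy event replay (Python; the event names and the single-record "WAL shipping" model are invented to illustrate the ordering problem, nothing more):

```python
# Toy timeline: with asynchronous replication, a coordinator crash between
# preparing the foreign transaction and shipping the XLOG_FDWXACT_INSERT
# record leaves the promoted standby with no trace of the prepared foreign
# transaction, so it can neither commit nor abort it.

def promoted_standby_view(events):
    """Replay coordinator events; return the fdwxact entries the standby
    has received by the time the coordinator crashes."""
    standby_fdwxacts = set()
    for ev, arg in events:
        if ev == "ship_fdwxact_wal":   # WAL record reached the standby
            standby_fdwxacts.add(arg)
        elif ev == "crash":            # nothing after the crash is shipped
            break
    return standby_fdwxacts

# async: crash before the WAL record is shipped -> standby sees nothing (orphan)
assert promoted_standby_view([("prepare_remote", "fx1"), ("crash", None),
                              ("ship_fdwxact_wal", "fx1")]) == set()
# synchronous shipping (WAL acknowledged before returning) closes the window
assert promoted_standby_view([("prepare_remote", "fx1"),
                              ("ship_fdwxact_wal", "fx1"),
                              ("crash", None)]) == {"fx1"}
```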
On Fri, 28 Aug 2020 at 17:50, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > I think there is a case we can't check orphaned foreign > > prepared transaction in pg_foreign_xacts view on the new standby > > server. > > It confuses users and database administrators. > > > > If the primary coordinator crashes after preparing foreign transaction, > > but before sending XLOG_FDWXACT_INSERT records to the standby server, > > the standby server can't restore their transaction status and > > pg_foreign_xacts view doesn't show the prepared foreign transactions. > > > > To send XLOG_FDWXACT_INSERT records asynchronously leads this problem. > > If the primary replicates XLOG_FDWXACT_INSERT to the standby > asynchronously, > some prepared transaction may be unsolved forever. > > Since I think to solve this inconsistency manually is hard operation, > we need to support synchronous XLOG_FDWXACT_INSERT replication. > > I understood that there are a lot of impact to the performance, > but users can control the consistency/durability vs performance > with synchronous_commit parameter. > > What do you think? I think the user can check such prepared transactions by seeing transactions that exist in the foreign server's pg_prepared_xacts but not in the coordinator server's pg_foreign_xacts, no? To make checking such prepared transactions easy, perhaps we could include the timestamp in the prepared transaction id. But I'm concerned about duplication of transaction ids due to clock skew. If there is a way to identify such unresolved foreign transactions and it's not cumbersome, then given that the problem you're concerned about is unlikely, I guess a certain number of users would be able to accept it as a restriction. So I'd recommend not dealing with this problem in the first version patch; we will be able to improve this feature to deal with this problem as an additional feature. Thoughts? > > Thank you for letting me know. I've attached the latest version patch > > set.
> > Thanks for updating. > But, the latest patches failed to be applied to the master branch. I'll submit the updated version patch. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
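The manual check Sawada-san suggests above amounts to a set difference between the two views; here is a minimal sketch (Python; the inputs stand in for query results from the remote server's pg_prepared_xacts and the coordinator's pg_foreign_xacts, and the id strings are made up):

```python
# Prepared transactions that the remote server knows about but the (new)
# coordinator does not track are orphans that need manual resolution.

def orphaned_prepared_xacts(remote_prepared_gids, coordinator_fdwxact_ids):
    """Return prepared-transaction ids present on the remote server but
    missing from the coordinator's tracked foreign transactions."""
    return sorted(set(remote_prepared_gids) - set(coordinator_fdwxact_ids))

remote = ["fx_100_1_10", "fx_101_1_10"]   # e.g. gid column of pg_prepared_xacts
coordinator = ["fx_100_1_10"]             # e.g. identifier column of pg_foreign_xacts
assert orphaned_prepared_xacts(remote, coordinator) == ["fx_101_1_10"]
```

In practice the DBA would then decide per orphan whether to COMMIT PREPARED or ROLLBACK PREPARED it on the remote server, based on whether the coordinator's transaction committed.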
On 2020-09-03 23:08, Masahiko Sawada wrote: > On Fri, 28 Aug 2020 at 17:50, Masahiro Ikeda <ikedamsh@oss.nttdata.com> > wrote: >> >> > I think there is a case we can't check orphaned foreign >> > prepared transaction in pg_foreign_xacts view on the new standby >> > server. >> > It confuses users and database administrators. >> > >> > If the primary coordinator crashes after preparing foreign transaction, >> > but before sending XLOG_FDWXACT_INSERT records to the standby server, >> > the standby server can't restore their transaction status and >> > pg_foreign_xacts view doesn't show the prepared foreign transactions. >> > >> > To send XLOG_FDWXACT_INSERT records asynchronously leads this problem. >> >> If the primary replicates XLOG_FDWXACT_INSERT to the standby >> asynchronously, >> some prepared transaction may be unsolved forever. >> >> Since I think to solve this inconsistency manually is hard operation, >> we need to support synchronous XLOG_FDWXACT_INSERT replication. >> >> I understood that there are a lot of impact to the performance, >> but users can control the consistency/durability vs performance >> with synchronous_commit parameter. >> >> What do you think? > > I think the user can check such prepared transactions by seeing > transactions that exist on the foreign server's pg_prepared_xact but > not on the coordinator server's pg_foreign_xacts, no? To make checking > such prepared transactions easy, perhaps we could contain the > timestamp to prepared transaction id. But I'm concerned the > duplication of transaction id due to clock skew. Thanks for letting me know. I agreed that we can check pg_prepared_xacts and pg_foreign_xacts. We have to manually abort the transactions which exist in pg_prepared_xacts but not in pg_foreign_xacts, don't we? So users have to use a foreign database which can show prepared transaction status, like pg_foreign_xacts does. When would duplication of transaction ids occur?
I'm sorry, but I couldn't understand the clock skew concern. IIUC, since the prepared id may contain the coordinator's xid, there is no clock skew and we can determine the transaction id uniquely. If the FDW implements the GetPrepareId() API and generates a transaction id without the coordinator's xid, your concern would emerge. But I can't think of a case where a transaction id is generated without the coordinator's xid. > If there is a way to identify such unresolved foreign transactions and > it's not cumbersome, given that the likelihood of problem you're > concerned is unlikely high I guess a certain number of would be able > to accept it as a restriction. So I'd recommend not dealing with this > problem in the first version patch and we will be able to improve this > feature to deal with this problem as an additional feature. Thoughts? I agree. Thanks for your comments. >> > Thank you for letting me know. I've attached the latest version patch >> > set. >> >> Thanks for updating. >> But, the latest patches failed to be applied to the master branch. > > I'll submit the updated version patch. Thanks. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
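Ikeda-san's point above can be made concrete with a small sketch (Python). The `fx_<xid>_<serverid>_<userid>` shape is an assumption made up for this example, not necessarily the patch's real identifier format:

```python
# If the prepared-transaction identifier embeds the coordinator's xid (plus
# server/user identifiers), uniqueness comes from the coordinator's xid
# counter; no wall-clock timestamp is involved, so clock skew cannot cause
# duplicate ids.

def make_fdwxact_id(local_xid, server_oid, user_oid):
    """Hypothetical identifier built only from coordinator-local counters."""
    return "fx_%u_%u_%u" % (local_xid, server_oid, user_oid)

a = make_fdwxact_id(100, 16394, 10)
b = make_fdwxact_id(101, 16394, 10)   # next local transaction, same server
assert a == "fx_100_16394_10"
assert a != b                          # distinct xids => distinct ids, clock-free
```

Clock skew only becomes a concern if an FDW's own GetPrepareId implementation derives the id from a timestamp instead of the coordinator's xid, which is the case Ikeda-san says he cannot see arising.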
On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote: > Thank you for letting me know. I've attached the latest version patch set. This needs a rebase. Patch 0002 is conflicting with some of the recent changes done in syncrep.c and procarray.c, at least. -- Michael
Attachment
On 2020/08/21 15:25, Masahiko Sawada wrote: > On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> >> >> On 2020/07/27 15:59, Masahiko Sawada wrote: >>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: >>>> >>>> >>>> >>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >>>>> >>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote: >>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote: >>>>>>>>>> I've attached the latest version patches. I've incorporated the review >>>>>>>>>> comments I got so far and improved locking strategy. >>>>>>>>> >>>>>>>>> Thanks for updating the patch! >>>>>>>> >>>>>>>> +1 >>>>>>>> I'm interested in these patches and now studying them. While checking >>>>>>>> the behaviors of the patched PostgreSQL, I got three comments. >>>>>>> >>>>>>> Thank you for testing this patch! >>>>>>> >>>>>>>> >>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD. >>>>>>>> But in the patched version, when I did that, I got the following error. >>>>>>>> Is this intentional? >>>>>>>> >>>>>>>> ERROR: cannot assign TransactionIds during recovery >>>>>>> >>>>>>> No, it should be fixed. I'm going to fix this by not collecting >>>>>>> participants for atomic commit during recovery. >>>>>> >>>>>> Thanks for trying to fix the issues! >>>>>> >>>>>> I'd like to report one more issue. When I started new transaction >>>>>> in the local server, executed INSERT in the remote server via >>>>>> postgres_fdw and then quit psql, I got the following assertion failure. 
>>>>>> >>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) >>>>>> 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 >>>>>> 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 >>>>>> 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 >>>>>> 3 postgres 0x000000010d313fe3 shmem_exit + 179 >>>>>> 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 >>>>>> 5 postgres 0x000000010d313da3 proc_exit + 19 >>>>>> 6 postgres 0x000000010d35112f PostgresMain + 3711 >>>>>> 7 postgres 0x000000010d27bb3a BackendRun + 570 >>>>>> 8 postgres 0x000000010d27af6b BackendStartup + 475 >>>>>> 9 postgres 0x000000010d279ed1 ServerLoop + 593 >>>>>> 10 postgres 0x000000010d277940 PostmasterMain + 6016 >>>>>> 11 postgres 0x000000010d1597b9 main + 761 >>>>>> 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 >>>>>> 13 ??? 0x0000000000000003 0x0 + 3 >>>>>> >>>>> >>>>> Thank you for reporting the issue! >>>>> >>>>> I've attached the latest version patch that incorporated all comments >>>>> I got so far. I've removed the patch adding the 'prefer' mode of >>>>> foreign_twophase_commit to keep the patch set simple. >>>> >>>> >>>> I have started to review the patchset. Just a quick comment. >>>> >>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch >>>> contains changes (adding fdwxact includes) for >>>> src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c >>>> and src/backend/executor/execPartition.c files that doesn't seem to be >>>> required with the latest version. >>> >>> Thanks for your comment. >>> >>> Right. I've removed these changes on the local branch. >> >> The latest patches failed to be applied to the master branch. Could you rebase the patches? >> > > Thank you for letting me know. I've attached the latest version patch set. Thanks for updating the patch! IMO it's not easy to commit this 2PC patch at once because it's still large and complicated. 
So I'm thinking it's better to separate the feature into several parts and commit them gradually. What about separating the feature into the following parts?

#1
Originally the server just executed the xact callbacks that each FDW registered when the transaction was committed. The patch changes this so that the server manages the FDW participants in the transaction and triggers them to execute COMMIT or ROLLBACK. IMO this change can be applied without the 2PC feature. Thoughts?

Even if we commit this patch and add a new interface for FDWs, we would need to keep the old interface for FDWs that provide only the old one.

#2
Originally, when there was FDW access in the transaction, PREPARE TRANSACTION on that transaction failed with an error. The patch allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED even when FDW access occurs in the transaction. IMO this change can be applied without the *automatic* 2PC feature (i.e., the feature where PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are automatically executed for each FDW inside the "top" COMMIT command). Thoughts?

I'm not sure yet whether automatic resolution of "unresolved" prepared transactions by the resolver process is necessary for this change or not. If it's not necessary, it's better to exclude the resolver process from this change, at this stage, to make the patch simpler.

#3
Finally, IMO we can provide the patch supporting "automatic" 2PC for each FDW, based on the #1 and #2 patches.

What's your opinion about this?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
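[Editor's sketch] Part #1 above — the server tracking the FDW participants of a transaction and driving COMMIT/ROLLBACK on all of them — can be sketched roughly as follows. All names here (FdwParticipantRegistry, register, end_transaction) are hypothetical illustrations, not the patch's actual API.

```python
# Rough sketch: the core tracks FDW participants per transaction and triggers
# COMMIT/ROLLBACK on each of them, instead of each FDW acting independently
# through its own registered xact callback.

class FdwParticipantRegistry:
    def __init__(self):
        self.participants = {}  # server name -> {"commit": fn, "rollback": fn}

    def register(self, server, commit_cb, rollback_cb):
        # Called the first time a foreign server is touched in the transaction.
        self.participants[server] = {"commit": commit_cb, "rollback": rollback_cb}

    def end_transaction(self, committed):
        # At top-level COMMIT/ABORT, the core drives every participant.
        action = "commit" if committed else "rollback"
        results = {name: cbs[action]() for name, cbs in self.participants.items()}
        self.participants.clear()
        return results
```

With such a registry, a single point in the commit path decides whether every remote transaction commits or rolls back, which is the prerequisite for layering 2PC on top later.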
On 2020/09/07 17:59, Fujii Masao wrote: > > > On 2020/08/21 15:25, Masahiko Sawada wrote: >> On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>> >>> >>> >>> On 2020/07/27 15:59, Masahiko Sawada wrote: >>>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: >>>>> >>>>> >>>>> >>>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: >>>>>> >>>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote: >>>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote: >>>>>>>>>>> I've attached the latest version patches. I've incorporated the review >>>>>>>>>>> comments I got so far and improved locking strategy. >>>>>>>>>> >>>>>>>>>> Thanks for updating the patch! >>>>>>>>> >>>>>>>>> +1 >>>>>>>>> I'm interested in these patches and now studying them. While checking >>>>>>>>> the behaviors of the patched PostgreSQL, I got three comments. >>>>>>>> >>>>>>>> Thank you for testing this patch! >>>>>>>> >>>>>>>>> >>>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD. >>>>>>>>> But in the patched version, when I did that, I got the following error. >>>>>>>>> Is this intentional? >>>>>>>>> >>>>>>>>> ERROR: cannot assign TransactionIds during recovery >>>>>>>> >>>>>>>> No, it should be fixed. I'm going to fix this by not collecting >>>>>>>> participants for atomic commit during recovery. >>>>>>> >>>>>>> Thanks for trying to fix the issues! >>>>>>> >>>>>>> I'd like to report one more issue. When I started new transaction >>>>>>> in the local server, executed INSERT in the remote server via >>>>>>> postgres_fdw and then quit psql, I got the following assertion failure. 
>>>>>>> >>>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) >>>>>>> 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 >>>>>>> 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 >>>>>>> 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 >>>>>>> 3 postgres 0x000000010d313fe3 shmem_exit + 179 >>>>>>> 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 >>>>>>> 5 postgres 0x000000010d313da3 proc_exit + 19 >>>>>>> 6 postgres 0x000000010d35112f PostgresMain + 3711 >>>>>>> 7 postgres 0x000000010d27bb3a BackendRun + 570 >>>>>>> 8 postgres 0x000000010d27af6b BackendStartup + 475 >>>>>>> 9 postgres 0x000000010d279ed1 ServerLoop + 593 >>>>>>> 10 postgres 0x000000010d277940 PostmasterMain + 6016 >>>>>>> 11 postgres 0x000000010d1597b9 main + 761 >>>>>>> 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 >>>>>>> 13 ??? 0x0000000000000003 0x0 + 3 >>>>>>> >>>>>> >>>>>> Thank you for reporting the issue! >>>>>> >>>>>> I've attached the latest version patch that incorporated all comments >>>>>> I got so far. I've removed the patch adding the 'prefer' mode of >>>>>> foreign_twophase_commit to keep the patch set simple. >>>>> >>>>> >>>>> I have started to review the patchset. Just a quick comment. >>>>> >>>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch >>>>> contains changes (adding fdwxact includes) for >>>>> src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c >>>>> and src/backend/executor/execPartition.c files that doesn't seem to be >>>>> required with the latest version. >>>> >>>> Thanks for your comment. >>>> >>>> Right. I've removed these changes on the local branch. >>> >>> The latest patches failed to be applied to the master branch. Could you rebase the patches? >>> >> >> Thank you for letting me know. I've attached the latest version patch set. > > Thanks for updating the patch! > > IMO it's not easy to commit this 2PC patch at once because it's still large > and complicated. 
So I'm thinking it's better to separate the feature into > several parts and commit them gradually. What about separating > the feature into the following parts? > > #1 > Originally the server just executed xact callback that each FDW registered > when the transaction was committed. The patch changes this so that > the server manages the participants of FDW in the transaction and triggers > them to execute COMMIT or ROLLBACK. IMO this change can be applied > without 2PC feature. Thought? > > Even if we commit this patch and add new interface for FDW, we would > need to keep the old interface, for the FDW providing only old interface. > > > #2 > Originally when there was the FDW access in the transaction, > PREPARE TRANSACTION on that transaction failed with an error. The patch > allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED > even when FDW access occurs in the transaction. IMO this change can be > applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and > COMMIT/ROLLBACK PREPARED are automatically executed for each FDW > inside "top" COMMIT command). Thought? > > I'm not sure yet whether automatic resolution of "unresolved" prepared > transactions by the resolver process is necessary for this change or not. > If it's not necessary, it's better to exclude the resolver process from this > change, at this stage, to make the patch simpler. > > > #3 > Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW, > based on the #1 and #2 patches. > > > What's your opinion about this? Also I'd like to report some typos in the patch. +#define ServerSupportTransactionCallack(fdw_part) \ "Callack" in this macro name should be "Callback"? +#define SeverSupportTwophaseCommit(fdw_part) \ "Sever" in this macro name should be "Server"? + proname => 'pg_stop_foreing_xact_resolver', provolatile => 'v', prorettype => 'bool', "foreing" should be "foreign"? 
+ * FdwXact entry we call get_preparedid callback to get a transaction

"get_preparedid" should be "get_prepareid"?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> IMO it's not easy to commit this 2PC patch at once because it's still large
> and complicated. So I'm thinking it's better to separate the feature into
> several parts and commit them gradually.
>

Hmm, I don't see that we have a consensus on the design and/or interfaces of this patch, and without that, proceeding to commit doesn't seem advisable. Here are a few points I remember offhand that require more work.

1. There is a competing design proposed and being discussed in another thread [1] for this purpose. I think both approaches have pros and cons, but there doesn't seem to be any conclusion yet on which one is better.

2. In this thread, we have discussed trying to integrate this patch with some other FDWs (say MySQL, mongodb, etc.) to ensure that the APIs we are exposing are general enough that other FDWs can use them to implement 2PC. I could see some speculation about this, but no concrete work has been done.

3. In another thread [1], we have seen that the patch being discussed in this thread might need to be re-designed if we have to use some other design for global visibility than what is proposed in that thread. I think that is quite likely to happen, considering that no one has yet come up with a solution to the major design problems spotted in that patch.

It appears to me that even though these points were raised before in some form, we are just trying to bypass them to commit whatever we have in the current patch, which I find quite surprising.

[1] - https://www.postgresql.org/message-id/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
On 2020/09/08 10:34, Amit Kapila wrote:
> On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>> IMO it's not easy to commit this 2PC patch at once because it's still large
>> and complicated. So I'm thinking it's better to separate the feature into
>> several parts and commit them gradually.
>>
>
> Hmm, I don't see that we have a consensus on the design and or
> interfaces of this patch and without that proceeding for commit
> doesn't seem advisable. Here are a few points which I remember offhand
> that require more work.

Thanks!

> 1. There is a competing design proposed and being discussed in another
> thread [1] for this purpose. I think both the approaches have pros and
> cons but there doesn't seem to be any conclusion yet on which one is
> better.

I was thinking that [1] was discussing the global snapshot feature for "atomic visibility" rather than a solution like 2PC for "atomic commit". But if another approach for "atomic commit" was also proposed at [1], that's good. I will check that.

> 2. In this thread, we have discussed to try integrating this patch
> with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
> APIs we are exposing are general enough that other FDWs can use them
> to implement 2PC. I could see some speculations about the same but no
> concrete work on the same has been done.

Yes, you're right.

> 3. In another thread [1], we have seen that the patch being discussed
> in this thread might need to re-designed if we have to use some other
> design for global-visibility than what is proposed in that thread. I
> think it is quite likely that can happen considering no one is able to
> come up with the solution to major design problems spotted in that
> patch yet.

Do you mean that the global-visibility patch should come first, before the "2PC" patch?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > On 2020/09/08 10:34, Amit Kapila wrote: > > On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> IMO it's not easy to commit this 2PC patch at once because it's still large > >> and complicated. So I'm thinking it's better to separate the feature into > >> several parts and commit them gradually. > >> > > > > Hmm, I don't see that we have a consensus on the design and or > > interfaces of this patch and without that proceeding for commit > > doesn't seem advisable. Here are a few points which I remember offhand > > that require more work. > > Thanks! > > > 1. There is a competing design proposed and being discussed in another > > thread [1] for this purpose. I think both the approaches have pros and > > cons but there doesn't seem to be any conclusion yet on which one is > > better. > > I was thinking that [1] was discussing global snapshot feature for > "atomic visibility" rather than the solution like 2PC for "atomic commit". > But if another approach for "atomic commit" was also proposed at [1], > that's good. I will check that. > Okay, that makes sense. > > 2. In this thread, we have discussed to try integrating this patch > > with some other FDWs (say MySQL, mongodb, etc.) to ensure that the > > APIs we are exposing are general enough that other FDWs can use them > > to implement 2PC. I could see some speculations about the same but no > > concrete work on the same has been done. > > Yes, you're right. > > > 3. In another thread [1], we have seen that the patch being discussed > > in this thread might need to re-designed if we have to use some other > > design for global-visibility than what is proposed in that thread. I > > think it is quite likely that can happen considering no one is able to > > come up with the solution to major design problems spotted in that > > patch yet. 
> > You imply that global-visibility patch should be come first before "2PC" patch? > I intend to say that the global-visibility work can impact this in a major way and we have analyzed that to some extent during a discussion on the other thread. So, I think without having a complete design/solution that addresses both the 2PC and global-visibility, it is not apparent what is the right way to proceed. It seems to me that rather than working on individual (or smaller) parts one needs to come up with a bigger picture (or overall design) and then once we have figured that out correctly, it would be easier to decide which parts can go first. -- With Regards, Amit Kapila.
RE: Transactions involving multiple postgres foreign servers, take 2
From: Amit Kapila <amit.kapila16@gmail.com>
> I intend to say that the global-visibility work can impact this in a
> major way and we have analyzed that to some extent during a discussion
> on the other thread. So, I think without having a complete
> design/solution that addresses both the 2PC and global-visibility, it
> is not apparent what is the right way to proceed. It seems to me that
> rather than working on individual (or smaller) parts one needs to come
> up with a bigger picture (or overall design) and then once we have
> figured that out correctly, it would be easier to decide which parts
> can go first.

I'm really sorry I've been getting late and late and late to publish the revised scale-out design wiki to discuss the big picture! I don't know why I'm taking this long; I feel as if I were captive in a time prison (yes, nobody is holding me captive; I'm just late.) Please wait a few days.

But to proceed with the development, let me comment on the atomic commit and global visibility.

* We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and whether we can avoid it.

* I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB. That is, 2PC for multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly available, and timestamp-based global visibility. However, the timestamp-based approach makes the database instance shut down when the node's clock is distant from the other nodes'.

* Or, maybe we can use the following Commitment ordering, which doesn't require timestamps or any other information to be transferred among the cluster nodes. However, this seems to have to track the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.
The MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult.) Could anybody kindly interpret this?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues to be addressed:

1. Make the FDW API implementable by FDWs other than postgres_fdw (this is what Amit-san kindly pointed out.) I think oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA feature as SQL statements, not C functions as defined in the XA specification.

2. 2PC processing is queued and serialized in one background worker. That severely subdues transaction throughput. Each backend should perform 2PC.

3. postgres_fdw cannot detect remote updates when a UDF executed on a remote node updates data.

Regards
Takayuki Tsunakawa
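[Editor's sketch] Point 2 above — each backend performing 2PC itself, overlapping the per-server round trips instead of funneling everything through one background worker — could look roughly like this. `prepare_on` is a hypothetical stand-in for the "PREPARE TRANSACTION" network round trip to one server; none of these names come from the patch.

```python
# Sketch: issue PREPARE to all participants in parallel from the backend
# itself, rather than queueing the work to a single serialized worker.
from concurrent.futures import ThreadPoolExecutor

def prepare_all(servers, prepare_on):
    # One network round trip per server, overlapped rather than sequential.
    with ThreadPoolExecutor(max_workers=max(len(servers), 1)) as pool:
        results = dict(zip(servers, pool.map(prepare_on, servers)))
    # Standard 2PC rule: move to the commit phase only if every participant
    # prepared successfully; otherwise all participants must be rolled back.
    return all(results.values()), results
```

The total prepare latency then approaches the slowest single round trip instead of the sum of all round trips, which is what the "serialization point" concern is about.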
On Mon, 7 Sep 2020 at 17:59, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/08/21 15:25, Masahiko Sawada wrote: > > On Fri, 21 Aug 2020 at 00:36, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> > >> On 2020/07/27 15:59, Masahiko Sawada wrote: > >>> On Thu, 23 Jul 2020 at 22:51, Muhammad Usama <m.usama@gmail.com> wrote: > >>>> > >>>> > >>>> > >>>> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > >>>>> > >>>>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>>> > >>>>>> > >>>>>> > >>>>>> On 2020/07/16 14:47, Masahiko Sawada wrote: > >>>>>>> On Tue, 14 Jul 2020 at 11:19, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On 2020/07/14 9:08, Masahiro Ikeda wrote: > >>>>>>>>>> I've attached the latest version patches. I've incorporated the review > >>>>>>>>>> comments I got so far and improved locking strategy. > >>>>>>>>> > >>>>>>>>> Thanks for updating the patch! > >>>>>>>> > >>>>>>>> +1 > >>>>>>>> I'm interested in these patches and now studying them. While checking > >>>>>>>> the behaviors of the patched PostgreSQL, I got three comments. > >>>>>>> > >>>>>>> Thank you for testing this patch! > >>>>>>> > >>>>>>>> > >>>>>>>> 1. We can access to the foreign table even during recovery in the HEAD. > >>>>>>>> But in the patched version, when I did that, I got the following error. > >>>>>>>> Is this intentional? > >>>>>>>> > >>>>>>>> ERROR: cannot assign TransactionIds during recovery > >>>>>>> > >>>>>>> No, it should be fixed. I'm going to fix this by not collecting > >>>>>>> participants for atomic commit during recovery. > >>>>>> > >>>>>> Thanks for trying to fix the issues! > >>>>>> > >>>>>> I'd like to report one more issue. When I started new transaction > >>>>>> in the local server, executed INSERT in the remote server via > >>>>>> postgres_fdw and then quit psql, I got the following assertion failure. 
> >>>>>> > >>>>>> TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570) > >>>>>> 0 postgres 0x000000010d52f3c0 ExceptionalCondition + 160 > >>>>>> 1 postgres 0x000000010cefbc49 ForgetAllFdwXactParticipants + 313 > >>>>>> 2 postgres 0x000000010cefff14 AtProcExit_FdwXact + 20 > >>>>>> 3 postgres 0x000000010d313fe3 shmem_exit + 179 > >>>>>> 4 postgres 0x000000010d313e7a proc_exit_prepare + 122 > >>>>>> 5 postgres 0x000000010d313da3 proc_exit + 19 > >>>>>> 6 postgres 0x000000010d35112f PostgresMain + 3711 > >>>>>> 7 postgres 0x000000010d27bb3a BackendRun + 570 > >>>>>> 8 postgres 0x000000010d27af6b BackendStartup + 475 > >>>>>> 9 postgres 0x000000010d279ed1 ServerLoop + 593 > >>>>>> 10 postgres 0x000000010d277940 PostmasterMain + 6016 > >>>>>> 11 postgres 0x000000010d1597b9 main + 761 > >>>>>> 12 libdyld.dylib 0x00007fff7161e3d5 start + 1 > >>>>>> 13 ??? 0x0000000000000003 0x0 + 3 > >>>>>> > >>>>> > >>>>> Thank you for reporting the issue! > >>>>> > >>>>> I've attached the latest version patch that incorporated all comments > >>>>> I got so far. I've removed the patch adding the 'prefer' mode of > >>>>> foreign_twophase_commit to keep the patch set simple. > >>>> > >>>> > >>>> I have started to review the patchset. Just a quick comment. > >>>> > >>>> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch > >>>> contains changes (adding fdwxact includes) for > >>>> src/backend/executor/nodeForeignscan.c, src/backend/executor/nodeModifyTable.c > >>>> and src/backend/executor/execPartition.c files that doesn't seem to be > >>>> required with the latest version. > >>> > >>> Thanks for your comment. > >>> > >>> Right. I've removed these changes on the local branch. > >> > >> The latest patches failed to be applied to the master branch. Could you rebase the patches? > >> > > > > Thank you for letting me know. I've attached the latest version patch set. > > Thanks for updating the patch! 
> > IMO it's not easy to commit this 2PC patch at once because it's still large > and complicated. So I'm thinking it's better to separate the feature into > several parts and commit them gradually. What about separating > the feature into the following parts? > > #1 > Originally the server just executed xact callback that each FDW registered > when the transaction was committed. The patch changes this so that > the server manages the participants of FDW in the transaction and triggers > them to execute COMMIT or ROLLBACK. IMO this change can be applied > without 2PC feature. Thought? > > Even if we commit this patch and add new interface for FDW, we would > need to keep the old interface, for the FDW providing only old interface. > > > #2 > Originally when there was the FDW access in the transaction, > PREPARE TRANSACTION on that transaction failed with an error. The patch > allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED > even when FDW access occurs in the transaction. IMO this change can be > applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and > COMMIT/ROLLBACK PREPARED are automatically executed for each FDW > inside "top" COMMIT command). Thought? > > I'm not sure yet whether automatic resolution of "unresolved" prepared > transactions by the resolver process is necessary for this change or not. > If it's not necessary, it's better to exclude the resolver process from this > change, at this stage, to make the patch simpler. > > > #3 > Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW, > based on the #1 and #2 patches. > > > What's your opinion about this? Regardless of which approaches of 2PC implementation being selected splitting the patch into logical small patches is a good idea and the above suggestion makes sense to me. 
Regarding #2, I guess that we would need the resolver and launcher processes even if we supported only the manual PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED commands. On a COMMIT PREPARED command, I think we should commit the local prepared transaction first and then commit the foreign prepared transactions. Otherwise, atomic commit is violated if the local node fails to commit a foreign prepared transaction and the user then switches to ROLLBACK PREPARED. OTOH, once we have committed locally, we cannot switch to rollback. And attempting to commit foreign prepared transactions could lead to an error due to a connection failure, OOM caused by palloc, etc. Therefore we discussed using background processes, the resolver and the launcher, to take charge of committing foreign prepared transactions, so that the process that executed COMMIT PREPARED never errors out after the local commit.

So I think patch #2 will also include the part adding the resolver and launcher processes. And in patch #3 we will change the code to support automatic 2PC as you suggested. In addition, the automatic resolution of in-doubt transactions can also be a separate patch, which will be patch #4.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
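[Editor's sketch] The commit ordering described above — local commit first, then the foreign commits, with failures handed off for later resolution instead of being raised to the user — might look roughly like this. `Coordinator`, `commit_prepared`, and the `unresolved` list are illustrative names only; in the patch the hand-off target is the resolver/launcher background processes.

```python
# Sketch of the commit ordering: local COMMIT PREPARED first, then foreign
# prepared transactions, queueing failures for a resolver to retry.
from dataclasses import dataclass, field

@dataclass
class Coordinator:
    unresolved: list = field(default_factory=list)  # (server, gid) pairs for the resolver
    log: list = field(default_factory=list)

    def commit_prepared(self, gid, foreign_commits):
        # 1. Commit the local prepared transaction first; after this point
        #    switching to rollback is no longer possible.
        self.log.append(f"local COMMIT PREPARED '{gid}'")
        # 2. Then try each foreign prepared transaction.  A failure must not
        #    be raised to the user after the local commit, so the gid is
        #    queued for background retry instead.
        for server, commit_fn in foreign_commits.items():
            try:
                commit_fn()
                self.log.append(f"{server}: committed")
            except Exception:
                self.unresolved.append((server, gid))
```

This shows why the background processes seem needed even for the manual commands: the only safe reaction to a post-local-commit failure is to remember the foreign transaction and retry, never to error out.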
On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> #2
> Originally when there was the FDW access in the transaction,
> PREPARE TRANSACTION on that transaction failed with an error. The patch
> allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
> even when FDW access occurs in the transaction. IMO this change can be
> applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
> COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
> inside "top" COMMIT command). Thought?
>
> I'm not sure yet whether automatic resolution of "unresolved" prepared
> transactions by the resolver process is necessary for this change or not.
> If it's not necessary, it's better to exclude the resolver process from this
> change, at this stage, to make the patch simpler.

I agree with this. However, in the case of an explicit prepare, if we are not going to try automatic resolution, it might be better to provide a way to report the transactions prepared on the foreign servers that could not be resolved at commit time, so that the user can take it up and resolve them him/herself. This was an idea that Tom had suggested at the very beginning of the first take.

-- 
Best Wishes,
Ashutosh Bapat
On 2020/09/08 12:03, Amit Kapila wrote: > On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> >> On 2020/09/08 10:34, Amit Kapila wrote: >>> On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >>>> >>>> IMO it's not easy to commit this 2PC patch at once because it's still large >>>> and complicated. So I'm thinking it's better to separate the feature into >>>> several parts and commit them gradually. >>>> >>> >>> Hmm, I don't see that we have a consensus on the design and or >>> interfaces of this patch and without that proceeding for commit >>> doesn't seem advisable. Here are a few points which I remember offhand >>> that require more work. >> >> Thanks! >> >>> 1. There is a competing design proposed and being discussed in another >>> thread [1] for this purpose. I think both the approaches have pros and >>> cons but there doesn't seem to be any conclusion yet on which one is >>> better. >> >> I was thinking that [1] was discussing global snapshot feature for >> "atomic visibility" rather than the solution like 2PC for "atomic commit". >> But if another approach for "atomic commit" was also proposed at [1], >> that's good. I will check that. >> > > Okay, that makes sense. I read Alexey's 2PC patch (0001-Add-postgres_fdw.use_twophase-GUC-to-use-2PC.patch) proposed at [1]. As Alexey told at that thread, there are two big differences between his patch and Sawada-san's; 1) whether there is the resolver process for foreign transactions, 2) 2PC logic is implemented only inside postgres_fdw or both FDW and PostgreSQL core. I think that 2) is the first decision point. Alexey's 2PC patch is very simple and all the 2PC logic is implemented only inside postgres_fdw. But this means that 2PC is not usable if multiple types of FDW (e.g., postgres_fdw and mysql_fdw) participate at the transaction. This may be ok if we implement 2PC feature only for PostgreSQL sharding using postgres_fdw. 
But if we implement 2PC as an improvement to FDWs, independently of PostgreSQL sharding, I think it's necessary to support other FDWs. And this is our direction, isn't it?

Sawada-san's patch supports that case by implementing some components for that in PostgreSQL core as well. For example, with the patch, all the remote transactions that participate in the transaction are managed by PostgreSQL core instead of the postgres_fdw layer.

Therefore, at least regarding difference 2), I think that Sawada-san's approach is better. Thoughts?

[1] https://postgr.es/m/3ef7877bfed0582019eab3d462a43275@postgrespro.ru

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
RE: Transactions involving multiple postgres foreign servers, take 2
Alexey-san, Sawada-san,
cc: Fujii-san,


From: Fujii Masao <masao.fujii@oss.nttdata.com>
>> But if we
>> implement 2PC as the improvement on FDW independently from PostgreSQL
>> sharding, I think that it's necessary to support other FDW. And this is our
>> direction, isn't it?

I understand the same way as Fujii-san. 2PC FDW is itself useful, so I think we should pursue a tidy FDW interface and good performance within the FDW framework. "Tidy" means that many other FDWs should be able to implement it. I guess XA/JTA is the only material we can use to consider whether the FDW interface is good.

> Sawada-san's patch supports that case by implememnting some conponents
> for that also in PostgreSQL core. For example, with the patch, all the remote
> transactions that participate at the transaction are managed by PostgreSQL
> core instead of postgres_fdw layer.
>
> Therefore, at least regarding the difference 2), I think that Sawada-san's
> approach is better. Thought?

I think so. Sawada-san's patch needs to address the design issues I posed before digging into the code for thorough review, though.

BTW, is there something Sawada-san can take from Alexey-san's patch? I'm concerned about the performance for practical use. Do you two have differences in these points, for instance? The first two items are often cited to evaluate an algorithm's performance, as you know.

* The number of round trips to remote nodes.
* The number of disk I/Os on each node and all nodes in total (WAL, two-phase file, pg_subtrans file, CLOG?).
* Are prepare and commit executed in parallel on remote nodes? (serious DBMSs do so)
* Is there any serialization point in the processing? (Sawada-san's has one)

I'm sorry to repeat myself, but I don't think we can compromise the 2PC performance. Of course, we recommend that users design a schema that co-locates the data each transaction accesses so as to avoid 2PC, but that's not always possible (e.g., when secondary indexes are used.)
Plus, as the following quote from the TPC-C specification shows, TPC-C requires 15% of (Payment?) transactions to do 2PC. (I learned this from Microsoft's, CockroachDB's, or Citus Data's site.)

--------------------------------------------------
Independent of the mode of selection, the customer resident warehouse is the home warehouse 85% of the time and is a randomly selected remote warehouse 15% of the time. This can be implemented by generating two random numbers x and y within [1 .. 100];

. If x <= 85 a customer is selected from the selected district number (C_D_ID = D_ID) and the home warehouse number (C_W_ID = W_ID). The customer is paying through his/her own warehouse.

. If x > 85 a customer is selected from a random district number (C_D_ID is randomly selected within [1 .. 10]), and a random remote warehouse number (C_W_ID is randomly selected within the range of active warehouses (see Clause 4.2.2), and C_W_ID ≠ W_ID). The customer is paying through a warehouse and a district other than his/her own.
--------------------------------------------------

Regards
Takayuki Tsunakawa
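[Editor's sketch] The quoted selection rule can be written out directly; this simplified illustration covers only the warehouse choice (ignoring the district number y), and the function name is ours, not from the specification.

```python
# TPC-C Payment warehouse selection, per the quoted clause:
# x in [1..100]; x <= 85 -> home warehouse, x > 85 -> random remote warehouse.
import random

def select_customer_warehouse(w_id, active_warehouses, rng=random):
    x = rng.randint(1, 100)
    remote_choices = [w for w in active_warehouses if w != w_id]
    if x <= 85 or not remote_choices:
        return w_id                        # home warehouse: single-node case
    return rng.choice(remote_choices)      # remote warehouse: would need 2PC
```

Run over many transactions, roughly 15% pick a remote warehouse, which is the fraction of Payment transactions that would hit the distributed-commit path.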
On 2020/09/10 10:13, tsunakawa.takay@fujitsu.com wrote: > Alexey-san, Sawada-san, > cc: Fujii-san, > > > From: Fujii Masao <masao.fujii@oss.nttdata.com> >> But if we >> implement 2PC as the improvement on FDW independently from PostgreSQL >> sharding, I think that it's necessary to support other FDW. And this is our >> direction, isn't it? > > I understand it the same way as Fujii-san. 2PC FDW is itself useful, so I think we should pursue a tidy FDW interface and good performance within the FDW framework. "tidy" means that many other FDWs should be able to implement it. I guess XA/JTA is the only material we can use to consider whether the FDW interface is good. Originally start(), commit() and rollback() are supported as FDW interfaces. With his patch, prepare() is supported. What other interfaces need to be supported per XA/JTA? As far as I and Sawada-san discussed this upthread, to support MySQL, another type of start() would be necessary to issue the "XA START id" command. end() might also be necessary to issue "XA END id", but that command can be issued via prepare() together with "XA PREPARE id". I'm not familiar with XA/JTA and XA transaction interfaces on other major DBMSs. So I'd like to know what other interfaces are necessary additionally? > > >> Sawada-san's patch supports that case by implementing some components >> for that also in PostgreSQL core. For example, with the patch, all the remote >> transactions that participate in the transaction are managed by PostgreSQL >> core instead of the postgres_fdw layer. >> >> Therefore, at least regarding the difference 2), I think that Sawada-san's >> approach is better. Thought? > > I think so. Sawada-san's patch needs to address the design issues I posed before digging into the code for a thorough review, though. > > BTW, is there something Sawada-san can take from Alexey-san's patch? I'm concerned about the performance for practical use. Do you two have differences in these points, for instance? 
IMO Sawada-san's version of 2PC is less performant, but that's because his patch provides more functionality. For example, with his patch, WAL is written to automatically complete the unresolved foreign transactions in the case of failure. OTOH, Alexey's patch introduces no new WAL for 2PC. Of course, generating more WAL would cause more overhead. But if we need the automatic resolution feature, it's inevitable to introduce new WAL whichever patch we choose. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Amit Kapila <amit.kapila16@gmail.com> > > I intend to say that the global-visibility work can impact this in a > > major way and we have analyzed that to some extent during a discussion > > on the other thread. So, I think without having a complete > > design/solution that addresses both the 2PC and global-visibility, it > > is not apparent what is the right way to proceed. It seems to me that > > rather than working on individual (or smaller) parts one needs to come > > up with a bigger picture (or overall design) and then once we have > > figured that out correctly, it would be easier to decide which parts > > can go first. > > I'm really sorry I've been getting late and late and late to publish the revised scale-out design wiki to discuss the big picture! I don't know why I'm taking this long; I feel as if I were captive in a time prison (yes, nobody is holding me captive; I'm just late.) Please wait a few days. > > But to proceed with the development, let me comment on the atomic commit and global visibility. > > * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and if we can avoid it. > > * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB. That is, 2PC for multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly available, and the timestamp-based global visibility. However, the timestamp-based approach makes the database instance shut down when the node's clock is distant from the other nodes. > > * Or, maybe we can use the following Commitment ordering, which doesn't require the timestamp or any other information to be transferred among the cluster nodes. 
However, this seems to have to track the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance. The MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult.) Could anybody kindly interpret this? > > Commitment ordering (CO) - yoavraz2 > https://sites.google.com/site/yoavraz2/the_principle_of_co > > > As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues to be addressed: > > 1. Make the FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.) I think oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA feature as SQL statements, not C functions as defined in the XA specification. I agree that we need to verify that the new FDW APIs will be suitable for other FDWs than postgres_fdw as well. > > 2. 2PC processing is queued and serialized in one background worker. That severely subdues transaction throughput. Each backend should perform 2PC. I'm not sure it's safe that each backend performs PREPARE and COMMIT PREPARED, since the current design is intended to avoid an inconsistency between the actual transaction result and the result the user sees. But in the future, I think we can have multiple background workers per database for better performance. > > 3. postgres_fdw cannot detect remote updates when a UDF executed on a remote node updates data. I assume that you mean pushing the UDF down to a foreign server. If so, I think we can do this by improving postgres_fdw. In the current patch, registering and unregistering a foreign server to a group of 2PC and marking a foreign server as updated is the FDW's responsibility. So perhaps if we had a way to tell postgres_fdw that the UDF might update the data on the foreign server, postgres_fdw could mark the foreign server as updated if the UDF is shippable. 
Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020/09/11 0:37, Masahiko Sawada wrote: > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: >> >> From: Amit Kapila <amit.kapila16@gmail.com> >>> I intend to say that the global-visibility work can impact this in a >>> major way and we have analyzed that to some extent during a discussion >>> on the other thread. So, I think without having a complete >>> design/solution that addresses both the 2PC and global-visibility, it >>> is not apparent what is the right way to proceed. It seems to me that >>> rather than working on individual (or smaller) parts one needs to come >>> up with a bigger picture (or overall design) and then once we have >>> figured that out correctly, it would be easier to decide which parts >>> can go first. >> >> I'm really sorry I've been getting late and late and late to publish the revised scale-out design wiki to discuss the big picture! I don't know why I'm taking this long; I feel as if I were captive in a time prison (yes, nobody is holding me captive; I'm just late.) Please wait a few days. >> >> But to proceed with the development, let me comment on the atomic commit and global visibility. >> >> * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and if we can avoid it. >> >> * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB. That is, 2PC for multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly available, and the timestamp-based global visibility. However, the timestamp-based approach makes the database instance shut down when the node's clock is distant from the other nodes. >> >> * Or, maybe we can use the following Commitment ordering, which doesn't require the timestamp or any other information to be transferred among the cluster nodes. 
However, this seems to have to track the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance. The MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult.) Could anybody kindly interpret this? >> >> Commitment ordering (CO) - yoavraz2 >> https://sites.google.com/site/yoavraz2/the_principle_of_co >> >> >> As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues to be addressed: >> >> 1. Make the FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.) I think oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA feature as SQL statements, not C functions as defined in the XA specification. > > I agree that we need to verify that the new FDW APIs will be suitable for other > FDWs than postgres_fdw as well. > >> >> 2. 2PC processing is queued and serialized in one background worker. That severely subdues transaction throughput. Each backend should perform 2PC. > > Not sure it's safe that each backend performs PREPARE and COMMIT > PREPARED since the current design is intended to avoid an inconsistency > between the actual transaction result and the result the user sees. Can I check my understanding about why the resolver process is necessary? Firstly, you think that issuing the COMMIT PREPARED command to the foreign server can cause an error, for example, because of a connection error, OOM, etc. On the other hand, only waiting for another process to issue the command is less likely to cause an error. Right? If an error occurs in the backend process after the commit record is WAL-logged, the error would be reported to the client and it may misunderstand that the transaction failed even though the commit record was already flushed. 
So you think that each backend should not issue the COMMIT PREPARED command to avoid that inconsistency. To avoid that, it's better to make another process, the resolver, issue the command and just make each backend wait for that to complete. Right? Also using the resolver process has another merit; when there are unresolved foreign transactions but the corresponding backend exits, the resolver can try to resolve them. If something like this automatic resolution is necessary, a process like the resolver would be necessary. Right? To the contrary, if we don't need such automatic resolution (i.e., unresolved foreign transactions always need to be resolved manually) and we can prevent the code that issues the COMMIT PREPARED command from causing an error (not sure if that's possible, though...), probably we don't need the resolver process. Right? > But in the future, I think we can have multiple background workers per > database for better performance. Yes, that's an idea. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
RE: Transactions involving multiple postgres foreign servers, take 2
From: Fujii Masao <masao.fujii@oss.nttdata.com> > Originally start(), commit() and rollback() are supported as FDW interfaces. > As far as I and Sawada-san discussed this upthread, to support MySQL, > another type of start() would be necessary to issue the "XA START id" command. > end() might also be necessary to issue "XA END id", but that command can be > issued via prepare() together with "XA PREPARE id". Yeah, I think we can call xa_end and xa_prepare in the FDW's prepare function. The issue is when to call xa_start, which requires an XID as an argument. We don't want to call it in transactions that access only one node...? > With his patch, prepare() is supported. What other interfaces need to be > supported per XA/JTA? > > I'm not familiar with XA/JTA and XA transaction interfaces on other major > DBMSs. So I'd like to know what other interfaces are necessary additionally? I think xa_start, xa_end, xa_prepare, xa_commit, xa_rollback, and xa_recover are sufficient. The XA specification is here: https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf You can see the function reference in Chapter 5, and the concepts in Chapter 3. Chapter 6 probably shows the state transitions (function call sequences.) > IMO Sawada-san's version of 2PC is less performant, but it's because his > patch provides more functionality. For example, with his patch, WAL is written > to automatically complete the unresolved foreign transactions in the case of > failure. OTOH, Alexey's patch introduces no new WAL for 2PC. > Of course, generating more WAL would cause more overhead. > But if we need the automatic resolution feature, it's inevitable to introduce new > WAL whichever patch we choose. Please do not get me wrong. I know Sawada-san is trying to ensure durability. I just wanted to know what each patch does and at how much cost in terms of disk and network I/Os, and whether one patch can take something from another at less cost. 
I'm simply guessing (without having read the code yet) that each transaction basically does: - two round trips (prepare, commit) to each remote node - two WAL writes (prepare, commit) on the local node and each remote node - one write for the two-phase state file on each remote node - one write to record participants on the local node It felt hard to think about the algorithm's efficiency from the source code. As you may have seen, DBMS textbooks and/or papers describe disk and network I/Os to evaluate algorithms. I thought such information would be useful before going deeper into the source code. Maybe such things can be written in the following Sawada-san's wiki or a README in the end. Atomic Commit of Distributed Transactions https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions Regards Takayuki Tsunakawa
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > 2. 2PC processing is queued and serialized in one background worker. That > severely subdues transaction throughput. Each backend should perform > 2PC. > > Not sure it's safe that each backend performs PREPARE and COMMIT > PREPARED since the current design is intended to avoid an inconsistency > between the actual transaction result and the result the user sees. As Fujii-san is asking, I also would like to know what situation you think is not safe. Are you worried that the FDW's commit function might call ereport(ERROR | FATAL | PANIC)? If so, can't we stipulate that the FDW implementor should ensure that the commit function always returns control to the caller? > But in the future, I think we can have multiple background workers per > database for better performance. Does the database in "per database" mean the local database (that applications connect to), or the remote database accessed via FDW? I'm wondering how the FDW and background worker(s) can realize parallel prepare and parallel commit. That is, the coordinator transaction performs: 1. Issue prepare to all participant nodes, but doesn't wait for the reply for each issue. 2. Waits for replies from all participants. 3. Issue commit to all participant nodes, but doesn't wait for the reply for each issue. 4. Waits for replies from all participants. If we just consider PostgreSQL and don't think about FDW, we can use libpq async functions -- PQsendQuery, PQconsumeInput, and PQgetResult. pgbench uses them so that one thread can issue SQL statements on multiple connections in parallel. But when we consider the FDW interface, plus other DBMSs, how can we achieve the parallelism? > > 3. postgres_fdw cannot detect remote updates when the UDF executed on a > remote node updates data. > > I assume that you mean pushing the UDF down to a foreign server. 
> If so, I think we can do this by improving postgres_fdw. In the current patch, > registering and unregistering a foreign server to a group of 2PC and marking a > foreign server as updated is the FDW's responsibility. So perhaps if we had a way to > tell postgres_fdw that the UDF might update the data on the foreign server, > postgres_fdw could mark the foreign server as updated if the UDF is shippable. Maybe we can consider that VOLATILE functions update data. That may be an overreaction, though. Another idea is to add a new value to the ReadyForQuery message in the FE/BE protocol. Say, 'U' if in a transaction block that updated data. Here we consider "updated" as having allocated an XID. 52.7. Message Formats https://www.postgresql.org/docs/devel/protocol-message-formats.html -------------------------------------------------- ReadyForQuery (B) Byte1 Current backend transaction status indicator. Possible values are 'I' if idle (not in a transaction block); 'T' if in a transaction block; or 'E' if in a failed transaction block (queries will be rejected until block is ended). -------------------------------------------------- Regards Takayuki Tsunakawa
On Fri, 11 Sep 2020 at 11:58, Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2020/09/11 0:37, Masahiko Sawada wrote: > > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > >> > >> From: Amit Kapila <amit.kapila16@gmail.com> > >>> I intend to say that the global-visibility work can impact this in a > >>> major way and we have analyzed that to some extent during a discussion > >>> on the other thread. So, I think without having a complete > >>> design/solution that addresses both the 2PC and global-visibility, it > >>> is not apparent what is the right way to proceed. It seems to me that > >>> rather than working on individual (or smaller) parts one needs to come > >>> up with a bigger picture (or overall design) and then once we have > >>> figured that out correctly, it would be easier to decide which parts > >>> can go first. > >> > >> I'm really sorry I've been getting late and late and late to publish the revised scale-out design wiki to discuss the big picture! I don't know why I'm taking this long; I feel as if I were captive in a time prison (yes, nobody is holding me captive; I'm just late.) Please wait a few days. > >> > >> But to proceed with the development, let me comment on the atomic commit and global visibility. > >> > >> * We have to hear from Andrey about their check on the possibility that Clock-SI could be Microsoft's patent and if we can avoid it. > >> > >> * I have a feeling that we can adopt the algorithm used by Spanner, CockroachDB, and YugabyteDB. That is, 2PC for multi-node atomic commit, Paxos or Raft for replica synchronization (in the process of commit) to make 2PC more highly available, and the timestamp-based global visibility. However, the timestamp-based approach makes the database instance shut down when the node's clock is distant from the other nodes. 
> >> > >> * Or, maybe we can use the following Commitment ordering, which doesn't require the timestamp or any other information to be transferred among the cluster nodes. However, this seems to have to track the order of read and write operations among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance. The MVCO paper seems to present the information we need, but I haven't understood it well yet (it's difficult.) Could anybody kindly interpret this? > >> > >> Commitment ordering (CO) - yoavraz2 > >> https://sites.google.com/site/yoavraz2/the_principle_of_co > >> > >> > >> As for Sawada-san's 2PC patch, which I find interesting purely as an FDW enhancement, I raised the following issues to be addressed: > >> > >> 1. Make the FDW API implementable by other FDWs than postgres_fdw (this is what Amit-san kindly pointed out.) I think oracle_fdw and jdbc_fdw would be good examples to consider, while MySQL may not be good because it exposes the XA feature as SQL statements, not C functions as defined in the XA specification. > > > > I agree that we need to verify that the new FDW APIs will be suitable for other > > FDWs than postgres_fdw as well. > > > >> > >> 2. 2PC processing is queued and serialized in one background worker. That severely subdues transaction throughput.  Each backend should perform 2PC. > > > > Not sure it's safe that each backend performs PREPARE and COMMIT > > PREPARED since the current design is intended to avoid an inconsistency > > between the actual transaction result and the result the user sees. > > Can I check my understanding about why the resolver process is necessary? > > Firstly, you think that issuing the COMMIT PREPARED command to the foreign server can cause an error, for example, because of a connection error, OOM, etc. On the other hand, only waiting for another process to issue the command is less likely to cause an error. Right? 
> > If an error occurs in the backend process after the commit record is WAL-logged, the error would be reported to the client and it may misunderstand that the transaction failed even though the commit record was already flushed. So you think that each backend should not issue the COMMIT PREPARED command to avoid that inconsistency. To avoid that, it's better to make another process, the resolver, issue the command and just make each backend wait for that to complete. Right? > > Also using the resolver process has another merit; when there are unresolved foreign transactions but the corresponding backend exits, the resolver can try to resolve them. If something like this automatic resolution is necessary, a process like the resolver would be necessary. Right? > > To the contrary, if we don't need such automatic resolution (i.e., unresolved foreign transactions always need to be resolved manually) and we can prevent the code that issues the COMMIT PREPARED command from causing an error (not sure if that's possible, though...), probably we don't need the resolver process. Right? Yes, I'm on the same page about all the above explanations. The resolver process has two functionalities: resolving foreign transactions automatically when the user issues COMMIT (the case you described in the second paragraph), and resolving foreign transactions when the corresponding backend no longer exists or when the server crashes in the middle of 2PC (described in the third paragraph). Considering a design without the resolver process, I think we can easily replace the latter with manual resolution. OTOH, it's not easy for the former. I have no idea about a better design for now, although, as you described, if we could ensure that the process doesn't raise an error while resolving foreign transactions after committing the local transaction we would not need the resolver process. 
Or the second idea would be that the backend commits only the local transaction and then returns the acknowledgment of COMMIT to the user without resolving foreign transactions. Then the user manually resolves the foreign transactions by, for example, using the SQL function pg_resolve_foreign_xact() within a separate transaction. That way, even if an error occurred during resolving foreign transactions (e.g., executing COMMIT PREPARED), it’s okay as the user is already aware of the local transaction having been committed and can retry to resolve the unresolved foreign transaction. So we won't need the resolver process while avoiding such inconsistency. But a drawback would be that the transaction commit doesn't ensure that all foreign transactions are completed. The subsequent transactions would need to check if the previous distributed transaction is completed to see its results. I’m not sure it’s a good design in terms of usability. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 11 Sep 2020 at 18:24, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > On Tue, 8 Sep 2020 at 13:00, tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > 2. 2PC processing is queued and serialized in one background worker. That > > severely subdues transaction throughput. Each backend should perform > > 2PC. > > > > Not sure it's safe that each backend performs PREPARE and COMMIT > > PREPARED since the current design is intended to avoid an inconsistency > > between the actual transaction result and the result the user sees. > > As Fujii-san is asking, I also would like to know what situation you think is not safe. Are you worried that the FDW's commit function might call ereport(ERROR | FATAL | PANIC)? Yes. > If so, can't we stipulate that the FDW implementor should ensure that the commit function always returns control to the caller? How can the FDW implementor ensure that? Since even palloc could call ereport(ERROR), I guess it's hard to require that of all FDW implementors. > > > > But in the future, I think we can have multiple background workers per > > database for better performance. > > Does the database in "per database" mean the local database (that applications connect to), or the remote database accessed via FDW? I meant the local database. In the current patch, we launch the resolver process per local database. My idea is to allow launching multiple resolver processes for one local database as long as the number of workers doesn't exceed the limit. > > I'm wondering how the FDW and background worker(s) can realize parallel prepare and parallel commit. That is, the coordinator transaction performs: > > 1. Issue prepare to all participant nodes, but doesn't wait for the reply for each issue. > 2. Waits for replies from all participants. > 3. Issue commit to all participant nodes, but doesn't wait for the reply for each issue. > 4. 
Waits for replies from all participants. > > If we just consider PostgreSQL and don't think about FDW, we can use libpq async functions -- PQsendQuery, PQconsumeInput, and PQgetResult. pgbench uses them so that one thread can issue SQL statements on multiple connections in parallel. > > But when we consider the FDW interface, plus other DBMSs, how can we achieve the parallelism? It's still a rough idea, but I think we can use the TMASYNC flag and xa_complete explained in the XA specification. The core transaction manager calls the prepare, commit, and rollback APIs with the flag, requiring the operation to be executed asynchronously and a handle (e.g., a socket taken by PQsocket in the postgres_fdw case) to be returned to the transaction manager. Then the transaction manager continues polling the handle until it becomes readable and testing the completion using xa_complete() with no wait, until all foreign servers return OK on the xa_complete check. > > > > > 3. postgres_fdw cannot detect remote updates when the UDF executed on a > > remote node updates data. > > > > I assume that you mean pushing the UDF down to a foreign server. > > If so, I think we can do this by improving postgres_fdw. In the current patch, > > registering and unregistering a foreign server to a group of 2PC and marking a > > foreign server as updated is the FDW's responsibility. So perhaps if we had a way to > > tell postgres_fdw that the UDF might update the data on the foreign server, > > postgres_fdw could mark the foreign server as updated if the UDF is shippable. > > Maybe we can consider that VOLATILE functions update data. That may be an overreaction, though. Sorry, I don't understand that. Volatile functions are not pushed down to the foreign servers in the first place, no? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Sep 11, 2020 at 4:37 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > Considering the design without the resolver process, I think we can > easily replace the latter with the manual resolution. OTOH, it's not > easy for the former. I have no idea about better design for now, > although, as you described, if we could ensure that the process > doesn't raise an error during resolving foreign transactions after > committing the local transaction we would not need the resolver > process. My initial patch used the same backend to resolve foreign transactions. But in that case, even though the user receives COMMIT completed, the backend doesn't accept the next query while it is busy resolving the foreign transactions. That might be a usability issue again if attempting to resolve all foreign transactions takes noticeable time. If we go this route, we should try to resolve as many foreign transactions as possible, ignoring any errors while doing so, and somehow let the user know which transactions couldn't be resolved. The user can then take responsibility for resolving those. > > Or the second idea would be that the backend commits only the local > transaction then returns the acknowledgment of COMMIT to the user > without resolving foreign transactions. Then the user manually > resolves the foreign transactions by, for example, using the SQL > function pg_resolve_foreign_xact() within a separate transaction. That > way, even if an error occurred during resolving foreign transactions > (i.g., executing COMMIT PREPARED), it’s okay as the user is already > aware of the local transaction having been committed and can retry to > resolve the unresolved foreign transaction. So we won't need the > resolver process while avoiding such inconsistency. > > But a drawback would be that the transaction commit doesn't ensure > that all foreign transactions are completed. 
The subsequent > transactions would need to check if the previous distributed > transaction is completed to see its results. I’m not sure it’s a good > design in terms of usability. I agree, this won't be acceptable. In either case, I think a solution where the local server takes responsibility to resolve foreign transactions will be better even in the first cut. -- Best Wishes, Ashutosh Bapat
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > If so, can't we stipulate that the FDW implementor should ensure that the > commit function always returns control to the caller? > > How can the FDW implementor ensure that? Since even palloc could call > ereport(ERROR) I guess it's hard to require that to all FDW > implementors. I think what the FDW commit routine will do is just call xa_commit(), or PQexec("COMMIT PREPARED") in postgres_fdw. > It's still a rough idea but I think we can use TMASYNC flag and > xa_complete explained in the XA specification. The core transaction > manager call prepare, commit, rollback APIs with the flag, requiring > to execute the operation asynchronously and to return a handler (e.g., > a socket taken by PQsocket in postgres_fdw case) to the transaction > manager. Then the transaction manager continues polling the handler > until it becomes readable and testing the completion using by > xa_complete() with no wait, until all foreign servers return OK on > xa_complete check. Unfortunately, even Oracle and Db2 haven't supported XA asynchronous execution for years. Our DBMS Symfoware doesn't, either. I don't expect other DBMSs to support it. Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious scale-out DBMS. If we define asynchronous FDW routines for 2PC, postgres_fdw would be able to implement them by using libpq asynchronous functions. But other DBMSs can't ... > > Maybe we can consider VOLATILE functions update data. That may be > overreaction, though. > > Sorry I don't understand that. The volatile functions are not pushed > down to the foreign servers in the first place, no? Ah, you're right. Then, the choices are twofold: (1) trust users in that their functions don't update data, or trust the user's claim (specification) about it, and (2) get notification through the FE/BE protocol that the remote transaction may have updated data. Regards Takayuki Tsunakawa
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> The resolver process has two functionalities: resolving foreign
> transactions automatically when the user issues COMMIT (the case you
> described in the second paragraph), and resolving foreign transactions
> when the corresponding backend no longer exists or when the server
> crashes in the middle of 2PC (described in the third paragraph).
>
> Considering the design without the resolver process, I think we can
> easily replace the latter with manual resolution. OTOH, it's not
> easy for the former. I have no idea about a better design for now,
> although, as you described, if we could ensure that the process
> doesn't raise an error during resolving foreign transactions after
> committing the local transaction we would not need the resolver
> process.

Yeah, the resolver background process -- someone independent of client sessions -- is necessary, because the client session disappears at some point. When the server that hosts the 2PC coordinator crashes, there are no client sessions. Our DBMS Symfoware also runs background threads that take care of resolution of in-doubt transactions due to a server or network failure.

Then, how does the resolver get involved in 2PC to enable parallel 2PC? Two ideas quickly come to mind:

(1) Each client backend issues prepare and commit to multiple remote nodes asynchronously. If the communication fails during commit, the client backend leaves the commit notification task to the resolver. That is, the resolver lends a hand during failure recovery, and doesn't interfere with transaction processing during normal operation.

(2) The resolver takes some responsibility in 2PC processing during normal operation (sends prepare and/or commit to remote nodes and gets the results). To avoid serial execution per transaction, the resolver bundles multiple requests, sends them in bulk, and waits for multiple replies at once.
This allows the coordinator to do its own prepare processing in parallel with those of the participants. However, in Postgres, this requires context switches between the client backend and the resolver. Our Symfoware takes (2). However, it doesn't suffer from the context switch, because the server is multi-threaded and further implements or uses entities more lightweight than threads.

> Or the second idea would be that the backend commits only the local
> transaction then returns the acknowledgment of COMMIT to the user
> without resolving foreign transactions. Then the user manually
> resolves the foreign transactions by, for example, using the SQL
> function pg_resolve_foreign_xact() within a separate transaction. That
> way, even if an error occurred during resolving foreign transactions
> (i.e., executing COMMIT PREPARED), it's okay as the user is already
> aware of the local transaction having been committed and can retry to
> resolve the unresolved foreign transaction. So we won't need the
> resolver process while avoiding such inconsistency.
>
> But a drawback would be that the transaction commit doesn't ensure
> that all foreign transactions are completed. The subsequent
> transactions would need to check if the previous distributed
> transaction is completed to see its results. I'm not sure it's a good
> design in terms of usability.

I don't think it's a good design either, as you are worried. I guess that's why Postgres-XL had to create a tool called pgxc_clean and ask the user to resolve transactions with it.

pgxc_clean
https://www.postgres-xl.org/documentation/pgxcclean.html
"pgxc_clean is a Postgres-XL utility to maintain transaction status after a crash. When a Postgres-XL node crashes and recovers or fails over, the commit status of the node may be inconsistent with other nodes. pgxc_clean checks transaction commit status and corrects them."

Regards
Takayuki Tsunakawa
On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote: > Thank you for letting me know. I've attached the latest version patch set. A rebase is needed again as the CF bot is complaining. -- Michael
Attachment
On Thu, 17 Sep 2020 at 14:25, Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote: > > Thank you for letting me know. I've attached the latest version patch set. > > A rebase is needed again as the CF bot is complaining. Thank you for letting me know. I'm updating the patch and splitting into small pieces as Fujii-san suggested. I'll submit the latest patch set early next week. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 16 Sep 2020 at 13:20, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > > If so, can't we stipulate that the FDW implementor should ensure that the
> > > commit function always returns control to the caller?
> >
> > How can the FDW implementor ensure that? Since even palloc could call
> > ereport(ERROR) I guess it's hard to require that of all FDW
> > implementors.
>
> I think what the FDW commit routine will do is just call xa_commit(), or PQexec("COMMIT PREPARED") in postgres_fdw.

Yes, but it still seems hard to me to require all FDW implementations to commit/rollback prepared transactions without the possibility of ERROR.

> > It's still a rough idea but I think we can use the TMASYNC flag and
> > xa_complete explained in the XA specification. The core transaction
> > manager calls the prepare, commit, and rollback APIs with the flag,
> > requiring them to execute the operation asynchronously and to return a
> > handle (e.g., a socket taken by PQsocket in the postgres_fdw case) to
> > the transaction manager. Then the transaction manager continues polling
> > the handle until it becomes readable, testing the completion with
> > xa_complete() with no wait, until all foreign servers return OK on the
> > xa_complete check.
>
> Unfortunately, even Oracle and Db2 haven't supported XA asynchronous execution for years. Our DBMS Symfoware doesn't, either. I don't expect other DBMSs to support it.
>
> Hmm, I'm afraid this may be one of the FDW's intractable walls for a serious scale-out DBMS. If we define asynchronous FDW routines for 2PC, postgres_fdw would be able to implement them by using libpq asynchronous functions. But other DBMSs can't ...

I think it's not necessarily the case that all FDW implementations need to be able to support xa_complete(). We can support both synchronous and asynchronous executions of prepare/commit/rollback.

> > > Maybe we can consider that VOLATILE functions update data. That may be
> > > an overreaction, though.
> >
> > Sorry, I don't understand that. The volatile functions are not pushed
> > down to the foreign servers in the first place, no?
>
> Ah, you're right. Then the choices are twofold: (1) trust users that their functions don't update data, or trust the user's claim (specification) about it, and (2) get notification through the FE/BE protocol that the remote transaction may have updated data.

I'm confused about the point you're concerned about regarding the UDF. If you're concerned that executing a UDF by something like 'SELECT myfunc();' updates data on a foreign server, since the UDF should know which foreign server it modifies data on, it should be able to register the foreign server and mark it as modified. Or are you concerned that a UDF in a WHERE condition is pushed down and updates data (e.g., 'SELECT ... FROM foreign_tbl WHERE id = myfunc()')?

Regards,
--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> Yes, but it still seems hard to me to require all FDW
> implementations to commit/rollback prepared transactions without the
> possibility of ERROR.

Of course we can't eliminate the possibility of error, because remote servers require network communication. What I'm saying is to just require the FDW to return an error like xa_commit(), not to throw control away with ereport(ERROR). I don't think it's too strict.

> I think it's not necessarily that all FDW implementations need to be
> able to support xa_complete(). We can support both synchronous and
> asynchronous executions of prepare/commit/rollback.

Yes, I think parallel prepare and commit can be an option for FDWs. But I don't think it's an option for a serious scale-out DBMS. If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (even if not implemented in the first version) how the parallelism can be realized. That design is also necessary because it could affect the FDW API.

> If you're concerned that executing a UDF by something like 'SELECT
> myfunc();' updates data on a foreign server, since the UDF should know
> which foreign server it modifies data on, it should be able to register
> the foreign server and mark it as modified. Or are you concerned that a
> UDF in a WHERE condition is pushed down and updates data (e.g.,
> 'SELECT ... FROM foreign_tbl WHERE id = myfunc()')?

What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;" Does the UDF call get pushed down to the foreign server in this case? If not now, could it be pushed down in the future? If it could be, it's worth considering how to detect the remote update now.

Regards
Takayuki Tsunakawa
On Tue, Sep 22, 2020 at 6:48 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> > I think it's not necessarily that all FDW implementations need to be
> > able to support xa_complete(). We can support both synchronous and
> > asynchronous executions of prepare/commit/rollback.
>
> Yes, I think parallel prepare and commit can be an option for FDWs. But I don't think it's an option for a serious scale-out DBMS. If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (even if not implemented in the first version) how the parallelism can be realized. That design is also necessary because it could affect the FDW API.

Parallelism here has both pros and cons. If one of the servers errors out while preparing a transaction, there is no point in preparing the transaction on the other servers. In parallel execution we will prepare on multiple servers before realising that one of them has failed to do so. On the other hand, preparing on multiple servers in parallel provides a speed-up.

But this can be an improvement on version 1. The current approach doesn't render such an improvement impossible. So if that's something hard to do, we should do it in the next version rather than complicating this patch.

--
Best Wishes,
Ashutosh Bapat
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> parallelism here has both pros and cons. If one of the servers errors
> out while preparing for a transaction, there is no point in preparing
> the transaction on other servers. In parallel execution we will
> prepare on multiple servers before realising that one of them has
> failed to do so. On the other hand preparing on multiple servers in
> parallel provides a speed up.

And the pros are dominant in practice. If many transactions are erroring out (during prepare), the system is not functioning for the user. Such an application should be corrected before it is put into production.

> But this can be an improvement on version 1. The current approach
> doesn't render such an improvement impossible. So if that's something
> hard to do, we should do that in the next version rather than
> complicating this patch.

Could you share your idea on how the current approach could enable parallelism? This is an important point, because (1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation could be considered for non-parallel interaction if we have the realization of parallelism in mind. I think that kind of consideration is the design (for the future).

Regards
Takayuki Tsunakawa
On Wed, Sep 23, 2020 at 2:13 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > parallelism here has both pros and cons. If one of the servers errors
> > out while preparing for a transaction, there is no point in preparing
> > the transaction on other servers. In parallel execution we will
> > prepare on multiple servers before realising that one of them has
> > failed to do so. On the other hand preparing on multiple servers in
> > parallel provides a speed up.
>
> And the pros are dominant in practice. If many transactions are erroring out (during prepare), the system is not functioning for the user. Such an application should be corrected before it is put into production.
>
> > But this can be an improvement on version 1. The current approach
> > doesn't render such an improvement impossible. So if that's something
> > hard to do, we should do that in the next version rather than
> > complicating this patch.
>
> Could you share your idea on how the current approach could enable parallelism? This is an important point, because (1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation could be considered for non-parallel interaction if we have the realization of parallelism in mind. I think that kind of consideration is the design (for the future).
>

The way I am looking at it is to put the parallelism in the resolution worker and not in the FDW. If we use multiple resolution workers, they can fire commit/abort on multiple foreign servers at a time. But if we want parallelism within a single resolution worker, we will need separate FDW APIs for firing an asynchronous commit/abort of a prepared txn and for fetching the results, respectively. But given the variety of FDWs, not all of them will support an asynchronous API, so we have to support a synchronous API anyway, which is what can be targeted in the first version.
Thinking more about it, the core may support an API which accepts a list of prepared transactions, their foreign servers, and user mappings, and let the FDW resolve all of those either in parallel or one by one. So parallelism is the responsibility of the FDW and not the core. But then we lose parallelism across FDWs, which may not be a common case. Given the complications around this, I think we should go ahead supporting a synchronous API first and introduce an optional asynchronous API in a second version.

--
Best Wishes,
Ashutosh Bapat
On Tue, 22 Sep 2020 at 10:17, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > Yes, but it still seems hard to me to require all FDW
> > implementations to commit/rollback prepared transactions without the
> > possibility of ERROR.
>
> Of course we can't eliminate the possibility of error, because remote servers require network communication. What I'm saying is to just require the FDW to return an error like xa_commit(), not to throw control away with ereport(ERROR). I don't think it's too strict.

So with your idea, I think we require FDW developers not to call ereport(ERROR) as much as possible. If they need to use a function including palloc, lappend, etc. that could call ereport(ERROR), they need to use PG_TRY() and PG_CATCH() and return control along with the error message to the transaction manager rather than raising an error. Then the transaction manager will emit the error message at an error level lower than ERROR (e.g., WARNING), and call the commit/rollback API again. But normally we do some cleanup on error, whereas in this case the retry of commit/rollback is performed without any cleanup. Is that right? I'm not sure it's safe, though.

> > I think it's not necessarily that all FDW implementations need to be
> > able to support xa_complete(). We can support both synchronous and
> > asynchronous executions of prepare/commit/rollback.
>
> Yes, I think parallel prepare and commit can be an option for FDWs. But I don't think it's an option for a serious scale-out DBMS. If we want to use FDW as part of PostgreSQL's scale-out infrastructure, we should design (even if not implemented in the first version) how the parallelism can be realized. That design is also necessary because it could affect the FDW API.
> > If you're concerned that executing a UDF by something like 'SELECT
> > myfunc();' updates data on a foreign server, since the UDF should know
> > which foreign server it modifies data on, it should be able to register
> > the foreign server and mark it as modified. Or are you concerned that a
> > UDF in a WHERE condition is pushed down and updates data (e.g.,
> > 'SELECT ... FROM foreign_tbl WHERE id = myfunc()')?
>
> What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;" Does the UDF call get pushed down to the foreign server in this case? If not now, could it be pushed down in the future? If it could be, it's worth considering how to detect the remote update now.

IIUC aggregate functions can be pushed down to the foreign server, but I have no idea whether a normal UDF in the select list is pushed down. I wonder if it isn't.

Regards,
--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> So with your idea, I think we require FDW developers to not call
> ereport(ERROR) as much as possible. If they need to use a function
> including palloc, lappend etc that could call ereport(ERROR), they
> need to use PG_TRY() and PG_CATCH() and return the control along with
> the error message to the transaction manager rather than raising an
> error. Then the transaction manager will emit the error message at an
> error level lower than ERROR (e.g., WARNING), and call commit/rollback
> API again. But normally we do some cleanup on error but in this case
> the retrying commit/rollback is performed without any cleanup. Is that
> right? I'm not sure it's safe though.

Yes. It's legitimate to require the FDW commit routine to return control, because the prepare of 2PC is a promise to commit successfully. The second-phase commit should avoid doing anything that could fail. For example, if some memory is needed for commit, it should be allocated in prepare or before.

> IIUC aggregation functions can be pushed down to the foreign server
> but I have no idea whether a normal UDF in the select list is pushed down.
> I wonder if it isn't.

Oh, that's the current situation. Understood. I thought the UDF call was also pushed down, as I saw Greenplum does so. (Reading the manual, Greenplum disallows data updates in a UDF when it's executed on the remote segment server.) (Aren't we overlooking something else that updates data on the remote server while the local server is unaware?)

Regards
Takayuki Tsunakawa
On Fri, 18 Sep 2020 at 17:00, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 17 Sep 2020 at 14:25, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Aug 21, 2020 at 03:25:29PM +0900, Masahiko Sawada wrote:
> > > Thank you for letting me know. I've attached the latest version patch set.
> >
> > A rebase is needed again as the CF bot is complaining.
>
> Thank you for letting me know. I'm updating the patch and splitting
> into small pieces as Fujii-san suggested. I'll submit the latest patch
> set early next week.
>

I've rebased the patch set and split it into small pieces. Here are short descriptions of each change:

v26-0001-Recreate-RemoveForeignServerById.patch
This commit recreates RemoveForeignServerById, which was removed by b1d32d3e3. This is necessary because we need to check if there is a foreign transaction involved with the foreign server that is about to be removed.

v26-0002-Introduce-transaction-manager-for-foreign-transa.patch
This commit adds the basic foreign transaction manager and the CommitForeignTransaction and RollbackForeignTransaction APIs. These APIs support only one phase. With this change, an FDW is able to control its transactions using the foreign transaction manager instead of XactCallback.

v26-0003-postgres_fdw-supports-commit-and-rollback-APIs.patch
This commit implements both the CommitForeignTransaction and RollbackForeignTransaction APIs in postgres_fdw. Note that since PREPARE TRANSACTION is still not supported, there is nothing new the user is able to do yet.

v26-0004-Add-PrepareForeignTransaction-API.patch
This commit adds prepared foreign transaction support, including WAL logging and recovery, and the PrepareForeignTransaction API. With this change, the user is able to run the 'PREPARE TRANSACTION' and 'COMMIT/ROLLBACK PREPARED' commands on a transaction that involves foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the local transaction. It doesn't do anything for foreign transactions.
Therefore, the user needs to resolve foreign transactions manually by executing the pg_resolve_foreign_xacts() SQL function, which is also introduced by this commit.

v26-0005-postgres_fdw-supports-prepare-API-and-support-co.patch
This commit implements the PrepareForeignTransaction API and makes CommitForeignTransaction and RollbackForeignTransaction support two-phase commit.

v26-0006-Add-GetPrepareID-API.patch
This commit adds the GetPrepareID API.

v26-0007-Automatic-foreign-transaciton-resolution-on-COMM.patch
This commit adds automatic foreign transaction resolution on COMMIT/ROLLBACK PREPARED using the foreign transaction resolver and launcher processes. With this change, the user is able to commit/rollback a distributed transaction by COMMIT/ROLLBACK PREPARED without manual resolution. The involved foreign transactions are automatically resolved by a resolver process.

v26-0008-Automatic-foreign-transaciton-resolution-on-comm.patch
This commit adds automatic foreign transaction resolution on commit/rollback. With this change, the user is able to commit the foreign transactions automatically on commit, without executing PREPARE TRANSACTION, when foreign_twophase_commit is 'required'. IOW, we can guarantee that all foreign transactions have been resolved when the user gets an acknowledgment of COMMIT.

v26-0009-postgres_fdw-supports-automatically-resolution.patch
This commit makes postgres_fdw support the 0008 change.

v26-0010-Documentation-update.patch
v26-0011-Add-regression-tests-for-foreign-twophase-commit.patch
The above commits are documentation updates and regression tests.

Regards,
--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
- v26-0008-Automatic-foreign-transaciton-resolution-on-comm.patch
- v26-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v26-0009-postgres_fdw-supports-automatically-resolution.patch
- v26-0010-Documentation-update.patch
- v26-0007-Automatic-foreign-transaciton-resolution-on-COMM.patch
- v26-0006-Add-GetPrepareID-API.patch
- v26-0005-postgres_fdw-supports-prepare-API-and-support-co.patch
- v26-0003-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v26-0004-Add-PrepareForeignTransaction-API.patch
- v26-0002-Introduce-transaction-manager-for-foreign-transa.patch
- v26-0001-Recreate-RemoveForeignServerById.patch
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> The way I am looking at is to put the parallelism in the resolution
> worker and not in the FDW. If we use multiple resolution workers, they
> can fire commit/abort on multiple foreign servers at a time.

From a single session's view, yes. However, the requests from multiple sessions are processed one at a time within each resolver, because the resolver has to call the synchronous FDW prepare/commit routines and wait for the response from the remote server. That's too limiting.

> But if we want parallelism within a single resolution worker, we will
> need a separate FDW APIs for firing asynchronous commit/abort prepared
> txn and fetching their results resp. But given the variety of FDWs,
> not all of them will support asynchronous API, so we have to support
> synchronous API anyway, which is what can be targeted in the first
> version.

I agree in that most FDWs will be unlikely to have asynchronous prepare/commit functions, as demonstrated by the fact that even Oracle and Db2 don't implement the XA asynchronous API. That's one problem of using FDW for Postgres scale-out. When we enhance FDW, we have to take care of other DBMSs to make the FDW interface practical. OTOH, we want to make maximum use of Postgres features, such as the libpq asynchronous API, to make Postgres scale-out as performant as possible. But the scale-out design is bound by the FDW interface. I don't feel accepting such a less performant design is the attitude of this community, as people here are strict against even a 1 or 2 percent performance drop.

> Thinking more about it, the core may support an API which accepts a
> list of prepared transactions, their foreign servers and user mappings
> and let FDW resolve all those either in parallel or one by one. So
> parallelism is responsibility of FDW and not the core. But then we
> lose parallelism across FDWs, which may not be a common case.
Hmm, I understand the asynchronous FDW relation scan is being developed now, in the form of cooperation between the FDW and the executor. If we make just the FDW responsible for prepare/commit parallelism, the design becomes asymmetric. As you say, I'm not sure the parallelism is wanted among different types, say, Postgres and Oracle. In fact, major DBMSs don't implement the XA asynchronous API. But such a lack of parallelism may be one cause of the bad reputation that 2PC (of XA) is slow.

> Given the complications around this, I think we should go ahead
> supporting synchronous API first and in second version introduce
> optional asynchronous API.

How about the following?

* Add synchronous and asynchronous versions of the prepare/commit/abort routines, plus a routine to wait for completion of asynchronous execution, to FdwRoutine. They are optional. postgres_fdw can implement the asynchronous routines using libpq asynchronous functions. Other DBMSs can implement them with the XA asynchronous API in theory.

* The client backend uses the asynchronous FDW routines if available:

  /* Issue asynchronous prepare | commit | rollback to FDWs that support it */
  foreach (per each foreign server used in the transaction)
  {
      if (fdwroutine->{prepare | commit | rollback}_async_func)
          fdwroutine->{prepare | commit | rollback}_async_func(...);
  }

  /* Wait for completion of asynchronous prepare | commit | rollback */
  foreach (per each foreign server used in the transaction)
  {
      if (fdwroutine->{prepare | commit | rollback}_async_func)
          ret = fdwroutine->wait_for_completion(...);
  }

  /* Issue synchronous prepare | commit | rollback to FDWs that don't support it */
  foreach (per each foreign server used in the transaction)
  {
      if (fdwroutine->{prepare | commit | rollback}_async_func == NULL)
          ret = fdwroutine->{prepare | commit | rollback}_func(...);
  }

* The client backend asks the resolver to commit or rollback the remote transaction only when the remote transaction fails (due to the failure of the remote server or network).
That is, the resolver is not involved during normal operation. This will not be complex, and can be included in the first version, if we really want to use FDW for Postgres scale-out.

Regards
Takayuki Tsunakawa
On Thu, 24 Sep 2020 at 17:23, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> > So with your idea, I think we require FDW developers to not call
> > ereport(ERROR) as much as possible. If they need to use a function
> > including palloc, lappend etc that could call ereport(ERROR), they
> > need to use PG_TRY() and PG_CATCH() and return the control along with
> > the error message to the transaction manager rather than raising an
> > error. Then the transaction manager will emit the error message at an
> > error level lower than ERROR (e.g., WARNING), and call commit/rollback
> > API again. But normally we do some cleanup on error but in this case
> > the retrying commit/rollback is performed without any cleanup. Is that
> > right? I'm not sure it's safe though.
>
> Yes. It's legitimate to require the FDW commit routine to return control, because the prepare of 2PC is a promise to commit successfully. The second-phase commit should avoid doing anything that could fail. For example, if some memory is needed for commit, it should be allocated in prepare or before.

I don't think it's always possible to avoid raising errors in advance. Considering how postgres_fdw could implement your idea, I think postgres_fdw would need PG_TRY() and PG_CATCH() for its connection management. It has a connection cache in local memory using HTAB. It needs to create an entry the first time it connects (e.g., when PREPARE TRANSACTION and COMMIT PREPARED are performed by different processes), and it needs to re-connect to the foreign server when the entry is invalidated. In both cases, ERROR could happen. I guess the same is true for other FDW implementations. Possibly other FDWs might need more work, for example cleanup or releasing resources.
I think the pros of your idea are to make the transaction manager simple, since we don't need the resolvers and launcher, but the cons are to bring the complexity to the FDW implementation code instead. Also, IMHO I don't think it's a safe way that the FDW neither re-throws an error nor aborts the transaction when an error occurs.

In terms of the performance you're concerned about, I wonder if we can somewhat eliminate the bottleneck if multiple resolvers are able to run on one database in the future. For example, if we could launch as many resolver processes as connections on the database, each backend process could have one resolver process. Since there would be contention and inter-process communication it still brings some overhead, but it might be negligible compared to the network round trip.

Perhaps we can hear more opinions on that from other hackers to decide the FDW transaction API design.

Regards,
--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
> I don't think it's always possible to avoid raising errors in advance.
> Considering how postgres_fdw can implement your idea, I think
> postgres_fdw would need PG_TRY() and PG_CATCH() for its connection
> management. It has a connection cache in the local memory using HTAB.
> It needs to create an entry for the first time to connect (e.g., when
> prepare and commit prepared a transaction are performed by different
> processes) and it needs to re-connect the foreign server when the
> entry is invalidated. In both cases, ERROR could happen. I guess the
> same is true for other FDW implementations. Possibly other FDWs might
> need more work for example cleanup or releasing resources.

Why does the client backend have to create a new connection cache entry during PREPARE or COMMIT PREPARED? Doesn't the client backend naturally continue to use the connections that it has used in its current transaction?

> I think
> that the pros of your idea are to make the transaction manager simple
> since we don't need resolvers and launcher but the cons are to bring
> the complexity to FDW implementation codes instead. Also, IMHO I don't
> think it's safe way that FDW does neither re-throwing an error nor
> abort transaction when an error occurs.

No, I didn't say the resolver is unnecessary. The resolver takes care of terminating remote transactions when the client backend encountered an error during COMMIT/ROLLBACK PREPARED.

> In terms of performance you're concerned, I wonder if we can somewhat
> eliminate the bottleneck if multiple resolvers are able to run on one
> database in the future. For example, if we could launch resolver
> processes as many as connections on the database, individual backend
> processes could have one resolver process. Since there would be
> contention and inter-process communication it still brings some
> overhead but it might be negligible comparing to network round trip.
Do you mean that if concurrent 200 clients each update data on two foreign servers, there are 400 resolvers? ...That's overuse of resources. Regards Takayuki Tsunakawa
On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > I don't think it's always possible to avoid raising errors in advance. > > Considering how postgres_fdw can implement your idea, I think > > postgres_fdw would need PG_TRY() and PG_CATCH() for its connection > > management. It has a connection cache in the local memory using HTAB. > > It needs to create an entry for the first time to connect (e.g., when > > prepare and commit prepared a transaction are performed by different > > processes) and it needs to re-connect the foreign server when the > > entry is invalidated. In both cases, ERROR could happen. I guess the > > same is true for other FDW implementations. Possibly other FDWs might > > need more work for example cleanup or releasing resources. I think > > Why does the client backend have to create a new connection cache entry during PREPARE or COMMIT PREPARED? Doesn't the client backend naturally continue to use connections that it has used in its current transaction? I think there are two cases: a process executes PREPARE TRANSACTION and another process executes COMMIT PREPARED later, and if the coordinator has cascaded foreign servers (i.e., a foreign server has its foreign server) and temporary connection problem happens in the intermediate node after PREPARE then another process on the intermediate node will execute COMMIT PREPARED on its foreign server. > > > > that the pros of your idea are to make the transaction manager simple > > since we don't need resolvers and launcher but the cons are to bring > > the complexity to FDW implementation codes instead. Also, IMHO I don't > > think it's safe way that FDW does neither re-throwing an error nor > > abort transaction when an error occurs. > > No, I didn't say the resolver is unnecessary.
The resolver takes care of terminating remote transactions when the client backend encountered an error during COMMIT/ROLLBACK PREPARED. Understood. With your idea, we can remove at least the code of making backend wait and inter-process communication between backends and resolvers. I think we need to consider whether it's really safe and what is needed to achieve your idea safely. > > > > In terms of performance you're concerned, I wonder if we can somewhat > > eliminate the bottleneck if multiple resolvers are able to run on one > > database in the future. For example, if we could launch resolver > > processes as many as connections on the database, individual backend > > processes could have one resolver process. Since there would be > > contention and inter-process communication it still brings some > > overhead but it might be negligible comparing to network round trip. > > Do you mean that if concurrent 200 clients each update data on two foreign servers, there are 400 resolvers? ...That's overuse of resources. I think we have 200 resolvers in this case since there is one resolver process per backend process. Or another idea is that all processes queue foreign transactions to resolve into the shared memory queue and resolver processes fetch and resolve them instead of assigning one distributed transaction to one resolver process. Using asynchronous execution, the resolver process can process a bunch of foreign transactions across distributed transactions and grouped by the foreign server at once. It might be more complex than the current approach but having multiple resolver processes on one database would increase throughput well especially by combining with asynchronous execution. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > Why does the client backend have to create a new connection cache entry > during PREPARE or COMMIT PREPARED? Doesn't the client backend naturally > continue to use connections that it has used in its current transaction? > > I think there are two cases: a process executes PREPARE TRANSACTION > and another process executes COMMIT PREPARED later, and if the > coordinator has cascaded foreign servers (i.e., a foreign server has > its foreign server) and temporary connection problem happens in the > intermediate node after PREPARE then another process on the > intermediate node will execute COMMIT PREPARED on its foreign server. Aren't both the cases failure cases, and thus handled by the resolver? > > > In terms of performance you're concerned, I wonder if we can somewhat > > > eliminate the bottleneck if multiple resolvers are able to run on one > > > database in the future. For example, if we could launch resolver > > > processes as many as connections on the database, individual backend > > > processes could have one resolver process. Since there would be > > > contention and inter-process communication it still brings some > > > overhead but it might be negligible comparing to network round trip. > > > > Do you mean that if concurrent 200 clients each update data on two foreign > servers, there are 400 resolvers? ...That's overuse of resources. > > I think we have 200 resolvers in this case since one resolver process > per backend process. That does not parallelize prepare or commit for a single client, as each resolver can process only one prepare or commit synchronously at a time. Not to mention the resource usage is high.
> Or another idea is that all processes queue > foreign transactions to resolve into the shared memory queue and > resolver processes fetch and resolve them instead of assigning one > distributed transaction to one resolver process. Using asynchronous > execution, the resolver process can process a bunch of foreign > transactions across distributed transactions and grouped by the > foreign server at once. It might be more complex than the current > approach but having multiple resolver processes on one database would > increase throughput well especially by combining with asynchronous > execution. Yeah, that sounds complex. It's simpler and natural for each client backend to use the connections it has used in its current transaction and issue prepare and commit to the foreign servers, and the resolver just takes care of failed commits and aborts behind the scenes. That's like how the walwriter takes care of writing WAL on behalf of a client backend that commits asynchronously. Regards Takayuki Tsunakawa
On Mon, 28 Sep 2020 at 13:58, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > On Fri, 25 Sep 2020 at 18:21, tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > Why does the client backend have to create a new connection cache entry > > during PREPARE or COMMIT PREPARED? Doesn't the client backend naturally > > continue to use connections that it has used in its current transaction? > > > > I think there are two cases: a process executes PREPARE TRANSACTION > > and another process executes COMMIT PREPARED later, and if the > > coordinator has cascaded foreign servers (i.e., a foreign server has > > its foreign server) and temporary connection problem happens in the > > intermediate node after PREPARE then another process on the > > intermediate node will execute COMMIT PREPARED on its foreign server. > > Aren't both the cases failure cases, and thus handled by the resolver? No. Please imagine a case where a user executes PREPARE TRANSACTION on the transaction that modified data on foreign servers. The backend process prepares both the local transaction and foreign transactions. But another client can execute COMMIT PREPARED on the prepared transaction. In this case, another backend newly connects foreign servers and commits prepared foreign transactions. Therefore, the new connection cache entry can be created during COMMIT PREPARED which could lead to an error but since the local prepared transaction is already committed the backend must not fail with an error. In the latter case, I assume that the backend continues to retry foreign transaction resolution until the user requests cancellation. Please imagine the case where the server-A connects a foreign server (say, server-B) and server-B connects another foreign server (say, server-C).
The transaction initiated on server-A modified the data on both local and server-B which further modified the data on server-C and executed COMMIT. The backend process on server-A (say, backend-A) sends PREPARE TRANSACTION to server-B then the backend process on server-B (say, backend-B) connected by backend-A prepares the local transaction and further sends PREPARE TRANSACTION to server-C. Let’s suppose a temporary connection failure happens between server-A and server-B before the backend-A sending COMMIT PREPARED (i.e., 2nd phase of 2PC). When the backend-A attempts to send COMMIT PREPARED to server-B it realizes that the connection to server-B was lost but since the user doesn’t request cancellation yet the backend-A retries to connect server-B and succeeds. Since now that the backend-A established a new connection to server-B, there is another backend process on server-B (say, backend-B’). Since the backend-B’ doesn’t have a connection to server-C yet, it creates new connection cache entry, which could lead to an error. IOW, on server-B different processes performed PREPARE TRANSACTION and COMMIT PREPARED and the latter process created a connection cache entry.
> > That does not parallelize prepare or commit for a single client, as each resolver can process only one prepare or commit synchronously at a time. Not to mention the resource usage is high. Well, I think we should discuss parallel (and/or asynchronous) execution of prepare and commit separated from the discussion on whether the resolver process is responsible for 2nd phase of 2PC. I've been suggesting that the first phase and the second phase of 2PC should be performed by different processes in terms of safety. And having multiple resolvers on one database is my suggestion in response to the concern you raised that one resolver process on one database can be bottleneck. Both parallel execution and asynchronous execution are slightly related to this topic but I think it should be discussed separately. Regarding parallel and asynchronous execution, I basically agree on supporting asynchronous execution as the XA specification also has, although I think it's better not to include it in the first version for simplicity. Overall, my suggestion for the first version is to support synchronous execution of prepare, commit, and rollback, have one resolver process per database, and have resolver take 2nd phase of 2PC. As the next step we can add APIs for asynchronous execution, have multiple resolvers on one database and so on. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > No. Please imagine a case where a user executes PREPARE TRANSACTION on > the transaction that modified data on foreign servers. The backend > process prepares both the local transaction and foreign transactions. > But another client can execute COMMIT PREPARED on the prepared > transaction. In this case, another backend newly connects foreign > servers and commits prepared foreign transactions. Therefore, the new > connection cache entry can be created during COMMIT PREPARED which > could lead to an error but since the local prepared transaction is > already committed the backend must not fail with an error. > > In the latter case, I assume that the backend continues to retry > foreign transaction resolution until the user requests cancellation. > Please imagine the case where the server-A connects a foreign server > (say, server-B) and server-B connects another foreign server (say, > server-C). The transaction initiated on server-A modified the data on > both local and server-B which further modified the data on server-C > and executed COMMIT. The backend process on server-A (say, backend-A) > sends PREPARE TRANSACTION to server-B then the backend process on > server-B (say, backend-B) connected by backend-A prepares the local > transaction and further sends PREPARE TRANSACTION to server-C. Let’s > suppose a temporary connection failure happens between server-A and > server-B before the backend-A sending COMMIT PREPARED (i.e., 2nd phase > of 2PC). When the backend-A attempts to send COMMIT PREPARED to > server-B it realizes that the connection to server-B was lost but > since the user doesn’t request cancellation yet the backend-A retries > to connect server-B and succeeds. Since now that the backend-A > established a new connection to server-B, there is another backend > process on server-B (say, backend-B’).
Since the backend-B’ doesn’t > have a connection to server-C yet, it creates new connection cache > entry, which could lead to an error. IOW, on server-B different > processes performed PREPARE TRANSACTION and COMMIT PREPARED and > the > latter process created a connection cache entry. Thank you, I understood the situation. I don't think it's a good design to not address practical performance during normal operation by fearing the rare error case. The transaction manager (TM) or the FDW implementor can naturally do things like the following: * Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller. * Use PG_TRY(), as its overhead is relatively negligible compared to connection establishment. * If the commit fails, the TM asks the resolver to take care of committing the remote transaction, and returns success to the user. > Regarding parallel and asynchronous execution, I basically agree on > supporting asynchronous execution as the XA specification also has, > although I think it's better not to include it in the first version > for simplicity. > > Overall, my suggestion for the first version is to support synchronous > execution of prepare, commit, and rollback, have one resolver process > per database, and have resolver take 2nd phase of 2PC. As the next > step we can add APIs for asynchronous execution, have multiple > resolvers on one database and so on. We don't have to rush to commit a patch that is likely to exhibit non-practical performance, as we still have much time left for PG 14. The design needs to be thought through more for the ideal goal and refined. By making efforts to sort through the ideal design, we may be able to avoid rework and API inconsistency. As for the API, we haven't validated yet that the FDW implementor can use XA, have we? Regards Takayuki Tsunakawa
On Tue, 29 Sep 2020 at 11:37, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > No. Please imagine a case where a user executes PREPARE TRANSACTION on > > the transaction that modified data on foreign servers. The backend > > process prepares both the local transaction and foreign transactions. > > But another client can execute COMMIT PREPARED on the prepared > > transaction. In this case, another backend newly connects foreign > > servers and commits prepared foreign transactions. Therefore, the new > > connection cache entry can be created during COMMIT PREPARED which > > could lead to an error but since the local prepared transaction is > > already committed the backend must not fail with an error. > > > > In the latter case, I assume that the backend continues to retry > > foreign transaction resolution until the user requests cancellation. > > Please imagine the case where the server-A connects a foreign server > > (say, server-B) and server-B connects another foreign server (say, > > server-C). The transaction initiated on server-A modified the data on > > both local and server-B which further modified the data on server-C > > and executed COMMIT. The backend process on server-A (say, backend-A) > > sends PREPARE TRANSACTION to server-B then the backend process on > > server-B (say, backend-B) connected by backend-A prepares the local > > transaction and further sends PREPARE TRANSACTION to server-C. Let’s > > suppose a temporary connection failure happens between server-A and > > server-B before the backend-A sending COMMIT PREPARED (i.e., 2nd phase > > of 2PC). When the backend-A attempts to send COMMIT PREPARED to > > server-B it realizes that the connection to server-B was lost but > > since the user doesn’t request cancellation yet the backend-A retries > > to connect server-B and succeeds.
Since now that the backend-A > > established a new connection to server-B, there is another backend > > process on server-B (say, backend-B’). Since the backend-B’ doesn’t > > have a connection to server-C yet, it creates new connection cache > > entry, which could lead to an error. IOW, on server-B different > > processes performed PREPARE TRANSACTION and COMMIT PREPARED and > > the > > latter process created a connection cache entry. > > Thank you, I understood the situation. I don't think it's a good design to not address practical performance during normal operation by fearing the rare error case. > > The transaction manager (TM) or the FDW implementor can naturally do things like the following: > > * Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller. > > * Use PG_TRY(), as its overhead is relatively negligible compared to connection establishment. I suppose you mean that the FDW implementor uses PG_TRY() to catch an error but not do PG_RE_THROW(). I'm concerned about whether it's safe to return control to the caller and continue trying to resolve foreign transactions without either rethrowing the error or aborting the transaction. IMHO, something like "high performance but doesn't work correctly in a rare failure case" is rather a bad design, especially for the transaction management feature. > > * If the commit fails, the TM asks the resolver to take care of committing the remote transaction, and returns success to the user. > > > > Regarding parallel and asynchronous execution, I basically agree on > > supporting asynchronous execution as the XA specification also has, > > although I think it's better not to include it in the first version > > for simplicity. > > > > Overall, my suggestion for the first version is to support synchronous > > execution of prepare, commit, and rollback, have one resolver process > > per database, and have resolver take 2nd phase of 2PC.
As the next > > step we can add APIs for asynchronous execution, have multiple > > resolvers on one database and so on. > > We don't have to rush to commit a patch that is likely to exhibit non-practical performance, as we still have much time left for PG 14. The design needs to be thought through more for the ideal goal and refined. By making efforts to sort through the ideal design, we may be able to avoid rework and API inconsistency. As for the API, we haven't validated yet that the FDW implementor can use XA, have we? Yes, we still need to check if FDW implementors other than postgres_fdw are able to support these APIs. I agree that we need more discussion on the design. My suggestion is to start with a small, simple feature as the first step and not try to include everything in the first version. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 29 Sep 2020 at 15:03, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Tue, 29 Sep 2020 at 11:37, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > No. Please imagine a case where a user executes PREPARE TRANSACTION on > > > the transaction that modified data on foreign servers. The backend > > > process prepares both the local transaction and foreign transactions. > > > But another client can execute COMMIT PREPARED on the prepared > > > transaction. In this case, another backend newly connects foreign > > > servers and commits prepared foreign transactions. Therefore, the new > > > connection cache entry can be created during COMMIT PREPARED which > > > could lead to an error but since the local prepared transaction is > > > already committed the backend must not fail with an error. > > > > > > In the latter case, I assume that the backend continues to retry > > > foreign transaction resolution until the user requests cancellation. > > > Please imagine the case where the server-A connects a foreign server > > > (say, server-B) and server-B connects another foreign server (say, > > > server-C). The transaction initiated on server-A modified the data on > > > both local and server-B which further modified the data on server-C > > > and executed COMMIT. The backend process on server-A (say, backend-A) > > > sends PREPARE TRANSACTION to server-B then the backend process on > > > server-B (say, backend-B) connected by backend-A prepares the local > > > transaction and further sends PREPARE TRANSACTION to server-C. Let’s > > > suppose a temporary connection failure happens between server-A and > > > server-B before the backend-A sending COMMIT PREPARED (i.e., 2nd phase > > > of 2PC).
When the backend-A attempts to send COMMIT PREPARED to > > > server-B it realizes that the connection to server-B was lost but > > > since the user doesn’t request cancellation yet the backend-A retries > > > to connect server-B and succeeds. Since now that the backend-A > > > established a new connection to server-B, there is another backend > > > process on server-B (say, backend-B’). Since the backend-B’ doesn’t > > > have a connection to server-C yet, it creates new connection cache > > > entry, which could lead to an error. IOW, on server-B different > > > processes performed PREPARE TRANSACTION and COMMIT PREPARED and > > > the > > > latter process created a connection cache entry. > > > > Thank you, I understood the situation. I don't think it's a good design to not address practical performance during normal operation by fearing the rare error case. > > > > The transaction manager (TM) or the FDW implementor can naturally do things like the following: > > > > * Use palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) to return control to the caller. > > > > * Use PG_TRY(), as its overhead is relatively negligible compared to connection establishment. > > I suppose you mean that the FDW implementor uses PG_TRY() to catch an > error but not do PG_RE_THROW(). I'm concerned about whether it's safe to return > control to the caller and continue trying to resolve foreign > transactions without either rethrowing the error or aborting the > transaction. > > IMHO, something like "high performance but > doesn't work correctly in a rare failure case" is rather a bad design, especially for the > transaction management feature. To avoid misunderstanding, I didn't mean to disregard the performance. I mean especially for the transaction management feature it's essential to work fine even in failure cases.
So I hope we have a safe, robust, and probably simple design for the first version that might have low performance for now but has the potential for performance improvement, which we can pursue later. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > To avoid misunderstanding, I didn't mean to disregard the performance. > I mean especially for the transaction management feature it's > essential to work fine even in failure cases. So I hope we have a > safe, robust, and probably simple design for the first version that > might have low performance for now but has the potential for > performance improvement, which we can > pursue later. Yes, correctness (safety?) is a basic premise. I understand that given the time left for PG 14, we haven't yet given up on a sound design that offers practical or normally expected performance. I don't think the design has been thought through well enough yet to tell whether it's simple or complex. At least, I don't believe doing the "send commit request, perform commit on a remote server, and wait for a reply" sequence one transaction at a time in turn is what this community (and other DBMSs) tolerate. A kid's tricycle is safe, but it's not safe to ride a tricycle on the road. Let's not rush to commit and do our best! Regards Takayuki Tsunakawa
On Wed, 30 Sep 2020 at 16:02, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > To avoid misunderstanding, I didn't mean to disregard the performance. > > I mean especially for the transaction management feature it's > > essential to work fine even in failure cases. So I hope we have a > > safe, robust, and probably simple design for the first version that > > might have low performance for now but has the potential for > > performance improvement, which we can > > pursue later. > > Yes, correctness (safety?) is a basic premise. I understand that given the time left for PG 14, we haven't yet given up on a sound design that offers practical or normally expected performance. I don't think the design has been thought through well enough yet to tell whether it's simple or complex. At least, I don't believe doing the "send commit request, perform commit on a remote server, and wait for a reply" sequence one transaction at a time in turn is what this community (and other DBMSs) tolerate. A kid's tricycle is safe, but it's not safe to ride a tricycle on the road. Let's not rush to commit and do our best! Okay. I'd like to resolve my concern that I repeatedly mentioned and we haven't found a good solution for yet. That is, how we handle errors raised by FDW transaction callbacks during committing/rolling back prepared foreign transactions. Actually, this has already been discussed before[1] and we concluded at that time that using a background worker to commit/roll back foreign prepared transactions is the best way. Anyway, let me summarize the discussion on this issue so far. With your idea, after the local commit, the backend process directly calls the transaction FDW API to commit the foreign prepared transactions. However, an error (i.e., ereport(ERROR)) is likely to happen during that due to various reasons. It could be an OOM during memory allocation, a connection error, or whatever.
In case an error happens during committing prepared foreign transactions, the user will get the error but it's too late. The local transaction and possibly other foreign prepared transactions have already been committed. You proposed the first idea to avoid such a situation that FDW implementor can write the code while trying to reduce the possibility of errors happening as much as possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive solution. They might miss, not know it, or use other functions provided by the core that could lead to an error. Another idea is to use PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches an error but ignores it rather than rethrowing by PG_RE_THROW() in order to return the control to the core after an error. I’m really not sure it’s a correct usage of those macros. In addition, after returning to the core, it will retry to resolve the same or other foreign transactions. That is, after ignoring an error, the core needs to continue working and possibly call transaction callbacks of other FDW implementations. Regards, [1] https://www.postgresql.org/message-id/CA%2BTgmoY%3DVkHrzXD%3Djw5DA%2BPp-ePW_6_v5n%2BTJk40s5Q9VXY-Pw%40mail.gmail.com -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > You proposed the first idea > to avoid such a situation that FDW implementor can write the code > while trying to reduce the possibility of errors happening as much as > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive > solution. They might miss, not know it, or use other functions > provided by the core that could lead to an error. We can give the guideline in the manual, can't we? It should not be especially difficult for the FDW implementor compared to other Postgres extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc. And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they? Because most of the commit processing is performed in the resource manager's library (e.g. Oracle and MySQL client library.) (Before that, the developer of server-side modules is not given any information on what functions (like palloc) are available in the manual, is he?) > Another idea is to use > PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches > an error but ignores it rather than rethrowing by PG_RE_THROW() in > order to return the control to the core after an error. I’m really not > sure it’s a correct usage of those macros. In addition, after > returning to the core, it will retry to resolve the same or other > foreign transactions. That is, after ignoring an error, the core needs > to continue working and possibly call transaction callbacks of other > FDW implementations. No, not ignore the error. The FDW can emit a WARNING, LOG, or NOTICE message, and return an error code to TM. TM can also emit a message like: WARNING: failed to commit part of a transaction on the foreign server 'XXX' HINT: The server continues to try committing the remote transaction.
Then TM asks the resolver to take care of committing the remote transaction, and acknowledge the commit success to the client. The relevant return codes of xa_commit() are: -------------------------------------------------- [XAER_RMERR] An error occurred in committing the work performed on behalf of the transaction branch and the branch’s work has been rolled back. Note that returning this error signals a catastrophic event to a transaction manager since other resource managers may successfully commit their work on behalf of this branch. This error should be returned only when a resource manager concludes that it can never commit the branch and that it cannot hold the branch’s resources in a prepared state. Otherwise, [XA_RETRY] should be returned. [XAER_RMFAIL] An error occurred that makes the resource manager unavailable. -------------------------------------------------- Regards Takayuki Tsunakawa
On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > You proposed the first idea > > to avoid such a situation that FDW implementor can write the code > > while trying to reduce the possibility of errors happening as much as > > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and > > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive > > solution. They might miss, not know it, or use other functions > > provided by the core that could lead to an error. > > We can give the guideline in the manual, can't we? It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc. And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they? Because most of the commit processing is performed in the resource manager's library (e.g. the Oracle and MySQL client libraries.) Yeah, if we think FDW implementors properly implement these APIs while following the guideline, giving the guideline is a good idea. But I’m not sure all FDW implementors are able to do that and even if the user uses an FDW whose transaction APIs don’t follow the guideline, the user won’t realize it. IMO it’s better to design the feature while not depending on external programs for reliability (correctness?) of this feature, although I might be too worried. > > > > Another idea is to use > > PG_TRY() and PG_CATCH(). IIUC with this idea, FDW implementor catches > > an error but ignores it rather than rethrowing by PG_RE_THROW() in > > order to return the control to the core after an error. I’m really not > > sure it’s a correct usage of those macros.
> > In addition, after > > returning to the core, it will retry to resolve the same or other > > foreign transactions. That is, after ignoring an error, the core needs > > to continue working and possibly call transaction callbacks of other > > FDW implementations. > > No, not ignore the error. The FDW can emit a WARNING, LOG, or NOTICE message, and return an error code to TM. TM can also emit a message like: > > WARNING: failed to commit part of a transaction on the foreign server 'XXX' > HINT: The server continues to try committing the remote transaction. > > Then TM asks the resolver to take care of committing the remote transaction, and acknowledge the commit success to the client. It seems like if it failed to resolve, the backend would return an acknowledgment of COMMIT to the client and the resolver process resolves foreign prepared transactions in the background. So we can ensure that the distributed transaction is completed at the time when the client got an acknowledgment of COMMIT if the 2nd phase of 2PC is successfully completed in the first attempt. OTOH, if it failed for whatever reason, there is no such guarantee. From an optimistic perspective, i.e., the failures are unlikely to happen, it will work well but IMO it’s not uncommon to fail to resolve foreign transactions due to a network issue, especially in an unreliable network environment, for example a geo-distributed database. So I think it will end up requiring the client to check if preceding distributed transactions are completed or not in order to see the results of these transactions. We could retry the foreign transaction resolution before leaving it to the resolver process but the problem that the core continues trying to resolve foreign transactions without either aborting the transaction or rethrowing even after an error still remains. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Oct 6, 2020 at 7:22 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > You proposed the first idea > > > to avoid such a situation that FDW implementor can write the code > > > while trying to reduce the possibility of errors happening as much as > > > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and > > > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive > > > solution. They might miss, not know it, or use other functions > > > provided by the core that could lead to an error. > > > > We can give the guideline in the manual, can't we? It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc. And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they? Because most of the commit processing is performed in the resource manager's library (e.g. the Oracle and MySQL client libraries.) > > Yeah, if we think FDW implementors properly implement these APIs while > following the guideline, giving the guideline is a good idea. But I’m > not sure all FDW implementors are able to do that and even if the user > uses an FDW whose transaction APIs don’t follow the guideline, the > user won’t realize it. IMO it’s better to design the feature while not > depending on external programs for reliability (correctness?) of this > feature, although I might be too worried. > +1 for that. I don't think it's even in the hands of implementers to avoid throwing an error in all conditions. -- Best Wishes, Ashutosh Bapat
On Tue, Oct 6, 2020 at 10:52 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Fri, 2 Oct 2020 at 18:20, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > You proposed the first idea > > > to avoid such a situation that FDW implementor can write the code > > > while trying to reduce the possibility of errors happening as much as > > > possible, for example by using palloc_extended(MCXT_ALLOC_NO_OOM) and > > > hash_search(HASH_ENTER_NULL) but I think it's not a comprehensive > > > solution. They might miss, not know it, or use other functions > > > provided by the core that could lead to an error. > > > > We can give the guideline in the manual, can't we? It should not be especially difficult for the FDW implementor compared to other Postgres's extensibility features that have their own rules -- table/index AM, user-defined C function, trigger function in C, user-defined data types, hooks, etc. And, the Postgres functions that the FDW implementor would use to implement their commit will be very limited, won't they? Because most of the commit processing is performed in the resource manager's library (e.g. the Oracle and MySQL client libraries.) > > Yeah, if we think FDW implementors properly implement these APIs while > following the guideline, giving the guideline is a good idea. But I’m > not sure all FDW implementors are able to do that and even if the user > uses an FDW whose transaction APIs don’t follow the guideline, the > user won’t realize it. IMO it’s better to design the feature while not > depending on external programs for reliability (correctness?) of this > feature, although I might be too worried. > After more thoughts on Tsunakawa-san’s idea it seems to need the following conditions: * At least postgres_fdw is viable to implement these APIs while guaranteeing not to happen any error.
* A certain number of FDWs (or majority of FDWs) can do that in a similar way to postgres_fdw by using the guideline and probably postgres_fdw as a reference. These are necessary for FDW implementors to implement APIs while following the guideline and for the core to trust them. As far as postgres_fdw goes, what we need to do when committing a foreign transaction resolution is to get a connection from the connection cache or create and connect if not found, construct a SQL query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size buffer, send the query, and get the result. The possible place to raise an error is limited. In case of failures such as connection error FDW can return false to the core along with a flag indicating to ask the core retry. Then the core will retry to resolve foreign transactions after some sleep. OTOH if FDW sized up that there is no hope of resolving the foreign transaction, it also could return false to the core along with another flag indicating to remove the entry and not to retry. Also, the transaction resolution by FDW needs to be cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS(). Probably, as Tsunakawa-san also suggested, it’s not impossible to implement these APIs in postgres_fdw while guaranteeing not to happen any error, although not sure the code complexity. So I think the first condition may be true but not sure about the second assumption, particularly about the interruptible part. I thought we could support both ideas to get their pros; supporting Tsunakawa-san's idea and then my idea if necessary, and FDW can choose whether to ask the resolver process to perform 2nd phase of 2PC or not. But it's not a good idea in terms of complexity. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Sorry to be late to respond. (My PC is behaving strangely after upgrading Win10 2004) From: Masahiko Sawada <sawada.mshk@gmail.com> > After more thoughts on Tsunakawa-san’s idea it seems to need the > following conditions: > > * At least postgres_fdw is viable to implement these APIs while > guaranteeing not to happen any error. > * A certain number of FDWs (or majority of FDWs) can do that in a > similar way to postgres_fdw by using the guideline and probably > postgres_fdw as a reference. > > These are necessary for FDW implementors to implement APIs while > following the guideline and for the core to trust them. > > As far as postgres_fdw goes, what we need to do when committing a > foreign transaction resolution is to get a connection from the > connection cache or create and connect if not found, construct a SQL > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size > buffer, send the query, and get the result. The possible place to > raise an error is limited. In case of failures such as connection > error FDW can return false to the core along with a flag indicating to > ask the core retry. Then the core will retry to resolve foreign > transactions after some sleep. OTOH if FDW sized up that there is no > hope of resolving the foreign transaction, it also could return false > to the core along with another flag indicating to remove the entry and > not to retry. Also, the transaction resolution by FDW needs to be > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS(). > > Probably, as Tsunakawa-san also suggested, it’s not impossible to > implement these APIs in postgres_fdw while guaranteeing not to happen > any error, although not sure the code complexity. So I think the first > condition may be true but not sure about the second assumption, > particularly about the interruptible part. Yeah, I expect the commit of the second phase should not be difficult for the FDW developer. 
As for the cancellation during commit retry, I don't think we necessarily have to make the TM responsible for retrying the commits. Many DBMSs have their own timeout functionality such as connection timeout, socket timeout, and statement timeout. Users can set those parameters in the foreign server options based on how long the end user can wait. That is, TM calls FDW's commit routine just once. If the TM makes efforts to retry commits, the duration would be from a few seconds to 30 seconds. Then, we can hold back the cancellation during that period. > I thought we could support both ideas to get their pros; supporting > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose > whether to ask the resolver process to perform 2nd phase of 2PC or > not. But it's not a good idea in terms of complexity. I don't feel the need for leaving the commit to the resolver during normal operation. > It seems like if it failed to resolve, the backend would return an > acknowledgment of COMMIT to the client and the resolver process > resolves foreign prepared transactions in the background. So we can > ensure that the distributed transaction is completed at the time when > the client got an acknowledgment of COMMIT if the 2nd phase of 2PC is > successfully completed in the first attempt. OTOH, if it failed for > whatever reason, there is no such guarantee. From an optimistic > perspective, i.e., the failures are unlikely to happen, it will work > well but IMO it’s not uncommon to fail to resolve foreign transactions > due to a network issue, especially in an unreliable network environment, > for example a geo-distributed database. So I think it will end up > requiring the client to check if preceding distributed transactions > are completed or not in order to see the results of these > transactions. That issue exists with any method, doesn't it? Regards Takayuki Tsunakawa
On Thu, 8 Oct 2020 at 18:05, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > Sorry to be late to respond. (My PC is behaving strangely after upgrading Win10 2004) > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > After more thoughts on Tsunakawa-san’s idea it seems to need the > > following conditions: > > > > * At least postgres_fdw is viable to implement these APIs while > > guaranteeing not to happen any error. > > * A certain number of FDWs (or majority of FDWs) can do that in a > > similar way to postgres_fdw by using the guideline and probably > > postgres_fdw as a reference. > > > > These are necessary for FDW implementors to implement APIs while > > following the guideline and for the core to trust them. > > > > As far as postgres_fdw goes, what we need to do when committing a > > foreign transaction resolution is to get a connection from the > > connection cache or create and connect if not found, construct a SQL > > query (COMMIT/ROLLBACK PREPARED with identifier) using a fixed-size > > buffer, send the query, and get the result. The possible place to > > raise an error is limited. In case of failures such as connection > > error FDW can return false to the core along with a flag indicating to > > ask the core retry. Then the core will retry to resolve foreign > > transactions after some sleep. OTOH if FDW sized up that there is no > > hope of resolving the foreign transaction, it also could return false > > to the core along with another flag indicating to remove the entry and > > not to retry. Also, the transaction resolution by FDW needs to be > > cancellable (interruptible) but cannot use CHECK_FOR_INTERRUPTS(). > > > > Probably, as Tsunakawa-san also suggested, it’s not impossible to > > implement these APIs in postgres_fdw while guaranteeing not to happen > > any error, although not sure the code complexity. 
> > So I think the first > > condition may be true but not sure about the second assumption, > > particularly about the interruptible part. > > Yeah, I expect the commit of the second phase should not be difficult for the FDW developer. > > As for the cancellation during commit retry, I don't think we necessarily have to make the TM responsible for retrying the commits. Many DBMSs have their own timeout functionality such as connection timeout, socket timeout, and statement timeout. > Users can set those parameters in the foreign server options based on how long the end user can wait. That is, TM calls FDW's commit routine just once. What about temporary network failures? I think there are users who don't want to give up resolving foreign transactions failed due to a temporary network failure. Or even they might want to wait for transaction completion until they send a cancel request. If we want to call the commit routine only once and therefore want FDW to retry connecting the foreign server within the call, it means we require all FDW implementors to write a retry loop code that is interruptible and ensures not to raise an error, which increases difficulty. Also, what if the user sets the statement timeout to 60 sec and they want to cancel the waits after 5 sec by pressing Ctrl-C? You mentioned that client libraries of other DBMSs don't have asynchronous execution functionality. If the SQL execution function is not interruptible, the user will end up waiting for 60 sec, which seems not good. > If the TM makes efforts to retry commits, the duration would be from a few seconds to 30 seconds. Then, we can hold back the cancellation during that period. > > > > I thought we could support both ideas to get their pros; supporting > > Tsunakawa-san's idea and then my idea if necessary, and FDW can choose > > whether to ask the resolver process to perform 2nd phase of 2PC or > > not. But it's not a good idea in terms of complexity.
> > I don't feel the need for leaving the commit to the resolver during normal operation. I meant it's for FDWs that cannot guarantee that no error happens during resolution. > > It seems like if it failed to resolve, the backend would return an > > acknowledgment of COMMIT to the client and the resolver process > > resolves foreign prepared transactions in the background. So we can > > ensure that the distributed transaction is completed at the time when > > the client got an acknowledgment of COMMIT if the 2nd phase of 2PC is > > successfully completed in the first attempt. OTOH, if it failed for > > whatever reason, there is no such guarantee. From an optimistic > > perspective, i.e., the failures are unlikely to happen, it will work > > well but IMO it’s not uncommon to fail to resolve foreign transactions > > due to a network issue, especially in an unreliable network environment > > for example a geo-distributed database. So I think it will end up > > requiring the client to check if preceding distributed transactions > > are completed or not in order to see the results of these > > transactions. > > That issue exists with any method, doesn't it? Yes, but if we don’t retry to resolve foreign transactions at all on an unreliable network environment, the user might end up requiring every transaction to check the status of foreign transactions of the previous distributed transaction before it starts. If we allow to do retry, I guess we ease that somewhat. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > What about temporary network failures? I think there are users who > don't want to give up resolving foreign transactions failed due to a > temporary network failure. Or even they might want to wait for > transaction completion until they send a cancel request. If we want to > call the commit routine only once and therefore want FDW to retry > connecting the foreign server within the call, it means we require all > FDW implementors to write a retry loop code that is interruptible and > ensures not to raise an error, which increases difficulty. > > Yes, but if we don’t retry to resolve foreign transactions at all on > an unreliable network environment, the user might end up requiring > every transaction to check the status of foreign transactions of the > previous distributed transaction before it starts. If we allow to do > retry, I guess we ease that somewhat. OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. Then, we can have a commit retry timeout or retry count like the following WebLogic manual says. (I couldn't quickly find the English manual, so below is in Japanese. I quoted some text that got through machine translation, which appears a bit strange.) https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm -------------------------------------------------- Abandon timeout Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction. In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abandon transaction timer expires, no attempt is made to resolve the transaction.
If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction. -------------------------------------------------- > Also, what if the user sets the statement timeout to 60 sec and they > want to cancel the waits after 5 sec by pressing Ctrl-C? You mentioned > that client libraries of other DBMSs don't have asynchronous execution > functionality. If the SQL execution function is not interruptible, the > user will end up waiting for 60 sec, which seems not good. FDW functions can be uninterruptible in general, can't they? We experienced that odbc_fdw didn't allow cancellation of SQL execution. Regards Takayuki Tsunakawa
At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > What about temporary network failures? I think there are users who > > don't want to give up resolving foreign transactions failed due to a > > temporary network failure. Or even they might want to wait for > > transaction completion until they send a cancel request. If we want to > > call the commit routine only once and therefore want FDW to retry > > connecting the foreign server within the call, it means we require all > > FDW implementors to write a retry loop code that is interruptible and > > ensures not to raise an error, which increases difficulty. > > > > Yes, but if we don’t retry to resolve foreign transactions at all on > > an unreliable network environment, the user might end up requiring > > every transaction to check the status of foreign transactions of the > > previous distributed transaction before it starts. If we allow to do > > retry, I guess we ease that somewhat. > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. I must be missing something, though... I don't understand why we hate ERRORs from fdw-2pc-commit routine so much. I think remote-commits should be performed before local commit passes the point-of-no-return and the v26-0002 actually places AtEOXact_FdwXact() before the critical section. (FWIW, I think remote commits should be performed by backends, not by another process, because backends should wait for all remote-commits to end anyway and it is simpler. If we want to run multiple remote-commits in parallel, we could do that by adding some async-waiting interface.) > Then, we can have a commit retry timeout or retry count like the following WebLogic manual says.
> (I couldn't quickly find the English manual, so below is in Japanese. I quoted some text that got through machine translation, which appears a bit strange.) > > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm > -------------------------------------------------- > Abandon timeout > Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction. > > In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abandon transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction. > -------------------------------------------------- That's not a retry timeout but a timeout for the total time of all 2nd-phase-commits. But I think it would be sufficient. Even if an fdw could retry 2pc-commit, it's a matter of that fdw and the core has nothing to do with it. > > Also, what if the user sets the statement timeout to 60 sec and they > > want to cancel the waits after 5 sec by pressing Ctrl-C? You mentioned > > that client libraries of other DBMSs don't have asynchronous execution > > functionality. If the SQL execution function is not interruptible, the > > user will end up waiting for 60 sec, which seems not good. I think fdw-2pc-commit can be interruptible safely as far as we run the remote commits before entering the critical section of local commit. > FDW functions can be uninterruptible in general, can't they? We experienced that odbc_fdw didn't allow cancellation of SQL execution. At least postgres_fdw is interruptible while waiting for the remote.
create view lt as select 1 as slp from (select pg_sleep(10)) t; create foreign table ft(slp int) server sv1 options (table_name 'lt'); select * from ft; ^CCancel request sent ERROR: canceling statement due to user request Regards. -- Kyotaro Horiguchi NTT Open Source Software Center
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > I don't understand why we hate ERRORs from fdw-2pc-commit routine so > much. I think remote-commits should be performed before local commit > passes the point-of-no-return and the v26-0002 actually places > AtEOXact_FdwXact() before the critical section. I don't hate ERROR, but it would be simpler and understandable for the FDW commit routine to just return control to the caller (TM) and let TM do whatever is appropriate (asks the resolver to handle the failed commit, and continues to request the next FDW to commit.) > > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm > > -------------------------------------------------- > > Abandon timeout > > Specifies the maximum time (in seconds) that the transaction manager > attempts to complete the second phase of a two-phase commit transaction. > > > > In the second phase of a two-phase commit transaction, the transaction > manager attempts to complete the transaction until all resource managers > indicate that the transaction is complete. After the abandon transaction timer > expires, no attempt is made to resolve the transaction. If the transaction enters > a ready state before it is destroyed, the transaction manager rolls back the > transaction and releases the held lock on behalf of the destroyed transaction. > > -------------------------------------------------- > > That's not a retry timeout but a timeout for total time of all > 2nd-phase-commits. But I think it would be sufficient. Even if an > fdw could retry 2pc-commit, it's a matter of that fdw and the core has > nothing to do with. Yeah, the WebLogic documentation doesn't say whether it performs retries during the timeout period. I just cited it as an example that has a timeout parameter for the second phase of 2PC. > At least postgres_fdw is interruptible while waiting for the remote.
> > create view lt as select 1 as slp from (select pg_sleep(10)) t; > create foreign table ft(slp int) server sv1 options (table_name 'lt'); > select * from ft; > ^CCancel request sent > ERROR: canceling statement due to user request I'm afraid the cancellation doesn't work while postgres_fdw is trying to connect to a down server. Also, the Postgres manual doesn't say anything about cancellation, so we cannot expect FDWs to respond to a user's cancel request. Regards Takayuki Tsunakawa
On Fri, 9 Oct 2020 at 11:33, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > What about temporary network failures? I think there are users who > > don't want to give up resolving foreign transactions failed due to a > > temporary network failure. Or even they might want to wait for > > transaction completion until they send a cancel request. If we want to > > call the commit routine only once and therefore want FDW to retry > > connecting the foreign server within the call, it means we require all > > FDW implementors to write a retry loop code that is interruptible and > > ensures not to raise an error, which increases difficulty. > > > > Yes, but if we don’t retry to resolve foreign transactions at all on > > an unreliable network environment, the user might end up requiring > > every transaction to check the status of foreign transactions of the > > previous distributed transaction before it starts. If we allow to do > > retry, I guess we ease that somewhat. > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. Well, I agree that it's not mandatory. I think it's better if the user can choose. I also doubt how useful the per-foreign-server timeout setting you mentioned before is. For example, suppose the transaction involves three foreign servers that have different timeout settings: what if the backend failed to commit on the first one of the servers due to timeout? Does it attempt to commit on the other two servers? Or does it give up and return the control to the client? In the former case, what if the backend failed again on one of the other two servers due to timeout?
The backend might end up waiting for all timeouts and in practice the user is not aware of how many servers are involved in the transaction, for example in sharding. So it seems to be hard to predict the total timeout. In the latter case, the backend might succeed to commit on the other two nodes. Also, the timeout setting of the first foreign server virtually is used as the whole foreign transaction resolution timeout. However, the user cannot control the order of resolution. So again it seems to be hard for the user to predict the timeout. So if we have a timeout mechanism, I think it's better if the user can control the timeout for each transaction. Probably the same is true for the retry. > > Then, we can have a commit retry timeout or retry count like the following WebLogic manual says. (I couldn't quickly find the English manual, so below is in Japanese. I quoted some text that got through machine translation, which appears a bit strange.) > > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm > -------------------------------------------------- > Abandon timeout > Specifies the maximum time (in seconds) that the transaction manager attempts to complete the second phase of a two-phase commit transaction. > > In the second phase of a two-phase commit transaction, the transaction manager attempts to complete the transaction until all resource managers indicate that the transaction is complete. After the abandon transaction timer expires, no attempt is made to resolve the transaction. If the transaction enters a ready state before it is destroyed, the transaction manager rolls back the transaction and releases the held lock on behalf of the destroyed transaction. > -------------------------------------------------- Yeah, a per-transaction timeout for the 2nd phase of 2PC seems a good idea. > > > > > Also, what if the user sets the statement timeout to 60 sec and they > > want to cancel the waits after 5 sec by pressing Ctrl-C?
> > You mentioned > > that client libraries of other DBMSs don't have asynchronous execution > > functionality. If the SQL execution function is not interruptible, the > > user will end up waiting for 60 sec, which seems not good. > > FDW functions can be uninterruptible in general, can't they? We experienced that odbc_fdw didn't allow cancellation of SQL execution. For example in postgres_fdw, it executes SQL in an asynchronous manner using PQsendQuery(), PQconsumeInput(), PQgetResult() and so on (see do_sql_command() and pgfdw_get_result()). Therefore, if the user presses Ctrl-C, the remote query will be canceled and an ERROR raised. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > What about temporary network failures? I think there are users who > > > don't want to give up resolving foreign transactions failed due to a > > > temporary network failure. Or even they might want to wait for > > > transaction completion until they send a cancel request. If we want to > > > call the commit routine only once and therefore want FDW to retry > > > connecting the foreign server within the call, it means we require all > > > FDW implementors to write retry loop code that is interruptible and > > > ensures not to raise an error, which increases difficulty. > > > > > > Yes, but if we don't retry to resolve foreign transactions at all on > > > an unreliable network environment, the user might end up requiring > > > every transaction to check the status of foreign transactions of the > > > previous distributed transaction before it starts. If we allow > > > retries, I guess we ease that somewhat. > > > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. > > I might be missing something, though... > > I don't understand why we hate ERRORs from the fdw-2pc-commit routine so > much. I think remote commits should be performed before local commit > passes the point-of-no-return, and the v26-0002 actually places > AtEOXact_FdwXact() before the critical section. > So you're thinking of the following sequence? 1. Prepare all foreign transactions. 2. Commit all the prepared foreign transactions. 3. Commit the local transaction.
Suppose we have the backend process call the commit routine: what if one of the FDWs raises an ERROR while committing its foreign transaction after other foreign transactions have already been committed? The transaction will end up aborted, but some foreign transactions are already committed. Also, what if the backend process fails to commit the local transaction? Since it has already committed all foreign transactions, it cannot ensure global atomicity in this case either. Therefore, I think we should commit distributed transactions in the following sequence: 1. Prepare all foreign transactions. 2. Commit the local transaction. 3. Commit all the prepared foreign transactions. But this is still not a perfect solution. If we have the backend process call the commit routine and an error happens while executing the commit routine of an FDW (i.e., at step 3), it's too late to report an error to the client because we have already committed the local transaction. So the current solution is to have a background process commit the foreign transactions so that the backend can just wait without the possibility of errors. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
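The difference between the two orderings can be made concrete with a toy failure model. This is only an illustration of the argument above, not the patch's API: `fail_at` names the step that raises an ERROR (0 = none), and the enum classifies what state the cluster is left in.

```c
#include <assert.h>

typedef enum
{
    ATOMIC_COMMIT,   /* every participant committed */
    ATOMIC_ABORT,    /* every participant rolled back */
    BROKEN,          /* some committed, some aborted: atomicity lost */
    IN_DOUBT         /* prepared xacts survive; resolvable by retry */
} result;

/* Ordering A: prepare foreign, commit foreign, commit local. */
static result commit_foreign_first(int fail_at)
{
    if (fail_at == 1) return ATOMIC_ABORT;  /* prepare failed: roll back all */
    if (fail_at == 2) return BROKEN;        /* ERROR after some foreign commits */
    if (fail_at == 3) return BROKEN;        /* foreign committed, local aborted */
    return ATOMIC_COMMIT;
}

/* Ordering B (the sequence above): prepare foreign, commit local,
 * commit foreign.  The local commit is the deciding vote. */
static result commit_local_first(int fail_at)
{
    if (fail_at == 1) return ATOMIC_ABORT;  /* prepare failed: roll back all */
    if (fail_at == 2) return ATOMIC_ABORT;  /* local aborted: roll back prepared */
    if (fail_at == 3) return IN_DOUBT;      /* prepared xacts remain; a resolver
                                               retries COMMIT PREPARED later */
    return ATOMIC_COMMIT;
}
```

Ordering B never reaches the BROKEN state; its remaining weakness is exactly the IN_DOUBT case at step 3, which motivates handing the work to a background resolver.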
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > I also doubt how useful the per-foreign-server timeout setting you > mentioned before. For example, suppose the transaction involves > three foreign servers that have different timeout settings, what if the > backend fails to commit on the first one of the servers due to > timeout? Does it attempt to commit on the other two servers? Or does > it give up and return the control to the client? In the former case, > what if the backend fails again on one of the other two servers due > to timeout? The backend might end up waiting for all timeouts and in > practice the user is not aware of how many servers are involved with > the transaction, for example in a sharded setup. So it seems to be hard to > predict the total timeout. In the latter case, the backend might > succeed in committing on the other two nodes. Also, the timeout setting of > the first foreign server is virtually used as the whole foreign > transaction resolution timeout. However, the user cannot control the > order of resolution. So again it seems to be hard for the user to > predict the timeout. So if we have a timeout mechanism, I think it's > better if the user can control the timeout for each transaction. > Probably the same is true for the retry. I agree that the user can control the timeout per transaction, not per FDW. I was just not sure if the Postgres core can define the timeout parameter and the FDWs can follow its setting. However, JTA defines a transaction timeout API (not a commit timeout, though), and each RM can choose to implement them. So I think we can define the parameter and/or routines for the timeout in core likewise. -------------------------------------------------- public interface javax.transaction.xa.XAResource int getTransactionTimeout() throws XAException This method returns the transaction timeout value set for this XAResource instance. If XAResource.setTransactionTimeout was not used prior to invoking this method, the return value is the default timeout set for the resource manager; otherwise, the value used in the previous setTransactionTimeout call is returned. Throws: XAException An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL. Returns: The transaction timeout value in seconds. boolean setTransactionTimeout(int seconds) throws XAException This method sets the transaction timeout value for this XAResource instance. Once set, this timeout value is effective until setTransactionTimeout is invoked again with a different value. To reset the timeout value to the default value used by the resource manager, set the value to zero. If the timeout operation is performed successfully, the method returns true; otherwise false. If a resource manager does not support the transaction timeout value being set explicitly, this method returns false. Parameters: seconds A positive integer specifying the timeout value in seconds. Zero resets the transaction timeout value to the default one used by the resource manager. A negative value results in an XAException being thrown with the XAER_INVAL error code. Returns: true if the transaction timeout value is set successfully; otherwise false. Throws: XAException An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL, or XAER_INVAL. -------------------------------------------------- > For example, in postgres_fdw, it executes SQL in an asynchronous manner > using PQsendQuery(), PQconsumeInput() and PQgetResult() and so on > (see do_sql_command() and pgfdw_get_result()). Therefore, if the user > presses Ctrl-C, the remote query is canceled and an ERROR is raised. Yeah, as I replied to Horiguchi-san, postgres_fdw can cancel queries. But postgres_fdw is not ready to cancel connection establishment, is it?
At present, the user needs to set the connect_timeout parameter on the foreign server to a reasonably short time so that it can respond quickly to cancellation requests. Alternatively, we can modify postgres_fdw to use libpq's asynchronous connect functions. Another issue is that the Postgres manual does not stipulate anything about cancellation of FDW processing. That's why I said that the current FDW does not support cancellation in general. Of course, I think we can stipulate the ability to cancel processing in the FDW interface. Regards Takayuki Tsunakawa
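What cancellable connection establishment could look like with libpq's nonblocking API (PQconnectStart()/PQconnectPoll()) can be sketched roughly as below. This is a hedged model only: the libpq calls appear just in comments, and a step counter stands in for the connection state machine so the loop structure can be shown without a real server.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CONN_OK, CONN_CANCELLED, CONN_TIMED_OUT } conn_result;

/* Drive the (simulated) connection state machine one step at a time,
 * checking for user cancellation and a deadline between steps, instead
 * of blocking inside a synchronous PQconnectdb() call. */
static conn_result connect_cancellable(int steps_needed,
                                       int deadline_steps,
                                       const bool *cancel_flag)
{
    /* conn = PQconnectStart(conninfo); */
    for (int step = 0; step < deadline_steps; step++)
    {
        if (*cancel_flag)
            return CONN_CANCELLED;  /* would PQfinish(conn) and bail out */

        /* status = PQconnectPoll(conn); then wait on PQsocket(conn)
         * with a short poll() timeout so Ctrl-C is noticed promptly. */
        if (step + 1 >= steps_needed)
            return CONN_OK;         /* PGRES_POLLING_OK */
    }
    return CONN_TIMED_OUT;          /* our own deadline, no connect_timeout
                                       cooperation from the server needed */
}
```

The design point is that the timeout and the cancellation check both live in the caller's loop, so neither depends on the remote side or on connect_timeout being configured.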
On Mon, 12 Oct 2020 at 11:08, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > I also doubt how useful the per-foreign-server timeout setting you > > mentioned before. For example, suppose the transaction involves > > three foreign servers that have different timeout settings, what if the > > backend fails to commit on the first one of the servers due to > > timeout? Does it attempt to commit on the other two servers? Or does > > it give up and return the control to the client? In the former case, > > what if the backend fails again on one of the other two servers due > > to timeout? The backend might end up waiting for all timeouts and in > > practice the user is not aware of how many servers are involved with > > the transaction, for example in a sharded setup. So it seems to be hard to > > predict the total timeout. In the latter case, the backend might > > succeed in committing on the other two nodes. Also, the timeout setting of > > the first foreign server is virtually used as the whole foreign > > transaction resolution timeout. However, the user cannot control the > > order of resolution. So again it seems to be hard for the user to > > predict the timeout. So if we have a timeout mechanism, I think it's > > better if the user can control the timeout for each transaction. > > Probably the same is true for the retry. > > I agree that the user can control the timeout per transaction, not per FDW. I was just not sure if the Postgres core can define the timeout parameter and the FDWs can follow its setting. However, JTA defines a transaction timeout API (not a commit timeout, though), and each RM can choose to implement them. So I think we can define the parameter and/or routines for the timeout in core likewise. I was thinking of having a GUC timeout parameter like statement_timeout. The backend waits for that setting's value when resolving foreign transactions. But this idea seems different.
The FDW can set its timeout via a transaction timeout API, is that right? But even if the FDW can set the timeout using a transaction timeout API, the problem that client libraries for some DBMSs don't support interruptible functions still remains. The user can set the timeout to a short time, but that also leads to unnecessary timeouts. Thoughts? > > > -------------------------------------------------- > public interface javax.transaction.xa.XAResource > > int getTransactionTimeout() throws XAException > This method returns the transaction timeout value set for this XAResource instance. If > XAResource.setTransactionTimeout was not used prior to invoking this method, the return value is the > default timeout set for the resource manager; otherwise, the value used in the previous setTransactionTimeout call > is returned. > > Throws: XAException > An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL. > > Returns: > The transaction timeout value in seconds. > > boolean setTransactionTimeout(int seconds) throws XAException > This method sets the transaction timeout value for this XAResource instance. Once set, this timeout value > is effective until setTransactionTimeout is invoked again with a different value. To reset the timeout > value to the default value used by the resource manager, set the value to zero. > > If the timeout operation is performed successfully, the method returns true; otherwise false. If a resource > manager does not support the transaction timeout value being set explicitly, this method returns false. > > Parameters: > > seconds > A positive integer specifying the timeout value in seconds. Zero resets the transaction timeout > value to the default one used by the resource manager. A negative value results in an XAException > being thrown with the XAER_INVAL error code. > > Returns: > true if the transaction timeout value is set successfully; otherwise false. > > Throws: XAException > An error has occurred.
Possible exception values are: XAER_RMERR, XAER_RMFAIL, or > XAER_INVAL. > -------------------------------------------------- > > > > > For example, in postgres_fdw, it executes SQL in an asynchronous manner > > using PQsendQuery(), PQconsumeInput() and PQgetResult() and so on > > (see do_sql_command() and pgfdw_get_result()). Therefore, if the user > > presses Ctrl-C, the remote query is canceled and an ERROR is raised. > > Yeah, as I replied to Horiguchi-san, postgres_fdw can cancel queries. But postgres_fdw is not ready to cancel connection establishment, is it? At present, the user needs to set the connect_timeout parameter on the foreign server to a reasonably short time so that it can respond quickly to cancellation requests. Alternatively, we can modify postgres_fdw to use libpq's asynchronous connect functions. Yes, I think using asynchronous connect functions seems a good idea. > Another issue is that the Postgres manual does not stipulate anything about cancellation of FDW processing. That's why I said that the current FDW does not support cancellation in general. Of course, I think we can stipulate the ability to cancel processing in the FDW interface. Yeah, it's the FDW developer's responsibility to write remote-SQL execution code that is interruptible. +1 for adding that to the doc. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > I was thinking of having a GUC timeout parameter like statement_timeout. > The backend waits for the setting value when resolving foreign > transactions. Me too. > But this idea seems different. The FDW can set its timeout > via a transaction timeout API, is that right? I'm not perfectly sure about how the TM (application server) works, but probably not. The TM has a configuration parameter for the transaction timeout, and the TM calls XAResource.setTransactionTimeout() with that or a smaller value as the argument. > But even if the FDW can set > the timeout using a transaction timeout API, the problem that client > libraries for some DBMSs don't support interruptible functions still > remains. The user can set the timeout to a short time, but that also > leads to unnecessary timeouts. Thoughts? Unfortunately, I'm afraid we can do nothing about it. If the DBMS's client library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a function that cancels processing in progress), then the Postgres user just finds that he can't cancel queries (just like we experienced with odbc_fdw). Regards Takayuki Tsunakawa
At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > > What about temporary network failures? I think there are users who > > > > don't want to give up resolving foreign transactions failed due to a > > > > temporary network failure. Or even they might want to wait for > > > > transaction completion until they send a cancel request. If we want to > > > > call the commit routine only once and therefore want FDW to retry > > > > connecting the foreign server within the call, it means we require all > > > > FDW implementors to write retry loop code that is interruptible and > > > > ensures not to raise an error, which increases difficulty. > > > > > > > > Yes, but if we don't retry to resolve foreign transactions at all on > > > > an unreliable network environment, the user might end up requiring > > > > every transaction to check the status of foreign transactions of the > > > > previous distributed transaction before it starts. If we allow > > > > retries, I guess we ease that somewhat. > > > > > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. > > > > I might be missing something, though... > > > > I don't understand why we hate ERRORs from the fdw-2pc-commit routine so > > much. I think remote commits should be performed before local commit > > passes the point-of-no-return, and the v26-0002 actually places > > AtEOXact_FdwXact() before the critical section. > > > > So you're thinking of the following sequence? > > 1. Prepare all foreign transactions. > 2.
Commit all the prepared foreign transactions. > 3. Commit the local transaction. > > Suppose we have the backend process call the commit routine: what if > one of the FDWs raises an ERROR while committing its foreign transaction > after other foreign transactions have already been committed? The transaction will end > up aborted, but some foreign transactions are already committed. OK, I understand what you are aiming at. It is apparently outside the focus of the two-phase commit protocol. Each FDW server can try to keep the contract as far as its ability reaches, but in the end this kind of failure is inevitable. Even if we require FDW developers not to respond until a 2PC commit succeeds, that just leads the whole FDW cluster to freeze, even in cases that are not extremely bad. We have no choice other than shutting the server down (then the subsequent server start removes the garbage commits) or continuing to work while leaving some information in system storage (or reverting the garbage commits). What we can do in that case is provide an automated way to resolve the inconsistency. > Also, what if the backend process fails to commit the local > transaction? Since it has already committed all foreign transactions, it > cannot ensure global atomicity in this case either. Therefore, I > think we should commit the distributed transactions in the following > sequence: Ditto. It's out of the range of 2PC. Using 2PC for the local transaction could reduce that kind of failure, but I'm not sure. 3PC, 4PC, ... nPC could reduce the probability but can't eliminate failure cases. > 1. Prepare all foreign transactions. > 2. Commit the local transaction. > 3. Commit all the prepared foreign transactions. > > But this is still not a perfect solution. If we have the backend 2PC is not a perfect solution in the first place. Attaching a similar phase to it cannot make it "perfect".
> process call the commit routine and an error happens while executing > the commit routine of an FDW (i.e., at step 3), it's too late to report > an error to the client because we already committed the local > transaction. So the current solution is to have a background process > commit the foreign transactions so that the backend can just wait > without the possibility of errors. Whatever process tries to complete a transaction, the client must wait for the transaction to end, and anyway that's just a freeze from the client's view, unless you intend to respond with local commit before all participants complete. I don't think most client applications would wait for a frozen server forever. We have the same issue at the time the client decides to give up the transaction, or the leader session is killed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > > > What about temporary network failures? I think there are users who > > > > > don't want to give up resolving foreign transactions failed due to a > > > > > temporary network failure. Or even they might want to wait for > > > > > transaction completion until they send a cancel request. If we want to > > > > > call the commit routine only once and therefore want FDW to retry > > > > > connecting the foreign server within the call, it means we require all > > > > > FDW implementors to write retry loop code that is interruptible and > > > > > ensures not to raise an error, which increases difficulty. > > > > > > > > > > Yes, but if we don't retry to resolve foreign transactions at all on > > > > > an unreliable network environment, the user might end up requiring > > > > > every transaction to check the status of foreign transactions of the > > > > > previous distributed transaction before it starts. If we allow > > > > > retries, I guess we ease that somewhat. > > > > > > > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. > > > > > > I might be missing something, though... > > > > > > I don't understand why we hate ERRORs from the fdw-2pc-commit routine so > > > much.
I think remote commits should be performed before local commit > > > passes the point-of-no-return, and the v26-0002 actually places > > > AtEOXact_FdwXact() before the critical section. > > > > > > > So you're thinking of the following sequence? > > > > 1. Prepare all foreign transactions. > > 2. Commit all the prepared foreign transactions. > > 3. Commit the local transaction. > > > > Suppose we have the backend process call the commit routine: what if > > one of the FDWs raises an ERROR while committing its foreign transaction > > after other foreign transactions have already been committed? The transaction will end > > up aborted, but some foreign transactions are already committed. > > OK, I understand what you are aiming at. > > It is apparently outside the focus of the two-phase commit > protocol. Each FDW server can try to keep the contract as far as its > ability reaches, but in the end this kind of failure is > inevitable. Even if we require FDW developers not to respond until a > 2PC commit succeeds, that just leads the whole FDW cluster to freeze, > even in cases that are not extremely bad. > > We have no choice other than shutting the server down (then the > subsequent server start removes the garbage commits) or continuing > to work while leaving some information in system storage (or reverting the > garbage commits). What we can do in that case is provide an > automated way to resolve the inconsistency. > > > Also, what if the backend process fails to commit the local > > transaction? Since it has already committed all foreign transactions, it > > cannot ensure global atomicity in this case either. Therefore, I > > think we should commit the distributed transactions in the following > > sequence: > > Ditto. It's out of the range of 2PC. Using 2PC for the local transaction > could reduce that kind of failure, but I'm not sure. 3PC, 4PC, ... nPC > could reduce the probability but can't eliminate failure cases.
IMO the problems I mentioned arise from the fact that the above sequence doesn't really follow the 2PC protocol in the first place. We can think of committing the local transaction without preparation, while preparing the foreign transactions, as using 2PC with the last resource transaction optimization (or last agent optimization)[1]. That is, we prepare all foreign transactions first, and the local node is always the last resource to process. At this time, the outcome of the distributed transaction completely depends on the fate of the last resource (i.e., the local transaction). If it fails, the distributed transaction must be aborted by rolling back the prepared foreign transactions. OTOH, if it succeeds, all prepared foreign transactions must be committed. Therefore, we don't need to prepare the last resource and can simply commit it. In this way, if we want to commit the local transaction without preparation, the local transaction must be committed last. But since the above sequence doesn't follow this protocol, we will have such problems. I think if we follow 2PC properly, such basic failures don't happen. > > > 1. Prepare all foreign transactions. > > 2. Commit the local transaction. > > 3. Commit all the prepared foreign transactions. > > > > But this is still not a perfect solution. If we have the backend > > 2PC is not a perfect solution in the first place. Attaching a similar > phase to it cannot make it "perfect". > > > process call the commit routine and an error happens while executing > > the commit routine of an FDW (i.e., at step 3), it's too late to report > > an error to the client because we already committed the local > > transaction.
> > Whatever process tries to complete a transaction, the client must wait > for the transaction to end and anyway that's just a freeze in the > client's view, unless you intended to respond to local commit before > all participants complete. Yes, but the point of using a separate process is that even if the FDW code raises an error, the client waiting for transaction resolution doesn't receive it, and the wait is interruptible. [1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
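The division of labor argued for above can be sketched as a tiny model. This is only an illustration, not the patch's code: the resolver (a separate background process in the real design) is modeled as a step function that absorbs transient FDW failures by retrying, while the backend merely polls the shared state and never runs the FDW commit routine itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Shared state for one prepared foreign transaction (illustrative). */
typedef struct
{
    int  attempts;        /* how often the resolver has tried so far */
    int  fails_before_ok; /* injected transient failures */
    bool committed;
} fdw_xact;

/* One scheduling quantum of the resolver: try COMMIT PREPARED, and
 * swallow (retry later) anything that would have been an ERROR in
 * the backend.  Errors therefore never reach the waiting client. */
static void resolver_step(fdw_xact *x)
{
    if (x->committed)
        return;
    x->attempts++;
    if (x->attempts > x->fails_before_ok)
        x->committed = true;    /* COMMIT PREPARED succeeded */
}

/* The backend: wait until the resolver reports success.  A real
 * backend would sleep on a latch and check for query cancellation
 * on each wakeup rather than drive the resolver itself. */
static int backend_wait_for_resolution(fdw_xact *x)
{
    while (!x->committed)
        resolver_step(x);
    return x->attempts;
}
```

Because only the resolver touches the FDW, the backend's wait loop contains nothing that can raise, which is exactly the "wait without the possibility of errors" property described earlier in the thread.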
At Tue, 13 Oct 2020 11:56:51 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > > > > What about temporary network failures? I think there are users who > > > > > > don't want to give up resolving foreign transactions failed due to a > > > > > > temporary network failure. Or even they might want to wait for > > > > > > transaction completion until they send a cancel request. If we want to > > > > > > call the commit routine only once and therefore want FDW to retry > > > > > > connecting the foreign server within the call, it means we require all > > > > > > FDW implementors to write retry loop code that is interruptible and > > > > > > ensures not to raise an error, which increases difficulty. > > > > > > > > > > > > Yes, but if we don't retry to resolve foreign transactions at all on > > > > > > an unreliable network environment, the user might end up requiring > > > > > > every transaction to check the status of foreign transactions of the > > > > > > previous distributed transaction before it starts. If we allow > > > > > > retries, I guess we ease that somewhat. > > > > > > > > > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too. > > > > > > > > I might be missing something, though... > > > > > > > > I don't understand why we hate ERRORs from the fdw-2pc-commit routine so > > > > much.
I think remote commits should be performed before local commit > > > > passes the point-of-no-return, and the v26-0002 actually places > > > > AtEOXact_FdwXact() before the critical section. > > > > > > > > > > So you're thinking of the following sequence? > > > > > > 1. Prepare all foreign transactions. > > > 2. Commit all the prepared foreign transactions. > > > 3. Commit the local transaction. > > > > > > Suppose we have the backend process call the commit routine: what if > > > one of the FDWs raises an ERROR while committing its foreign transaction > > > after other foreign transactions have already been committed? The transaction will end > > > up aborted, but some foreign transactions are already committed. > > > > Ok, I understand what you are aiming at. > > > > It is apparently outside the focus of the two-phase commit > > protocol. Each FDW server can try to keep the contract as far as its > > ability reaches, but in the end this kind of failure is > > inevitable. Even if we require FDW developers not to respond until a > > 2PC commit succeeds, that just leads the whole FDW cluster to freeze, > > even in cases that are not extremely bad. > > > > We have no choice other than shutting the server down (then the > > subsequent server start removes the garbage commits) or continuing > > to work while leaving some information in system storage (or reverting the > > garbage commits). What we can do in that case is provide an > > automated way to resolve the inconsistency. > > > > > Also, what if the backend process fails to commit the local > > > transaction? Since it has already committed all foreign transactions, it > > > cannot ensure global atomicity in this case either. Therefore, I > > > think we should commit the distributed transactions in the following > > > sequence: > > > > Ditto. It's out of the range of 2PC. Using 2PC for the local transaction > > could reduce that kind of failure, but I'm not sure. 3PC, 4PC, ... nPC > > could reduce the probability but can't eliminate failure cases.
> > IMO the problems I mentioned arise from the fact that the above > sequence doesn't really follow the 2PC protocol in the first place. > > We can think of committing the local transaction without > preparation, while preparing the foreign transactions, as using > 2PC with the last resource transaction optimization (or last agent > optimization)[1]. That is, we prepare all foreign transactions first > and the local node is always the last resource to process. At this > time, the outcome of the distributed transaction completely depends on > the fate of the last resource (i.e., the local transaction). If it > fails, the distributed transaction must be aborted by rolling back the > prepared foreign transactions. OTOH, if it succeeds, all prepared > foreign transactions must be committed. Therefore, we don't need to > prepare the last resource and can commit it. In this way, if we want There are cases of local-transaction commit failure caused by too many notifications or by serialization failure. > to commit the local transaction without preparation, the local > transaction must be committed last. But since the above sequence > doesn't follow this protocol, we will have such problems. I think if > we follow 2PC properly, such basic failures don't happen. True. But I haven't suggested that sequence. > > > 1. Prepare all foreign transactions. > > > 2. Commit the local transaction. > > > 3. Commit all the prepared foreign transactions. > > > > > > But this is still not a perfect solution. If we have the backend > > > > 2PC is not a perfect solution in the first place. Attaching a similar > > phase to it cannot make it "perfect". > > > > > process call the commit routine and an error happens while executing > > > the commit routine of an FDW (i.e., at step 3), it's too late to report > > > an error to the client because we already committed the local > > > transaction.
So the current solution is to have a background process > > > commit the foreign transactions so that the backend can just wait > > > without the possibility of errors. > > > > Whatever process tries to complete a transaction, the client must wait > > for the transaction to end and anyway that's just a freeze in the > > client's view, unless you intended to respond to local commit before > > all participants complete. > > Yes, but the point of using a separate process is that even if the FDW > code raises an error, the client waiting for transaction resolution > doesn't receive it, and the wait is interruptible. > > [1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html I don't get the point. If FDW-commit is called in the same process, an error from FDW-commit outright leads to the failure of the current commit. Isn't "the client waiting for transaction resolution" the client of the leader process of the 2PC commit in the same-process model? I might be missing something, but postgres_fdw allows query cancellation at commit time. (But I think it depends on timing whether the remote commit is completed or aborted.) Perhaps the feature was introduced after the project started? > commit ae9bfc5d65123aaa0d1cca9988037489760bdeae > Author: Robert Haas <rhaas@postgresql.org> > Date: Wed Jun 7 15:14:55 2017 -0400 > > postgres_fdw: Allow cancellation of transaction control commands. I thought we were discussing FDW errors during the 2PC commit phase. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 13 Oct 2020 11:56:51 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > > At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > > > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > > > > > > At Fri, 9 Oct 2020 02:33:37 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > > > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > > > > > What about temporary network failures? I think there are users who > > > > > > > don't want to give up resolving foreign transactions failed due to a > > > > > > > temporary network failure. Or even they might want to wait for > > > > > > > transaction completion until they send a cancel request. If we want to > > > > > > > call the commit routine only once and therefore want FDW to retry > > > > > > > connecting the foreign server within the call, it means we require all > > > > > > > FDW implementors to write a retry loop code that is interruptible and > > > > > > > ensures not to raise an error, which increases difficulty. > > > > > > > > > > > > > > Yes, but if we don’t retry to resolve foreign transactions at all on > > > > > > > an unreliable network environment, the user might end up requiring > > > > > > > every transaction to check the status of foreign transactions of the > > > > > > > previous distributed transaction before starts. If we allow to do > > > > > > > retry, I guess we ease that somewhat. > > > > > > > > > > > > OK. As I said, I'm not against trying to cope with temporary network failure. I just don't think it's mandatory. If the network failure is really temporary and thus recovers soon, then the resolver will be able to commit the transaction soon, too.
> > > > > > > > > > I should missing something, though... > > > > > > > > > > I don't understand why we hate ERRORs from fdw-2pc-commit routine so > > > > > much. I think remote-commits should be performed before local commit > > > > > passes the point-of-no-return and the v26-0002 actually places > > > > > AtEOXact_FdwXact() before the critical section. > > > > > > > > > > > > > So you're thinking the following sequence? > > > > > > > > 1. Prepare all foreign transactions. > > > > 2. Commit the all prepared foreign transactions. > > > > 3. Commit the local transaction. > > > > > > > > Suppose we have the backend process call the commit routine, what if > > > > one of FDW raises an ERROR during committing the foreign transaction > > > > after committing other foreign transactions? The transaction will end > > > > up with an abort but some foreign transactions are already committed. > > > > > > Ok, I understand what you are aiming. > > > > > > It is apparently out of the focus of the two-phase commit > > > protocol. Each FDW server can try to keep the contract as far as its > > > ability reaches, but in the end such kind of failure is > > > inevitable. Even if we require FDW developers not to respond until a > > > 2pc-commit succeeds, that just leads the whole FDW-cluster to freeze > > > even not in an extremely bad case. > > > > > > We have no other choices than shutting the server down (then the > > > succeeding server start removes the garbage commits) or continueing > > > working leaving some information in a system storage (or reverting the > > > garbage commits). What we can do in that case is to provide a > > > automated way to resolve the inconsistency. > > > > > > > Also, what if the backend process failed to commit the local > > > > transaction? Since it already committed all foreign transactions it > > > > cannot ensure the global atomicity in this case too. 
Therefore, I > > > > think we should commit the distributed transactions in the following > > > > sequence: > > > > > > Ditto. It's out of the range of 2pc. Using p2c for local transaction > > > could reduce that kind of failure but I'm not sure. 3pc, 4pc ...npc > > > could reduce the probability but can't elimite failure cases. > > > > IMO the problems I mentioned arise from the fact that the above > > sequence doesn't really follow the 2pc protocol in the first place. > > > > We can think of the fact that we commit the local transaction without > > preparation while preparing foreign transactions as that we’re using > > the 2pc with last resource transaction optimization (or last agent > > optimization)[1]. That is, we prepare all foreign transactions first > > and the local node is always the last resource to process. At this > > time, the outcome of the distributed transaction completely depends on > > the fate of the last resource (i.g., the local transaction). If it > > fails, the distributed transaction must be abort by rolling back > > prepared foreign transactions. OTOH, if it succeeds, all prepared > > foreign transaction must be committed. Therefore, we don’t need to > > prepare the last resource and can commit it. In this way, if we want > > There are cases of commit-failure of a local transaction caused by > too-many notifications or by serialization failure. Yes, even if that happens we are still able to rollback all foreign transactions. > > > to commit the local transaction without preparation, the local > > transaction must be committed at last. But since the above sequence > > doesn’t follow this protocol, we will have such problems. I think if > > we follow the 2pc properly, such basic failures don't happen. > > True. But I haven't suggested that sequence. Okay, I might have missed your point. Could you elaborate on the idea you mentioned before, "I think remote-commits should be performed before local commit passes the point-of-no-return"? 
> > > > > 1. Prepare all foreign transactions. > > > > 2. Commit the local transaction. > > > > 3. Commit the all prepared foreign transactions. > > > > > > > > But this is still not a perfect solution. If we have the backend > > > > > > 2pc is not a perfect solution in the first place. Attaching a similar > > > phase to it cannot make it "perfect". > > > > > > > process call the commit routine and an error happens during executing > > > > the commit routine of an FDW (i.g., at step 3) it's too late to report > > > > an error to the client because we already committed the local > > > > transaction. So the current solution is to have a background process > > > > commit the foreign transactions so that the backend can just wait > > > > without the possibility of errors. > > > > > > Whatever process tries to complete a transaction, the client must wait > > > for the transaction to end and anyway that's just a freeze in the > > > client's view, unless you intended to respond to local commit before > > > all participant complete. > > > > Yes, but the point of using a separate process is that even if FDW > > code raises an error, the client wanting for transaction resolution > > doesn't get it and it's interruptible. > > > > [1] https://docs.oracle.com/cd/E13222_01/wls/docs91/jta/llr.html > > I don't get the point. If FDW-commit is called on the same process, an > error from FDW-commit outright leads to the failure of the current > commit. Isn't "the client wanting for transaction resolution" the > client of the leader process of the 2pc-commit in the same-process > model? > > I should missing something, but postgres_fdw allows query cancelation > at commit time. (But I think it is depends on timing whether the > remote commit is completed or aborted.). Perhaps the feature was > introduced after the project started? 
> > commit ae9bfc5d65123aaa0d1cca9988037489760bdeae > > Author: Robert Haas <rhaas@postgresql.org> > > Date: Wed Jun 7 15:14:55 2017 -0400 > > > > postgres_fdw: Allow cancellation of transaction control commands. > > I thought that we are discussing on fdw-errors during the 2pc-commit > phase. > Yes, I'm also discussing fdw-errors during the 2pc-commit phase that happens after committing the local transaction. Even if FDW-commit raises an error due to the user's cancel request or whatever reason during committing the prepared foreign transactions, it's too late. The client will get an error like "ERROR: canceling statement due to user request" and would think the transaction is aborted, but that's not true: the local transaction is already committed. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
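The ordering questioned in this exchange (prepare all foreign transactions, commit them all, then commit locally) can be exercised in a toy run. This is hypothetical Python, not patch code; `Participant` and `commit_foreign_first` are invented names. It shows the failure mode Sawada-san describes: once one participant's COMMIT PREPARED has succeeded, an error from another FDW leaves the cluster in mixed states that cannot be rolled back:

```python
# Toy run of "prepare all -> commit all prepared foreign txns -> commit
# local".  If one FDW errors in the middle of the commit-prepared loop,
# the earlier participants are already committed and cannot be rolled
# back, so the participants end up in inconsistent states.

class Participant:
    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.state = "active"

    def prepare(self):
        self.state = "prepared"

    def commit_prepared(self):
        if self.fail_on_commit:
            raise ConnectionError(f"{self.name}: lost connection")
        self.state = "committed"

def commit_foreign_first(participants):
    for p in participants:
        p.prepare()
    try:
        for p in participants:
            p.commit_prepared()
    except ConnectionError:
        pass  # too late: fs1 is committed, the rest are only prepared
    return [p.state for p in participants]

ps = [Participant("fs1"),
      Participant("fs2", fail_on_commit=True),
      Participant("fs3")]
print(commit_foreign_first(ps))   # -> ['committed', 'prepared', 'prepared']
```

As Horiguchi-san notes, 2PC itself cannot repair this; the leftover prepared transactions have to be resolved by some out-of-band mechanism.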
At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > There are cases of commit-failure of a local transaction caused by > > too-many notifications or by serialization failure. > > Yes, even if that happens we are still able to rollback all foreign > transactions. Mmm. I'm confused. If this is about the 2pc-commit-request (or prepare) phase, we can roll back the remote transactions. But I think we're focusing on the 2pc-commit phase; remote transactions that have already been 2pc-committed can no longer be rolled back. > > > to commit the local transaction without preparation, the local > > > transaction must be committed at last. But since the above sequence > > > doesn’t follow this protocol, we will have such problems. I think if > > > we follow the 2pc properly, such basic failures don't happen. > > > > True. But I haven't suggested that sequence. > > Okay, I might have missed your point. Could you elaborate on the idea > you mentioned before, "I think remote-commits should be performed > before local commit passes the point-of-no-return"? It is simply the condition that we can ERROR-out from CommitTransaction. I thought that when you said "we cannot ERROR-out" you meant "since that is raised to FATAL", but it seems to me that both of you are looking at another aspect. If the aspect is "how to complete the all-prepared 2pc transaction at all costs", I'd say "there's a fundamental limitation". Although I'm not sure what you mean exactly by prohibiting errors from fdw routines, if that meant "the API can fail, but must not raise an exception", that policy is enforced by setting a critical section. However, if it were "the API mustn't fail", that cannot be realized, I believe. > > I thought that we are discussing on fdw-errors during the 2pc-commit > phase.
> > > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase > that happens after committing the local transaction. > > Even if FDW-commit raises an error due to the user's cancel request or > whatever reason during committing the prepared foreign transactions, > it's too late. The client will get an error like "ERROR: canceling > statement due to user request" and would think the transaction is > aborted but it's not true, the local transaction is already committed. By the way, I found that I misread the patch: in v26-0002, AtEOXact_FdwXact() is actually called after the point-of-no-return. What is the reason for that placement? We can error-out before changing the state to TRANS_COMMIT. And if any of the remotes ended with a 2pc-commit (not prepare phase) failure, consistency of the commit is no longer guaranteed, so we have no choice other than shutting down the server or continuing to run while allowing the inconsistency. What do we want in that case? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrot> > There are cases of commit-failure ofa local transaction caused by > > > too-many notifications or by serialization failure. > > > > Yes, even if that happens we are still able to rollback all foreign > > transactions. > > Mmm. I'm confused. If this is about 2pc-commit-request(or prepare) > phase, we can rollback the remote transactions. But I think we're > focusing 2pc-commit phase. remote transaction that has already > 2pc-committed, they can be no longer rollback'ed. Did you mention a failure of local commit, right? With the current approach, we prepare all foreign transactions first and then commit the local transaction. After committing the local transaction we commit the prepared foreign transactions. So suppose a serialization failure happens during committing the local transaction, we still are able to roll back foreign transactions. The check of serialization failure of the foreign transactions has already been done at the prepare phase. > > > > > to commit the local transaction without preparation, the local > > > > transaction must be committed at last. But since the above sequence > > > > doesn’t follow this protocol, we will have such problems. I think if > > > > we follow the 2pc properly, such basic failures don't happen. > > > > > > True. But I haven't suggested that sequence. > > > > Okay, I might have missed your point. Could you elaborate on the idea > > you mentioned before, "I think remote-commits should be performed > > before local commit passes the point-of-no-return"? > > It is simply the condition that we can ERROR-out from > CommitTransaction. 
I thought that when you say like "we cannot > ERROR-out" you meant "since that is raised to FATAL", but it seems to > me that both of you are looking another aspect. > > If the aspect is "what to do complete the all-prepared p2c transaction > at all costs", I'd say "there's a fundamental limitaion". Although > I'm not sure what you mean exactly by prohibiting errors from fdw > routines , if that meant "the API can fail, but must not raise an > exception", that policy is enforced by setting a critical > section. However, if it were "the API mustn't fail", that cannot be > realized, I believe. When I say "we cannot error-out" it means it's too late. What I'd like to prevent is that the backend process returns an error to the client after committing the local transaction. Because it will mislead the user. > > > > I thought that we are discussing on fdw-errors during the 2pc-commit > > > phase. > > > > > > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase > > that happens after committing the local transaction. > > > > Even if FDW-commit raises an error due to the user's cancel request or > > whatever reason during committing the prepared foreign transactions, > > it's too late. The client will get an error like "ERROR: canceling > > statement due to user request" and would think the transaction is > > aborted but it's not true, the local transaction is already committed. > > By the way I found that I misread the patch. in v26-0002, > AtEOXact_FdwXact() is actually called after the > point-of-no-return. What is the reason for the place? We can > error-out before changing the state to TRANS_COMMIT. > Are you referring to v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If so, the patch doesn't implement 2pc. I think we can commit the foreign transaction before changing the state to TRANS_COMMIT but in any case it cannot ensure atomic commit. 
It just adds both commit and rollback transaction APIs so that FDW can control transactions by using these APIs, not by XactCallback. > And if any of the remotes ended with 2pc-commit (not prepare phase) > failure, consistency of the commit is no longer guaranteed so we have > no choice other than shutting down the server, or continuing running > allowing the incosistency. What do we want in that case? I think it depends on the failure. If 2pc-commit failed due to a network connection failure or a server crash, we would need to try again later. We normally expect the prepared transaction to be committed with no issue, but in case it could not be, I think we can leave the choice to the user: resolve it manually after recovery, give up, etc. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
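The retry-later behavior described here, together with the earlier separate-process idea, can be modeled roughly as follows. This is a control-flow sketch only (invented names; the real patch uses a background worker, not a Python thread): the backend queues prepared foreign transactions for a resolver, so an FDW failure becomes a retry inside the resolver rather than an ERROR reported on the backend's already-committed transaction:

```python
# Rough model of handing 2pc-commit off to a resolver process: the
# resolver consumes commit jobs from a queue; a failure is recorded for
# retry instead of being raised to the committing backend.
import queue
import threading

def resolver(work, done):
    while True:
        xact = work.get()
        if xact is None:          # shutdown sentinel
            break
        try:
            xact()                # COMMIT PREPARED on the remote
            done.put("resolved")
        except Exception as e:    # logged and retried later, never raised
            done.put(f"retry-later: {e}")

work, done = queue.Queue(), queue.Queue()
t = threading.Thread(target=resolver, args=(work, done))
t.start()

def unreachable_remote():
    raise IOError("remote down")

work.put(lambda: None)            # a commit that succeeds
work.put(unreachable_remote)      # a commit that fails
print(done.get())                 # -> resolved
print(done.get())                 # -> retry-later: remote down
work.put(None)
t.join()
```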
(v26 fails on the current master) At Wed, 14 Oct 2020 13:52:49 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > There are cases of commit-failure of a local transaction caused by > > > > too-many notifications or by serialization failure. > > > > > > Yes, even if that happens we are still able to rollback all foreign > > > transactions. > > > > Mmm. I'm confused. If this is about 2pc-commit-request(or prepare) > > phase, we can rollback the remote transactions. But I think we're > > focusing 2pc-commit phase. remote transaction that has already > > 2pc-committed, they can be no longer rollback'ed. > > Did you mention a failure of local commit, right? With the current > approach, we prepare all foreign transactions first and then commit > the local transaction. After committing the local transaction we > commit the prepared foreign transactions. So suppose a serialization > failure happens during committing the local transaction, we still are > able to roll back foreign transactions. The check of serialization > failure of the foreign transactions has already been done at the > prepare phase. Understood. > > > > > to commit the local transaction without preparation, the local > > > > > transaction must be committed at last. But since the above sequence > > > > > doesn’t follow this protocol, we will have such problems. I think if > > > > > we follow the 2pc properly, such basic failures don't happen. > > > > > > > > True. But I haven't suggested that sequence. > > > > > > Okay, I might have missed your point. Could you elaborate on the idea > > > you mentioned before, "I think remote-commits should be performed > > > before local commit passes the point-of-no-return"?
> > > > It is simply the condition that we can ERROR-out from > > CommitTransaction. I thought that when you say like "we cannot > > ERROR-out" you meant "since that is raised to FATAL", but it seems to > > me that both of you are looking another aspect. > > > > If the aspect is "what to do complete the all-prepared p2c transaction > > at all costs", I'd say "there's a fundamental limitaion". Although > > I'm not sure what you mean exactly by prohibiting errors from fdw > > routines , if that meant "the API can fail, but must not raise an > > exception", that policy is enforced by setting a critical > > section. However, if it were "the API mustn't fail", that cannot be > > realized, I believe. > > When I say "we cannot error-out" it means it's too late. What I'd like > to prevent is that the backend process returns an error to the client > after committing the local transaction. Because it will mislead the > user. Anyway we don't do anything that can fail after changing state to TRANS_COMMIT. So we cannot run fdw-2pc-commit after that since it cannot be failure-proof. if we do them before the point we cannot ERROR-out after local commit completes. > > > > I thought that we are discussing on fdw-errors during the 2pc-commit > > > > phase. > > > > > > > > > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase > > > that happens after committing the local transaction. > > > > > > Even if FDW-commit raises an error due to the user's cancel request or > > > whatever reason during committing the prepared foreign transactions, > > > it's too late. The client will get an error like "ERROR: canceling > > > statement due to user request" and would think the transaction is > > > aborted but it's not true, the local transaction is already committed. > > > > By the way I found that I misread the patch. in v26-0002, > > AtEOXact_FdwXact() is actually called after the > > point-of-no-return. What is the reason for the place? 
We can > > error-out before changing the state to TRANS_COMMIT. > > > > Are you referring to > v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If > so, the patch doesn't implement 2pc. I think we can commit the foreign Ah, I guessed that the trigger points of PREPARE and COMMIT that are inserted by 0002 won't be moved by the following patches, so the direction of my discussion isn't changed by that fact. > transaction before changing the state to TRANS_COMMIT but in any case > it cannot ensure atomic commit. It just adds both commit and rollback I guess that you have the local-commit-failure case in mind? Couldn't we internally prepare the local transaction and then follow the correct 2pc protocol involving the local transaction? (I'm looking at v26-0008) > transaction APIs so that FDW can control transactions by using these > API, not by XactCallback. > > And if any of the remotes ended with 2pc-commit (not prepare phase) > > failure, consistency of the commit is no longer guaranteed so we have > > no choice other than shutting down the server, or continuing running > > allowing the incosistency. What do we want in that case? > > I think it depends on the failure. If 2pc-commit failed due to network > connection failure or the server crash, we would need to try again > later. We normally expect the prepared transaction is able to be > committed with no issue but in case it could not, I think we can leave > the choice for the user: resolve it manually after recovered, give up > etc. Understood. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, 14 Oct 2020 at 17:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > (v26 fails on the current master) Thanks, I'll update the patch. > > At Wed, 14 Oct 2020 13:52:49 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > On Wed, 14 Oct 2020 at 13:19, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > > At Wed, 14 Oct 2020 12:09:34 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > > > > On Wed, 14 Oct 2020 at 10:16, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > > There are cases of commit-failure of a local transaction caused by > > > > > too-many notifications or by serialization failure. > > > > > > > > Yes, even if that happens we are still able to rollback all foreign > > > > transactions. > > > > > > Mmm. I'm confused. If this is about 2pc-commit-request(or prepare) > > > phase, we can rollback the remote transactions. But I think we're > > > focusing 2pc-commit phase. remote transaction that has already > > > 2pc-committed, they can be no longer rollback'ed. > > > > Did you mention a failure of local commit, right? With the current > > approach, we prepare all foreign transactions first and then commit > > the local transaction. After committing the local transaction we > > commit the prepared foreign transactions. So suppose a serialization > > failure happens during committing the local transaction, we still are > > able to roll back foreign transactions. The check of serialization > > failure of the foreign transactions has already been done at the > > prepare phase. > > Understood. > > > > > > > to commit the local transaction without preparation, the local > > > > > > transaction must be committed at last. But since the above sequence > > > > > > doesn’t follow this protocol, we will have such problems. I think if > > > > > > we follow the 2pc properly, such basic failures don't happen. > > > > > > > > > > True. But I haven't suggested that sequence.
> > > > > > > > Okay, I might have missed your point. Could you elaborate on the idea > > > > you mentioned before, "I think remote-commits should be performed > > > > before local commit passes the point-of-no-return"? > > > > > > It is simply the condition that we can ERROR-out from > > > CommitTransaction. I thought that when you say like "we cannot > > > ERROR-out" you meant "since that is raised to FATAL", but it seems to > > > me that both of you are looking another aspect. > > > > > > If the aspect is "what to do complete the all-prepared p2c transaction > > > at all costs", I'd say "there's a fundamental limitaion". Although > > > I'm not sure what you mean exactly by prohibiting errors from fdw > > > routines , if that meant "the API can fail, but must not raise an > > > exception", that policy is enforced by setting a critical > > > section. However, if it were "the API mustn't fail", that cannot be > > > realized, I believe. > > > > When I say "we cannot error-out" it means it's too late. What I'd like > > to prevent is that the backend process returns an error to the client > > after committing the local transaction. Because it will mislead the > > user. > > Anyway we don't do anything that can fail after changing state to > TRANS_COMMIT. So we cannot run fdw-2pc-commit after that since it > cannot be failure-proof. if we do them before the point we cannot > ERROR-out after local commit completes. > > > > > > I thought that we are discussing on fdw-errors during the 2pc-commit > > > > > phase. > > > > > > > > > > > > > Yes, I'm also discussing on fdw-errors during the 2pc-commit phase > > > > that happens after committing the local transaction. > > > > > > > > Even if FDW-commit raises an error due to the user's cancel request or > > > > whatever reason during committing the prepared foreign transactions, > > > > it's too late. 
The client will get an error like "ERROR: canceling > > > > statement due to user request" and would think the transaction is > > > > aborted but it's not true, the local transaction is already committed. > > > > > > By the way I found that I misread the patch. in v26-0002, > > > AtEOXact_FdwXact() is actually called after the > > > point-of-no-return. What is the reason for the place? We can > > > error-out before changing the state to TRANS_COMMIT. > > > > > > > Are you referring to > > v26-0002-Introduce-transaction-manager-for-foreign-transa.patch? If > > so, the patch doesn't implement 2pc. I think we can commit the foreign > > Ah, I guessed that the trigger points of PREPARE and COMMIT that are > inserted by 0002 won't be moved by the following patches. So the > direction of my discussion doesn't change by the fact. > > > transaction before changing the state to TRANS_COMMIT but in any case > > it cannot ensure atomic commit. It just adds both commit and rollback > > I guess that you have the local-commit-failure case in mind? Couldn't > we internally prepare the local transaction then following the correct > p2c protocol involving the local transaction? (I'm looking v26-0008) Yes, we could. But as I mentioned before if we always commit the local transaction last, we don't necessarily need to prepare the local transaction. If we prepared the local transaction, I think we would be able to allow FDW's commit routine to raise an error even during 2pc-commit, but only for the first time. Once we committed any one of the involved transactions including the local transaction and foreign transactions, the commit routine must not raise an error during 2pc-commit for the same reason; it's too late. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
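Sawada-san's last point above (an error during 2pc-commit can be reported as an abort only until the first participant, local or foreign, has been committed) can be sketched as a toy decision rule. This is illustrative Python with invented names, not patch code:

```python
# Toy sketch: with every participant (including the local transaction)
# prepared, a commit-phase error can be turned into a clean abort only
# while nothing has been committed yet; after the first COMMIT PREPARED
# succeeds, the transaction is in doubt and must be resolved later.

class P:
    def __init__(self, name, fail=False):
        self.name, self.fail, self.state = name, fail, "active"

    def prepare(self):
        self.state = "prepared"

    def commit(self):
        if self.fail:
            raise IOError(self.name + " unreachable")
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def commit_all_prepared(parts):
    for p in parts:
        p.prepare()
    committed = 0
    for p in parts:
        try:
            p.commit()
            committed += 1
        except IOError:
            if committed == 0:          # nothing committed yet: safe abort
                for q in parts:
                    q.rollback()
                return "aborted"
            return "in-doubt"           # too late to report an abort
    return "committed"

print(commit_all_prepared([P("local", fail=True), P("fs1")]))  # -> aborted
print(commit_all_prepared([P("local"), P("fs1", fail=True)]))  # -> in-doubt
```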
On Mon, 12 Oct 2020 at 17:19, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > I was thinking to have a GUC timeout parameter like statement_timeout. > > The backend waits for the setting value when resolving foreign > > transactions. > > Me too. > > > > But this idea seems different. FDW can set its timeout > > via a transaction timeout API, is that right? > > I'm not perfectly sure about how the TM (application server) works, but probably no. The TM has a configuration parameter for transaction timeout, and the TM calls XAResource.setTransactionTimeout() with that or a smaller value for the argument. > > > > But even if FDW can set > > the timeout using a transaction timeout API, the problem that client > > libraries for some DBMS don't support interruptible functions still > > remains. The user can set a short time to the timeout but it also > > leads to unnecessary timeouts. Thoughts? > > Unfortunately, I'm afraid we can do nothing about it. If the DBMS's client library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a function that cancels in-progress processing), then the Postgres user just finds that he can't cancel queries (just like we experienced with odbc_fdw.) So the idea of using another process to commit prepared foreign transactions seems better also in terms of this point. Even if a DBMS client library doesn't support query cancellation, the transaction commit can return control to the client when the user presses Ctrl-C, as the backend process is just sleeping in WaitLatch() (it's similar to synchronous replication). Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
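The "backend just sleeps, so cancellation works" argument can be illustrated with a small toy (plain Python; `wait_for_resolution` and the events are invented stand-ins for the backend's latch wait and interrupt check, not PostgreSQL code):

```python
# Toy version of an interruptible wait for foreign transaction
# resolution: the backend waits on an event with a short timeout
# (cf. WaitLatch) and checks for a cancel request between wakeups
# (cf. CHECK_FOR_INTERRUPTS), so Ctrl-C returns control to the client
# even though the remote client library itself is not interruptible.
import threading

def wait_for_resolution(resolved, cancel, poll_s=0.01):
    """Wait until the resolver signals completion or the user cancels."""
    while True:
        if resolved.wait(poll_s):     # resolver finished the remote commit
            return "resolved"
        if cancel.is_set():           # user pressed Ctrl-C
            return "cancelled-wait"   # the resolver keeps working

resolved, cancel = threading.Event(), threading.Event()
cancel.set()                          # simulate Ctrl-C from the client
print(wait_for_resolution(resolved, cancel))   # -> cancelled-wait
```

Note that, as discussed upthread, cancelling the wait only abandons the wait; the foreign commit itself continues in the resolver, so the client must not interpret the cancellation as an abort.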
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > Unfortunately, I'm afraid we can do nothing about it. If the DBMS's client > library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a > function that cancel processing in pgorogss), then the Postgres user just finds > that he can't cancel queries (just like we experienced with odbc_fdw.) > > So the idea of using another process to commit prepared foreign > transactions seems better also in terms of this point. Even if a DBMS > client library doesn’t support query cancellation, the transaction > commit can return the control to the client when the user press ctl-c > as the backend process is just sleeping using WaitLatch() (it’s > similar to synchronous replication) I have to say that's nitpicking. I believe almost nobody does, or cares about, canceling commits, at the expense of impractical performance due to non-parallelism, serial execution in each resolver, and context switches. Also, FDW is not cancellable in general. It makes no sense to care only about commit. (Fortunately, postgres_fdw is cancellable in any way.) Regards Takayuki Tsunakawa
On Mon, 19 Oct 2020 at 14:39, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > Unfortunately, I'm afraid we can do nothing about it. If the DBMS's client > > library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or provide a > > function that cancel processing in pgorogss), then the Postgres user just finds > > that he can't cancel queries (just like we experienced with odbc_fdw.) > > > > So the idea of using another process to commit prepared foreign > > transactions seems better also in terms of this point. Even if a DBMS > > client library doesn’t support query cancellation, the transaction > > commit can return the control to the client when the user press ctl-c > > as the backend process is just sleeping using WaitLatch() (it’s > > similar to synchronous replication) > > I have to say that's nitpicking. I believe almost nobody does, or cares about, canceling commits, Really? I don't think so. I think it's terrible that the query gets stuck for a long time and we cannot do anything other than wait until a crashed foreign server is restored. We can have a timeout, but I don't think every user wants to use the timeout, or the user might want to set the timeout to a relatively large value out of concern about misdetection. I guess synchronous replication had similar concerns, so it has a similar mechanism. > at the expense of impractical performance due to non-parallelism, serial execution in each resolver, and context switches. I have never said that we're going to live with serial execution in each resolver and non-parallelism. I've been repeatedly saying that it would be possible to improve this feature over the releases to get good performance even if we use a separate background process. Using a background process to commit is the only option to support interruptible foreign transaction resolution for now, whereas there are some ideas for performance improvements.
I think we don't have enough discussion on how we can improve the idea of using a separate process, how much performance will improve, and how feasible it is. It's not too late to reject that idea after the discussion. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > On Mon, 19 Oct 2020 at 14:39, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > I have to say that's nitpicking. I believe almost nobody does, or cares about, > canceling commits, > > Really? I don't think so. I think it's terrible that the query gets > stuck for a long time and we cannot do anything other than wait until a > crashed foreign server is restored. We can have a timeout but I don't > think every user wants to use the timeout, or the user might want to > set the timeout to a relatively large value out of concern about > misdetection. I guess synchronous replication had similar concerns so > it has a similar mechanism. Really. I thought we were talking about canceling commits with Ctrl + C as you referred to, right? I couldn't imagine, in production environments where many sessions are running transactions concurrently, how the user (DBA) would want to or could cancel each stuck session during commit one by one with Ctrl + C by hand. I haven't seen such a feature exist or be considered crucial that enables the user (administrator) to cancel running processing with Ctrl + C from the side. Rather, setting appropriate timeouts is the current sound system design, isn't it? It spans many areas - TCP/IP, heartbeats of load balancers and clustering software, request and response to application servers and database servers, etc. I sympathize with your concern that users may not be confident about their settings. But that's the current practice, unfortunately. > > at the expense of impractical performance due to non-parallelism, serial > execution in each resolver, and context switches. > > I have never said that we're going to live with serial execution in > each resolver and non-parallelism. I've been repeatedly saying that it > would be possible that we improve this feature over the releases to > get a good performance even if we use a separate background process.
IIRC, I haven't seen a reasonable design based on a separate process that handles commits during normal operation. What I heard is to launch as many resolvers as the client sessions, but that consumes too much resource, as I said. > Using a background process to commit is the only option to support > interruptible foreign transaction resolution for now whereas there are > some ideas for performance improvements. A practical solution is the timeout for FDWs in general, as in application servers. postgres_fdw can benefit from Ctrl + C as well. > I think we don't have enough > discussion on how we can improve the idea of using a separate process > and how much performance will improve and how possible it is. It's not > late to reject that idea after the discussion. Yeah, I agree that the discussion is not enough yet. In other words, the design has not reached the quality for the first release yet. We should try to avoid using "Hopefully, we should be able to improve it in the next release (I haven't seen the design brought to light, though)" as an excuse for getting a half-baked patch committed that does not offer practical quality. I saw many developers' patches rejected because of insufficient performance, e.g. even a 0.8% performance impact. (I'm one of those developers, actually...) I have been feeling this community is rigorous about performance. We have to be sincere. Regards Takayuki Tsunakawa
On Mon, Oct 19, 2020 at 2:37 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > Really. I thought we were talking about canceling commits with Ctrl + C as you referred to, right? I couldn't imagine, in production environments where many sessions are running transactions concurrently, how the user (DBA) would want to or could cancel each stuck session during commit one by one with Ctrl + C by hand. I haven't seen such a feature exist or be considered crucial that enables the user (administrator) to cancel running processing with Ctrl + C from the side. Using pg_cancel_backend() and pg_terminate_backend(), a DBA can cancel a running query in any backend or terminate a backend. For either to work, the backend needs to be interruptible. IIRC, Robert had made an effort to make postgres_fdw interruptible a few years back. -- Best Wishes, Ashutosh Bapat
On Mon, 19 Oct 2020 at 20:37, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Mon, Oct 19, 2020 at 2:37 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > Really. I thought we were talking about canceling commits with Ctrl + C as you referred to, right? I couldn't imagine, in production environments where many sessions are running transactions concurrently, how the user (DBA) would want to or could cancel each stuck session during commit one by one with Ctrl + C by hand. I haven't seen such a feature exist or be considered crucial that enables the user (administrator) to cancel running processing with Ctrl + C from the side. > > Using pg_cancel_backend() and pg_terminate_backend(), a DBA can cancel > a running query in any backend or terminate a backend. For either to > work, the backend needs to be interruptible. IIRC, Robert had made an > effort to make postgres_fdw interruptible a few years back. Right. Also, we discussed having a timeout on the core side, but I'm concerned that the timeout also might not work if the wait is not interruptible. While using a timeout is a good idea, I have to think there is a certain number of users who won't use this timeout, just as there is a certain number of users who don't use timeouts such as statement_timeout. We must not ignore such users, and it might not be advisable to design a feature that ignores them. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> > Using pg_cancel_backend() and pg_terminate_backend(), a DBA can cancel > a running query in any backend or terminate a backend. For either to > work, the backend needs to be interruptible. IIRC, Robert had made an > effort to make postgres_fdw interruptible a few years back. Yeah, I know those functions. Sawada-san was talking about Ctrl + C, so I responded accordingly. Also, how can the DBA find sessions to run those functions against? Can he tell if a session is connected to, or running SQL against, a given foreign server? Can he terminate or cancel, with one SQL command, all sessions that are stuck accessing a particular foreign server? Furthermore, FDW is not cancellable in general. So, I don't see a point in trying hard to make only commit be cancelable. Regards Takayuki Tsunakawa
At Tue, 20 Oct 2020 02:44:09 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> > > Using pg_cancel_backend() and pg_terminate_backend(), a DBA can cancel > > a running query in any backend or terminate a backend. For either to > > work, the backend needs to be interruptible. IIRC, Robert had made an > > effort to make postgres_fdw interruptible a few years back. > > Yeah, I know those functions. Sawada-san was talking about Ctrl + C, so I responded accordingly. > > Also, how can the DBA find sessions to run those functions against? Can he tell if a session is connected to, or running SQL against, a given foreign server? Can he terminate or cancel, with one SQL command, all sessions that are stuck accessing a particular foreign server? I don't think the inability to cancel all sessions at once can be a reason not to allow operators to cancel a stuck session. > Furthermore, FDW is not cancellable in general. So, I don't see a point in trying hard to make only commit be cancelable. I think that it is quite important that operators can cancel any process that has been stuck for a long time. Furthermore, postgres_fdw is more likely to get stuck since the network is involved, so the usefulness of that feature would be higher. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > I don't think the inability to cancel all sessions at once can be a > reason not to allow operators to cancel a stuck session. Yeah, I didn't mean to discount the ability to cancel queries. I just want to confirm how the user can use the cancellation in practice. I didn't see how the user can use the cancellation in the FDW framework, so I asked about it. We have to think about the user's context if we regard canceling commits as important. > > Furthermore, FDW is not cancellable in general. So, I don't see a point in > trying hard to make only commit be cancelable. > > I think that it is quite important that operators can cancel any > process that has been stuck for a long time. Furthermore, postgres_fdw > is more likely to get stuck since the network is involved, so the usefulness > of that feature would be higher. But lower than practical performance during normal operation. BTW, speaking of the network, how can postgres_fdw respond quickly to a cancel request when libpq is waiting for a reply from a down foreign server? Can the user continue to use that session after cancellation? Regards Takayuki Tsunakawa
On Tue, 20 Oct 2020 at 13:23, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > I don't think the inability to cancel all sessions at once can be a > > reason not to allow operators to cancel a stuck session. > > Yeah, I didn't mean to discount the ability to cancel queries. I just want to confirm how the user can use the cancellation in practice. I didn't see how the user can use the cancellation in the FDW framework, so I asked about it. We have to think about the user's context if we regard canceling commits as important. > I think it doesn't matter whether it's in the FDW framework or not. The user normally doesn't care which backend processes are connecting to foreign servers. They will attempt to cancel the query as always if they realize that a backend got stuck. There are surely plenty of users who use query cancellation. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Tue, 20 Oct 2020 at 13:23, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > > I don't think the inability to cancel all sessions at once can be a > > > reason not to allow operators to cancel a stuck session. > > > > Yeah, I didn't mean to discount the ability to cancel queries. I just want to confirm how the user can use the cancellation in practice. I didn't see how the user can use the cancellation in the FDW framework, so I asked about it. We have to think about the user's context if we regard canceling commits as important. > > > > I think it doesn't matter whether it's in the FDW framework or not. The user > normally doesn't care which backend processes are connecting to foreign > servers. They will attempt to cancel the query as always if they > realize that a backend got stuck. There are surely plenty of users > who use query cancellation. The most serious impact of the inability to cancel a query on a certain session is that a server restart is required to end such a session. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 20 Oct 2020 04:23:12 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > > Furthermore, FDW is not cancellable in general. So, I don't see a point in > > trying hard to make only commit be cancelable. > > > > I think that it is quite important that operators can cancel any > > process that has been stuck for a long time. Furthermore, postgres_fdw > > is more likely to get stuck since the network is involved, so the usefulness > > of that feature would be higher. > > But lower than practical performance during normal operation. > > BTW, speaking of the network, how can postgres_fdw respond quickly to a cancel request when libpq is waiting for a reply from a down foreign server? Can the user continue to use that session after cancellation? It seems to respond to a statement-cancel signal immediately while waiting for an incoming byte. However, it seems to wait forever while waiting for space in the send buffer. (Does that mean the session will get stuck if it sends a large chunk of bytes while the network is down?) After receiving a signal, it closes the problematic connection. So the local session is usable after that, but the failed remote sessions are closed and new ones are created at the next use. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote in > > I think it doesn't matter whether it's in the FDW framework or not. The user > > normally doesn't care which backend processes are connecting to foreign > > servers. They will attempt to cancel the query as always if they > > realize that a backend got stuck. There are surely plenty of users > > who use query cancellation. > > The most serious impact of the inability to cancel a query on a > certain session is that a server restart is required to end such a > session. OK, as I may be repeating, I didn't deny the need for cancellation. Let's organize the argument. * FDW in general My understanding is that the FDW feature does not stipulate anything about cancellation. In fact, odbc_fdw was uncancelable. What do we do about this? * postgres_fdw Fortunately, it is (should be?) cancelable whatever method we choose for 2PC. So no problem. But is it really cancellable now? What if the libpq call is waiting for a response when the foreign server or network is down? "Inability to cancel requires a database server restart" feels a bit exaggerated, as libpq has the tcp_keepalive* and tcp_user_timeout connection parameters, and even without setting them, the TCP timeout works. Regards Takayuki Tsunakawa
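Since postgres_fdw passes most libpq connection options through to the remote side, the keepalive parameters mentioned above can be set per foreign server in its definition. A hedged example follows; the server name, host, and values are purely illustrative, and tcp_user_timeout requires a libpq/server combination recent enough to support it (PostgreSQL 12 or later):

```sql
CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example', dbname 'app',
             keepalives 'on',            -- enable TCP keepalive probes
             keepalives_idle '30',       -- seconds idle before the first probe
             keepalives_interval '10',   -- seconds between probes
             keepalives_count '3',       -- failed probes before the link is declared dead
             tcp_user_timeout '30000');  -- ms unacked data may linger (Linux only)
```

With settings like these, a libpq call stuck talking to a dead foreign server fails after a bounded time instead of waiting on the default, much longer TCP timeouts.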
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > It seems to respond to a statement-cancel signal immediately while > waiting for an incoming byte. However, it seems to wait forever while > waiting for space in the send buffer. (Does that mean the session will get > stuck if it sends a large chunk of bytes while the network is down?) What part makes you worried about that? libpq's send processing? I've just examined pgfdw_cancel_query(), too. As below, it uses a hidden 30 second timeout. After all, postgres_fdw also relies on a timeout already. /* * If it takes too long to cancel the query and discard the result, assume * the connection is dead. */ endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), 30000); > After receiving a signal, it closes the problematic connection. So the > local session is usable after that, but the failed remote sessions are > closed and new ones are created at the next use. I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong place? /* * If connection is already unsalvageable, don't touch it * further. */ if (entry->changing_xact_state) break; Regards Takayuki Tsunakawa
On Tue, 20 Oct 2020 at 16:54, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > At Tue, 20 Oct 2020 15:53:29 +0900, Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote in > > > I think it doesn't matter whether it's in the FDW framework or not. The user > > > normally doesn't care which backend processes are connecting to foreign > > > servers. They will attempt to cancel the query as always if they > > > realize that a backend got stuck. There are surely plenty of users > > > who use query cancellation. > > > > The most serious impact of the inability to cancel a query on a > > certain session is that a server restart is required to end such a > > session. > > OK, as I may be repeating, I didn't deny the need for cancellation. So what's your opinion? > Let's organize the argument. > > * FDW in general > My understanding is that the FDW feature does not stipulate anything about cancellation. In fact, odbc_fdw was uncancelable. What do we do about this? > > * postgres_fdw > Fortunately, it is (should be?) cancelable whatever method we choose for 2PC. So no problem. > But is it really cancellable now? What if the libpq call is waiting for a response when the foreign server or network is down? I don't think we need to stipulate query cancellation. Anyway, I guess neither the fact that we don't stipulate anything about query cancellation now nor the fact that postgres_fdw might not be cancellable in some situations is a reason for not supporting query cancellation. If it's desirable behavior and users want it, we need to put in the effort to support it as much as possible, as we've done in postgres_fdw. Some FDWs unfortunately might not be able to support it with their functionality alone, but it would be good if we can achieve it by a combination of PostgreSQL and FDW plugins. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 20 Oct 2020 at 17:56, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > It seems to respond to a statement-cancel signal immediately while > > waiting for an incoming byte. However, it seems to wait forever while > > waiting for space in the send buffer. (Does that mean the session will get > > stuck if it sends a large chunk of bytes while the network is down?) > > What part makes you worried about that? libpq's send processing? > > I've just examined pgfdw_cancel_query(), too. As below, it uses a hidden 30 second timeout. After all, postgres_fdw also relies on a timeout already. It uses the timeout, but it's also cancellable before the timeout. See how we call CHECK_FOR_INTERRUPTS() in pgfdw_get_cleanup_result(). > > > > After receiving a signal, it closes the problematic connection. So the > > local session is usable after that, but the failed remote sessions are > > closed and new ones are created at the next use. > > I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong place? > > /* > * If connection is already unsalvageable, don't touch it > * further. > */ > if (entry->changing_xact_state) > break; > I guess Horiguchi-san referred to the following code in pgfdw_xact_callback(): /* * If the connection isn't in a good idle state, discard it to * recover. Next GetConnection will open a new connection. */ if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || entry->changing_xact_state) { elog(DEBUG3, "discarding connection %p", entry->conn); disconnect_pg_server(entry); } Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 20 Oct 2020 21:22:31 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in > On Tue, 20 Oct 2020 at 17:56, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > > It seems to respond to a statement-cancel signal immediately while > > > waiting for an incoming byte. However, it seems to wait forever while > > > waiting for space in the send buffer. (Does that mean the session will get > > > stuck if it sends a large chunk of bytes while the network is down?) > > > > What part makes you worried about that? libpq's send processing? > > > > I've just examined pgfdw_cancel_query(), too. As below, it uses a hidden 30 second timeout. After all, postgres_fdw also relies on a timeout already. > > It uses the timeout, but it's also cancellable before the timeout. See > how we call CHECK_FOR_INTERRUPTS() in pgfdw_get_cleanup_result(). Yes. And as Sawada-san mentioned, it's not a matter of whether a specific FDW module accepts cancellation or not. It's sufficient that we have one example. Other FDWs will follow postgres_fdw if needed. > > > After receiving a signal, it closes the problematic connection. So the > > > local session is usable after that, but the failed remote sessions are > > > closed and new ones are created at the next use. > > > > I couldn't see that the problematic connection is closed when the cancellation fails... Am I looking at the wrong place? ... > > I guess Horiguchi-san referred to the following code in pgfdw_xact_callback(): > > /* > * If the connection isn't in a good idle state, discard it to > * recover. Next GetConnection will open a new connection. > */ > if (PQstatus(entry->conn) != CONNECTION_OK || > PQtransactionStatus(entry->conn) != PQTRANS_IDLE || > entry->changing_xact_state) > { > elog(DEBUG3, "discarding connection %p", entry->conn); > disconnect_pg_server(entry); > } Right.
Although it's not directly relevant to this discussion, precisely speaking, that part is not visited just after the remote "COMMIT TRANSACTION" fails. If that commit fails or is canceled, an exception is raised while entry->changing_xact_state = true. Then the function is called again within AbortCurrentTransaction() and reaches the above code. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > if (PQstatus(entry->conn) != CONNECTION_OK || > > PQtransactionStatus(entry->conn) != PQTRANS_IDLE || > > entry->changing_xact_state) > > { > > elog(DEBUG3, "discarding connection %p", entry->conn); > > disconnect_pg_server(entry); > > } > > Right. Although it's not directly relevant to this discussion, > precisely, that part is not visited just after the remote "COMMIT > TRANSACTION" failed. If that commit fails or is canceled, an exception > is raised while entry->changing_xact_state = true. Then the function > is called again within AbortCurrentTransaction() and reaches the above > code. Ah, then the connection to the foreign server is closed after failing to cancel the query. Thanks. Regards Takayuki Tsunakawa
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > So what's your opinion? My opinion is simple and has not changed. Let's clarify and refine the design first in the following areas (others may have pointed out something else too, but I don't remember), before going deeper into the code review. * FDW interface New functions so that other FDWs can really implement them. Currently, XA seems to be the only model we can rely on to validate the FDW interface. What FDW function would call what XA function(s)? What should be the arguments for the FDW functions? * Performance Parallel prepares and commits on the client backend. The current implementation is intolerable and should not be the first-release quality. I proposed the idea. (If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit. I want to keep Postgres's reputation.) As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.) That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set. I'm simply imagining what is typically written in database textbooks and research papers. I'm asking this because I saw some discussion in this thread that some new WAL records are added. I was worried that transactions have to write WAL records other than prepare and commit, unlike textbook implementations. Atomic Commit of Distributed Transactions https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions * Query cancellation As you showed, there's no problem with postgres_fdw? The cancelability of FDW in general remains a problem, but that can be a separate undertaking. * Global visibility This is what Amit-san suggested several times -- "design it before reviewing the current patch." I'm a bit optimistic about this and think this FDW 2PC can be implemented separately as a pure enhancement of FDW. But I also understand his concern. If your (our?)
aim is to use this FDW 2PC for sharding, we may have to design the combination of 2PC and visibility first. > I don't think we need to stipulate query cancellation. Anyway, I > guess neither the fact that we don't stipulate anything about query > cancellation now nor the fact that postgres_fdw might not be cancellable in > some situations is a reason for not supporting query > cancellation. If it's desirable behavior and users want it, we need > to put in the effort to support it as much as possible, as we've done in > postgres_fdw. Some FDWs unfortunately might not be able to support it > with their functionality alone, but it would be good if we can achieve > it by a combination of PostgreSQL and FDW plugins. Let me comment on this a bit; this is a somewhat dangerous idea, I'm afraid. We need to pay attention to the FDW interface and its documentation so that FDW developers can implement what we consider important -- query cancellation in your discussion. "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't use. That's what Tomas Vondra pointed out several years ago. Regards Takayuki Tsunakawa
On Wed, Oct 21, 2020 at 3:03 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > So what's your opinion? > > * Global visibility > This is what Amit-san suggested several times -- "design it before reviewing the current patch." I'm a bit optimistic about this and think this FDW 2PC can be implemented separately as a pure enhancement of FDW. But I also understand his concern. If your (our?) aim is to use this FDW 2PC for sharding, > As far as I understand, that is the goal for which this is a step. For example, see the wiki [1]. I understand that the wiki is not the final thing, but I have seen other places as well where there is a mention of FDW-based sharding, and I feel this is the reason why many people are trying to improve this area. That is why I suggested having an upfront design of global visibility and a deadlock detector along with this work. [1] - https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding -- With Regards, Amit Kapila.
On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > So what's your opinion? > > My opinion is simple and has not changed. Let's clarify and refine the design first in the following areas (others may have pointed out something else too, but I don't remember), before going deeper into the code review. > > * FDW interface > New functions so that other FDWs can really implement them. Currently, XA seems to be the only model we can rely on to validate the FDW interface. > What FDW function would call what XA function(s)? What should be the arguments for the FDW functions? I guess since FDW interfaces may be affected by the feature architecture, we can discuss them later. > * Performance > Parallel prepares and commits on the client backend. The current implementation is intolerable and should not be the first-release quality. I proposed the idea. > (If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit. I want to keep Postgres's reputation.) What do you have in mind regarding the implementation of parallel prepare and commit? Given that some FDW plugins don't support asynchronous execution, I guess we need to use parallel workers or something. That is, the backend process launches parallel workers to prepare/commit/rollback foreign transactions in parallel. I don't deny this approach, but it'll definitely make the feature complex and need more code. My point is a small start, keeping the first version simple. Even if we need one or more years for this feature, I think that introducing the simple and minimum functionality as the first version to the core still has benefits.
In this sense, the patch having the backend return without waiting for resolution after the local commit would be a good start as the first version (i.e., up to applying the v26-0006 patch). Anyway, the architecture should be extensible enough for future improvements. For the performance improvements, we will be able to support asynchronous and/or parallel prepare/commit/rollback. Moreover, having multiple resolver processes on one database would also help get better throughput. For the user who needs much better throughput, the user can also choose not to wait for resolution after the local commit, like synchronous_commit = 'local' in replication. > As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki.) That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set. I'm simply imagining what is typically written in database textbooks and research papers. I'm asking this because I saw some discussion in this thread that some new WAL records are added. I was worried that transactions have to write WAL records other than prepare and commit, unlike textbook implementations. > > Atomic Commit of Distributed Transactions > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions Understood. I'll add an explanation about the message flow and disk writes to the wiki page. We need to consider the point of error handling during the resolution of foreign transactions too. > > > I don't think we need to stipulate query cancellation. Anyway I > > guess neither the fact that we don't stipulate anything about query > > cancellation now nor the fact that postgres_fdw might not be cancellable in > > some situations is a reason for not supporting query > > cancellation. If it's desirable behavior and users want it, we need > > to put in the effort to support it as much as possible, as we've done in > > postgres_fdw.
Some FDWs unfortunately might not be able to support it > > with their functionality alone, but it would be good if we can achieve > > it by a combination of PostgreSQL and FDW plugins. > > Let me comment on this a bit; this is a somewhat dangerous idea, I'm afraid. We need to pay attention to the FDW interface and its documentation so that FDW developers can implement what we consider important -- query cancellation in your discussion. "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't use. That's what Tomas Vondra pointed out several years ago. I suspect the story is somewhat different. libpq fortunately supports asynchronous execution, but when it comes to canceling the foreign transaction resolution, I think basically all FDW plugins are in the same situation at this time. We can choose whether to make it cancellable or not. According to the discussion so far, it completely depends on the architecture of this feature. So my point is whether this functionality is worth having for users and whether users want it, not whether postgres_fdw is OK. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 22, 2020 at 10:39 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > So what's your opinion? > > > > My opinion is simple and has not changed. Let's clarify and refine the design first in the following areas (others may have pointed out something else too, but I don't remember), before going deeper into the code review. > > > > * FDW interface > > New functions that other FDWs can really implement. Currently, XA seems to be the only model we can rely on to validate the FDW interface. > > What FDW function would call what XA function(s)? What should be the arguments of the FDW functions? > > Since the FDW interface may be affected by the feature's > architecture, I guess we can discuss it later. > > > * Performance > > Parallel prepares and commits on the client backend. The current implementation is intolerable and is not of first-release quality. I proposed the idea. > > (If you insist you don't want to do anything about this, I have to think you're just rushing the patch commit. I want to keep Postgres's reputation.) > > What do you have in mind regarding the implementation of parallel prepare > and commit? Given that some FDW plugins don't support asynchronous > execution, I guess we need to use parallel workers or something. That > is, the backend process launches parallel workers to > prepare/commit/rollback foreign transactions in parallel. I don't deny > this approach, but it'll definitely make the feature complex and need > more code. > > My point is to start small and keep the first version simple.
Even > if we need one or more years for this feature, I think that > introducing simple, minimal functionality into core as the first version > still has benefits. We will be able to have the > opportunity to get real feedback from users and to fix bugs in the > main infrastructure before making it complex. In this sense, the patch > having the backend return without waiting for resolution after the local > commit would be a good start for the first version (i.e., up to > applying the v26-0006 patch). Anyway, the architecture should be > extensible enough for future improvements. > > For performance improvements, we will be able to support > asynchronous and/or parallel prepare/commit/rollback. Moreover, having multiple > resolver processes on one database would also help improve > throughput. A user who needs much better throughput can > also choose not to wait for resolution after the local commit, > like synchronous_commit = ‘local’ in replication. > > > As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki). That helps evaluate the 2PC performance, because it's hard to figure it out from the code of a large patch set. I'm simply imagining what is typically written in database textbooks and research papers. I'm asking this because I saw some discussion in this thread that some new WAL records are added. I was worried that transactions have to write WAL records other than prepare and commit, unlike textbook implementations. > > > > Atomic Commit of Distributed Transactions > > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions > > Understood. I'll add an explanation of the message flow and disk > writes to the wiki page. > > Done. > > We also need to consider error handling during foreign > transaction resolution. > > > > > > I don’t think we need to stipulate query cancellation. Anyway, I > > > guess the facts that we don’t stipulate anything about query > > > cancellation now and that postgres_fdw might not be cancellable in > > > some situations are not a reason for not supporting query > > > cancellation.
If it's desirable behavior and users want it, we need > > > to put in the effort to support it as much as possible, as we’ve done in > > > postgres_fdw. Some FDWs unfortunately might not be able to support it > > > with their own functionality alone, but it would be good if we could achieve > > > it through a combination of PostgreSQL and FDW plugins. > > > > Let me comment on this a bit; this is a somewhat dangerous idea, I'm afraid. We need to pay attention to the FDW interface and its documentation so that FDW developers can implement what we consider important -- query cancellation, in your discussion. "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't use. That's what Tomas Vondra pointed out several years ago. > > I suspect the story is somewhat different. libpq fortunately supports > asynchronous execution, but when it comes to canceling foreign > transaction resolution, I think basically all FDW plugins are in the > same situation at this time. We can choose whether to make it > cancellable or not. According to the discussion so far, it completely > depends on the architecture of this feature. So my point is whether > it's worth having this functionality and whether users want > it, not whether postgres_fdw is OK. > I've thought again about the idea that once the backend fails to resolve a foreign transaction, it leaves it to a resolver process. With this idea, the backend process performs the second phase of 2PC only once. If an error happens during resolution, the backend hands the transaction off to a resolver process and returns an error to the client. We used this idea in previous patches and it has been discussed from time to time. First of all, this idea doesn’t solve the error-handling problem that the transaction could return an error to the client even though the local transaction has been committed.
There is an argument that this behavior could also happen even in a single-server environment, but I think the situation is slightly different. Basically, what the transaction does after the commit is cleanup. An error could happen during cleanup, but if it does it's likely due to a bug or something wrong inside PostgreSQL or the OS. On the other hand, during and after resolution the transaction does major work such as connecting to a foreign server, sending SQL, getting the result, and writing WAL to remove the entry. These steps are much more likely to fail. Also, with this idea, the client needs to check whether an error it got from the server is genuine, because the local transaction might have been committed anyway. Although this could happen even in a single-server environment, how many users check that in practice? If a server crashes, subsequent transactions end up failing due to a network connection error, and it seems hard to distinguish such a real error from a spurious one. Moreover, it's questionable in terms of extensibility. We would not be able to support keeping waiting for distributed transactions to complete even if an error happens, as synchronous replication does. The user might want to wait in cases where the failure is temporary, such as a transient network disconnection. Trying resolution only once seems to combine the downsides of both asynchronous and synchronous resolution. So I'm thinking that with this idea the user would need to change their application to check whether the error they got is genuine, which is cumbersome. Also, it seems to me we need to carefully discuss whether this idea could weaken extensibility. Anyway, according to the discussion, it seems to me that we have reached a consensus so far that the backend process prepares all foreign transactions, and that a resolver process is necessary to resolve in-doubt transactions in the background. So I've changed the patch set as follows.
Applying all these patches, we can support asynchronous foreign transaction resolution. That is, at transaction commit the backend process prepares all foreign transactions and then commits the local transaction. After that, it returns a commit acknowledgment to the client while leaving the prepared foreign transactions to a resolver process. A resolver process fetches the foreign transactions to resolve and resolves them in the background. Since the second phase of 2PC is performed asynchronously, a transaction that wants to see a previous transaction's result needs to check its status.

Here is a brief explanation of each patch:

v27-0001-Introduce-transaction-manager-for-foreign-transa.patch

This commit adds the basic foreign transaction manager and the CommitForeignTransaction and RollbackForeignTransaction APIs. These APIs support only one-phase commit. With this change, an FDW is able to control its transactions using the foreign transaction manager rather than XactCallback.

v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch

This commit implements both the CommitForeignTransaction and RollbackForeignTransaction APIs in postgres_fdw. Note that since PREPARE TRANSACTION is still not supported, there is nothing new the user can do yet.

v27-0003-Recreate-RemoveForeignServerById.patch

This commit recreates RemoveForeignServerById, which was removed by b1d32d3e3. This is necessary because we need to check whether any foreign transaction involves the foreign server that is about to be removed.

v27-0004-Add-PrepareForeignTransaction-API.patch

This commit adds prepared foreign transaction support, including WAL logging and recovery, and the PrepareForeignTransaction API. With this change, the user is able to run 'PREPARE TRANSACTION' and 'COMMIT/ROLLBACK PREPARED' commands on a transaction that involves foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the local transaction; it doesn't do anything for foreign transactions.
Therefore, the user needs to resolve foreign transactions manually by executing the pg_resolve_foreign_xacts() SQL function, which is also introduced by this commit.

v27-0005-postgres_fdw-supports-prepare-API.patch

This commit implements the PrepareForeignTransaction API and makes CommitForeignTransaction and RollbackForeignTransaction support two-phase commit.

v27-0006-Add-GetPrepareId-API.patch

This commit adds the GetPrepareId API.

v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch

This commit introduces the foreign transaction resolver and launcher processes. With this change, the user doesn't need to manually execute the pg_resolve_foreign_xacts() function to resolve foreign transactions prepared by PREPARE TRANSACTION and left behind by COMMIT/ROLLBACK PREPARED. Instead, a resolver process automatically resolves them in the background.

v27-0008-Prepare-foreign-transactions-at-commit-time.patch

With this commit, the transaction prepares the foreign transactions marked as modified at commit time if foreign_twophase_commit is ‘required’. Previously the user needed to run PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED to use 2PC, but this enables 2PC transparently to the user. Note that the transaction returns a commit acknowledgment to the client after committing the local transaction and notifying the resolver process, without waiting. Foreign transactions are resolved asynchronously by the resolver process.

v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch

With this commit, transactions started via postgres_fdw are marked as modified, which is necessary to use 2PC.

v27-0010-Documentation-update.patch
v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch

Documentation update and regression tests.

The missing piece compared to the previous patch version is synchronous transaction resolution. In the previous patch, foreign transactions are synchronously resolved by a resolver process.
But since it's still under discussion whether this is a good approach, and I'm considering optimizing the logic, it's not included in the current patch set. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Attachment
- v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v27-0006-Add-GetPrepareId-API.patch
- v27-0005-postgres_fdw-supports-prepare-API.patch
- v27-0010-Documentation-update.patch
- v27-0004-Add-PrepareForeignTransaction-API.patch
- v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v27-0008-Prepare-foreign-transactions-at-commit-time.patch
- v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v27-0003-Recreate-RemoveForeignServerById.patch
- v27-0001-Introduce-transaction-manager-for-foreign-transa.patch
On Thu, Nov 5, 2020 at 12:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> [...]

Cfbot reported an error. I've attached the updated version patch set to make cfbot happy.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
Attachment
- v28-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v28-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v28-0010-Documentation-update.patch
- v28-0006-Add-GetPrepareId-API.patch
- v28-0008-Prepare-foreign-transactions-at-commit-time.patch
- v28-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v28-0003-Recreate-RemoveForeignServerById.patch
- v28-0005-postgres_fdw-supports-prepare-API.patch
- v28-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v28-0004-Add-PrepareForeignTransaction-API.patch
- v28-0001-Introduce-transaction-manager-for-foreign-transa.patch
On Sun, Nov 8, 2020 at 2:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Nov 5, 2020 at 12:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Oct 22, 2020 at 10:39 AM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > On Wed, 21 Oct 2020 at 18:33, tsunakawa.takay@fujitsu.com > > > <tsunakawa.takay@fujitsu.com> wrote: > > > > > > > > From: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> > > > > > So what's your opinion? > > > > > > > > My opinion is simple and has not changed. Let's clarify and refine the design first in the following areas (others may have pointed out something else too, but I don't remember), before going deeper into the code review. > > > > > > > > * FDW interface > > > > New functions so that other FDWs can really implement. Currently, XA seems to be the only model we can rely on to validate the FDW interface. > > > > What FDW function would call what XA function(s)? What should be the arguments for the FDW functions? > > > > > > I guess since FDW interfaces may be affected by the feature > > > architecture we can discuss later. > > > > > > > * Performance > > > > Parallel prepare and commits on the client backend. The current implementation is intolerable and should not be the first release quality. I proposed the idea. > > > > (If you insist you don't want to do anything about this, I have to think you're just rushing for the patch commit. I want to keep Postgres's reputation.) > > > > > > What is in your mind regarding the implementation of parallel prepare > > > and commit? Given that some FDW plugins don't support asynchronous > > > execution I guess we need to use parallel workers or something. That > > > is, the backend process launches parallel workers to > > > prepare/commit/rollback foreign transactions in parallel. I don't deny > > > this approach but it'll definitely make the feature complex and need > > > more code.
> > > > > > My point is to start small and keep the first version simple. Even > > > if we need one or more years for this feature, I think that > > > introducing the simple and minimum functionality as the first version > > > to the core still has benefits. We will be able to have the > > > opportunity to get real feedback from users and to fix bugs in the > > > main infrastructure before making it complex. In this sense, the patch > > > having the backend return without waiting for resolution after the local > > > commit would be a good start as the first version (i.e., up to > > > applying v26-0006 patch). Anyway, the architecture should be > > > extensible enough for future improvements. > > > > > > For the performance improvements, we will be able to support > > > asynchronous and/or parallel prepare/commit/rollback. Moreover, having multiple > > > resolver processes on one database would also help get better > > > throughput. For the user who needs much better throughput, the user > > > also can select not to wait for resolution after the local commit, > > > like synchronous_commit = ‘local’ in replication. > > > > > > > As part of this, I'd like to see the 2PC's message flow and disk writes (via email and/or on the following wiki). That helps evaluate the 2PC performance, because it's hard to figure it out in the code of a large patch set. I'm simply imagining what is typically written in database textbooks and research papers. I'm asking this because I saw some discussion in this thread that some new WAL records are added. I was worried that transactions have to write WAL records other than prepare and commit unlike textbook implementations. > > > > > > > > Atomic Commit of Distributed Transactions > > > > https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions > > > > > > Understood. I'll add an explanation about the message flow and disk > > > writes to the wiki page. > > > > Done.
> > > > > > > > We need to consider the point of error handling during resolving > > > foreign transactions too. > > > > > > > > > > > > I don’t think we need to stipulate the query cancellation. Anyway I > > > > > guess the facts neither that we don’t stipulate anything about query > > > > > cancellation now nor that postgres_fdw might not be cancellable in > > > > > some situations now are not a reason for not supporting query > > > > > cancellation. If it's a desirable behavior and users want it, we need > > > > > to put an effort to support it as much as possible like we’ve done in > > > > > postgres_fdw. Some FDWs unfortunately might not be able to support it > > > > > only by their functionality but it would be good if we can achieve > > > > > that by combination of PostgreSQL and FDW plugins. > > > > > > > > Let me comment on this a bit; this is a bit dangerous idea, I'm afraid. We need to pay attention to the FDW interface and its documentation so that FDW developers can implement what we consider important -- query cancellation in your discussion. "postgres_fdw is OK, so the interface is good" can create interfaces that other FDW developers can't use. That's what Tomas Vondra pointed out several years ago. > > > > > > I suspect the story is somewhat different. libpq fortunately supports > > > asynchronous execution, but when it comes to canceling the foreign > > > transaction resolution I think basically all FDW plugins are in the > > > same situation at this time. We can choose whether to make it > > > cancellable or not. According to the discussion so far, it completely > > > depends on the architecture of this feature. So my point is whether > > > it's worth to have this functionality for users and whether users want > > > it, not whether postgres_fdw is ok. > > > > > > > I've thought again about the idea that once the backend failed to > > resolve a foreign transaction it leaves to a resolver process.
With > > this idea, the backend process perform the 2nd phase of 2PC only once. > > If an error happens during resolution it leaves to a resolver process > > and returns an error to the client. We used to use this idea in the > > previous patches and it’s discussed sometimes. > > > > First of all, this idea doesn’t resolve the problem of error handling > > that the transaction could return an error to the client in spite of > > having been committed the local transaction. There is an argument that > > this behavior could also happen even in a single server environment > > but I guess the situation is slightly different. Basically what the > > transaction does after the commit is cleanup. An error could happen > > during cleanup but if it happens it’s likely due to a bug of > > something wrong inside PostgreSQL or OS. On the other hand, during and > > after resolution the transaction does major works such as connecting a > > foreign server, sending an SQL, getting the result, and writing a WAL > > to remove the entry. These are more likely to happen an error. > > > > Also, with this idea, the client needs to check if the error got from > > the server is really true because the local transaction might have > > been committed. Although this could happen even in a single server > > environment how many users check that in practice? If a server > > crashes, subsequent transactions end up failing due to a network > > connection error but it seems hard to distinguish between such a real > > error and the fake error. > > > > Moreover, it’s questionable in terms of extensibility. We would not > > able to support keeping waiting for distributed transactions to > > complete even if an error happens, like synchronous replication. The > > user might want to wait in case where the failure is temporary such as > > temporary network disconnection. Trying resolution only once seems to > > have cons of both asynchronous and synchronous resolutions. 
> > > > So I’m thinking that with this idea the user will need to change their > > application so that it checks if the error they got is really true, > > which is cumbersome for users. Also, it seems to me we need to > > circumspectly discuss whether this idea could weaken extensibility. > > > > > > Anyway, according to the discussion, it seems to me that we got a > > consensus so far that the backend process prepares all foreign > > transactions and a resolver process is necessary to resolve in-doubt > > transaction in background. So I’ve changed the patch set as follows. > > Applying these all patches, we can support asynchronous foreign > > transaction resolution. That is, at transaction commit the backend > > process prepares all foreign transactions, and then commit the local > > transaction. After that, it returns OK of commit to the client while > > leaving the prepared foreign transaction to a resolver process. A > > resolver process fetches the foreign transactions to resolve and > > resolves them in background. Since the 2nd phase of 2PC is performed > > asynchronously a transaction that wants to see the previous > > transaction result needs to check its status. > > > > Here is brief explaination for each patches: > > > > v27-0001-Introduce-transaction-manager-for-foreign-transa.patch > > > > This commit adds the basic foreign transaction manager, > > CommitForeignTransaction, and RollbackForeignTransaction API. These > > APIs support only one-phase. With this change, FDW is able to control > > its transaction using the foreign transaction manager, not using > > XactCallback. > > > > v27-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch > > > > This commit implements both CommitForeignTransaction and > > RollbackForeignTransaction APIs in postgres_fdw. Note that since > > PREPARE TRANSACTION is still not supported there is nothing the user > > newly is able to do. 
> > > > v27-0003-Recreate-RemoveForeignServerById.patch > > > > This commit recreates RemoveForeignServerById that was removed by > > b1d32d3e3. This is necessary because we need to check if there is a > > foreign transaction involved with the foreign server that is about to > > be removed. > > > > v27-0004-Add-PrepareForeignTransaction-API.patch > > > > This commit adds prepared foreign transaction support including WAL > > logging and recovery, and PrepareForeignTransaction API. With this > > change, the user is able to do 'PREPARE TRANSACTION’ and > > 'COMMIT/ROLLBACK PREPARED' commands on the transaction that involves > > foreign servers. But note that COMMIT/ROLLBACK PREPARED ends only the > > local transaction. It doesn't do anything for foreign transactions. > > Therefore, the user needs to resolve foreign transactions manually by > > executing the pg_resolve_foreign_xacts() SQL function which is also > > introduced by this commit. > > > > v27-0005-postgres_fdw-supports-prepare-API.patch > > > > This commit implements PrepareForeignTransaction API and makes > > CommitForeignTransaction and RollbackForeignTransaction supports > > two-phase commit. > > > > v27-0006-Add-GetPrepareId-API.patch > > > > This commit adds GetPrepareID API. > > > > v27-0007-Introduce-foreign-transaction-launcher-and-resol.patch > > > > This commit introduces foreign transaction resolver and launcher > > processes. With this change, the user doesn’t need to manually execute > > pg_resolve_foreign_xacts() function to resolve foreign transactions > > prepared by PREPARE TRANSACTION and left by COMMIT/ROLLBACK PREPARED. > > Instead, a resolver process automatically resolves them in background. > > > > v27-0008-Prepare-foreign-transactions-at-commit-time.patch > > > > With this commit, the transaction prepares foreign transactions marked > > as modified at transaction commit if foreign_twophase_commit is > > ‘required’. 
Previously the user needs to do PREPARE TRANSACTION and > > COMMIT/ROLLBACK PREPARED to use 2PC but it enables us to use 2PC > > transparently to the user. But the transaction returns OK of commit to > > the client after committing the local transaction and notifying the > > resolver process, without waits. Foreign transactions are > > asynchronously resolved by the resolver process. > > > > v27-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch > > > > With this commit, the transactions started via postgres_fdw are marked > > as modified, which is necessary to use 2PC. > > > > v27-0010-Documentation-update.patch > > v27-0011-Add-regression-tests-for-foreign-twophase-commit.patch > > > > Documentation update and regression tests. > > > > The missing piece from the previous version patch is synchronously > > transaction resolution. In the previous patch, foreign transactions > > are synchronously resolved by a resolver process. But since it's under > > discussion whether this is a good approach and I'm considering > > optimizing the logic it’s not included in the current patch set. > > > > > > Cfbot reported an error. I've attached the updated version patch set > to make cfbot happy. Since the previous version conflicts with the current HEAD I've attached the rebased version patch set. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Attachment
- v29-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v29-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v29-0006-Add-GetPrepareId-API.patch
- v29-0008-Prepare-foreign-transactions-at-commit-time.patch
- v29-0010-Documentation-update.patch
- v29-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v29-0005-postgres_fdw-supports-prepare-API.patch
- v29-0003-Recreate-RemoveForeignServerById.patch
- v29-0004-Add-PrepareForeignTransaction-API.patch
- v29-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v29-0001-Introduce-transaction-manager-for-foreign-transa.patch
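To make the commit sequence described in the message above concrete, here is a minimal, self-contained C sketch of the flow: prepare every foreign transaction, commit the local transaction, then return OK to the client while the prepared entries are merely marked in-doubt for a background resolver. All names here (fx_state, commit_distributed, and so on) are illustrative stand-ins, not the patch's actual symbols, and the error handling is deliberately simplified.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative states for one foreign-server participant. */
typedef enum
{
    FX_STARTED,   /* transaction open on the foreign server */
    FX_PREPARED,  /* PREPARE TRANSACTION succeeded */
    FX_ABORTED,   /* rolled back before the local commit */
    FX_IN_DOUBT   /* left for the resolver process to finish */
} fx_state;

/* Phase 1: prepare every participant; on failure roll everything back. */
static bool
prepare_all(fx_state *parts, int n, int fail_at)
{
    for (int i = 0; i < n; i++)
    {
        if (i == fail_at)       /* simulated PREPARE failure */
        {
            for (int j = 0; j < n; j++)
                parts[j] = FX_ABORTED;
            return false;
        }
        parts[i] = FX_PREPARED;
    }
    return true;
}

/*
 * Commit a distributed transaction.  Returns true (an "OK" to the
 * client) right after the local commit; the prepared foreign
 * transactions are only marked in-doubt here and are resolved later,
 * asynchronously, by a resolver process.
 */
static bool
commit_distributed(fx_state *parts, int n, bool *local_committed, int fail_at)
{
    if (!prepare_all(parts, n, fail_at))
        return false;           /* client sees an error; nothing committed */

    *local_committed = true;    /* point of no return: local COMMIT */

    for (int i = 0; i < n; i++)
        parts[i] = FX_IN_DOUBT; /* handed off to the resolver process */

    return true;                /* client gets OK without waiting */
}
```

The key property the sketch demonstrates is that a PREPARE failure surfaces before the local commit, while after the local commit the client's OK no longer implies the foreign transactions are finished.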
On Wed, Nov 25, 2020 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Since the previous version conflicts with the current HEAD I've
> attached the rebased version patch set.

Rebased the patch set again to the current HEAD.

The discussion of this patch is very long, so here is a short summary of the current state:

It's still under discussion which approach is the best for the distributed transaction commit as a building block of built-in sharding using foreign data wrappers.

Since we're considering using this feature for built-in sharding, the design depends on the architecture of built-in sharding. For example, with the current patch, the PostgreSQL node that received a COMMIT from the client works as a coordinator, and it commits the transactions using 2PC on all foreign servers involved with the transaction. This approach would be good with a decentralized sharding architecture, but not with a centralized architecture like the GTM node of Postgres-XC and Postgres-XL, which is a dedicated component responsible for transaction management. Since we haven't reached a consensus on the built-in sharding architecture yet, it's still an open question whether this patch's approach is really good as a building block of built-in sharding.

On the other hand, this feature is not necessarily dedicated to built-in sharding. For example, the distributed transaction commit through FDWs is important also when atomically moving data between two servers via FDWs. Using a dedicated process or server like GTM could be overkill. Having the node that received a COMMIT work as a coordinator would be better and more straightforward.

There is no noticeable TODO in the functionality so far covered by this patch set. This patch set adds new FDW APIs to support 2PC, introduces the global transaction manager, and implements those FDW APIs in postgres_fdw. Also, it has regression tests and documentation.

Transactions on foreign servers involved with the distributed transaction are committed using 2PC. Committing using 2PC is performed asynchronously and transparently to the user. Therefore, it doesn't guarantee that transactions on the foreign servers are also committed when the client gets an acknowledgment of COMMIT. Synchronous foreign transaction commit via 2PC is not covered by this patch, as we still need a discussion on the design.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
Attachment
- v30-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v30-0008-Prepare-foreign-transactions-at-commit-time.patch
- v30-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v30-0010-Documentation-update.patch
- v30-0006-Add-GetPrepareId-API.patch
- v30-0003-Recreate-RemoveForeignServerById.patch
- v30-0005-postgres_fdw-supports-prepare-API.patch
- v30-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v30-0004-Add-PrepareForeignTransaction-API.patch
- v30-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v30-0001-Introduce-transaction-manager-for-foreign-transa.patch
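The asynchronous side of the design above can be pictured with a small sketch of a resolver cycle: a background loop that keeps retrying in-doubt foreign transactions until the foreign server acknowledges the second phase. The struct, function names, and the per-cycle attempt cap are assumptions for illustration only, not the patch's actual resolver code.

```c
#include <assert.h>
#include <stdbool.h>

/* One in-doubt foreign transaction; fails_left simulates transient errors. */
typedef struct
{
    int  fails_left;   /* times "COMMIT PREPARED" will still fail */
    bool resolved;
} fdw_xact_sim;

/* Simulated COMMIT PREPARED round trip to the foreign server. */
static bool
try_resolve(fdw_xact_sim *fx)
{
    if (fx->fails_left > 0)
    {
        fx->fails_left--;      /* e.g. temporary network disconnection */
        return false;
    }
    fx->resolved = true;
    return true;
}

/*
 * One resolver cycle: retry each unresolved entry, spending at most
 * max_attempts in total.  Entries that still fail simply stay in doubt
 * and are picked up again on the next cycle, so a transient failure
 * never turns a committed distributed transaction into a lost one.
 */
static int
resolve_in_doubt(fdw_xact_sim *xacts, int n, int max_attempts)
{
    int attempts = 0;

    for (int i = 0; i < n; i++)
    {
        while (!xacts[i].resolved && attempts < max_attempts)
        {
            attempts++;
            if (try_resolve(&xacts[i]))
                break;
        }
    }
    return attempts;
}
```

This is also why the client's COMMIT acknowledgment cannot promise foreign-side durability yet: resolution may still be mid-retry when the OK is returned.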
Attachment
- v31-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v31-0005-postgres_fdw-supports-prepare-API.patch
- v31-0008-Prepare-foreign-transactions-at-commit-time.patch
- v31-0010-Documentation-update.patch
- v31-0006-Add-GetPrepareId-API.patch
- v31-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v31-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v31-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v31-0003-Recreate-RemoveForeignServerById.patch
- v31-0004-Add-PrepareForeignTransaction-API.patch
- v31-0001-Introduce-transaction-manager-for-foreign-transa.patch
On Mon, Dec 28, 2020 at 11:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Nov 25, 2020 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Since the previous version conflicts with the current HEAD I've
> > attached the rebased version patch set.
>
> Rebased the patch set again to the current HEAD.
>
> The discussion of this patch is very long so here is a short summary
> of the current state:
>
> It’s still under discussion which approach is the best for the
> distributed transaction commit as a building block of built-in sharding
> using foreign data wrappers.
>
> Since we’re considering that we use this feature for built-in
> sharding, the design depends on the architecture of built-in sharding.
> For example, with the current patch, the PostgreSQL node that received
> a COMMIT from the client works as a coordinator and it commits the
> transactions using 2PC on all foreign servers involved with the
> transaction. This approach would be good with the de-centralized
> sharding architecture but not with centralized architecture like the
> GTM node of Postgres-XC and Postgres-XL that is a dedicated component
> that is responsible for transaction management. Since we haven't
> reached a consensus on the built-in sharding architecture yet, it's
> still an open question whether this patch's approach is really good as
> a building block of the built-in sharding.
>
> On the other hand, this feature is not necessarily dedicated to the
> built-in sharding. For example, the distributed transaction commit
> through FDW is important also when atomically moving data between two
> servers via FDWs. Using a dedicated process or server like GTM could
> be overkill. Having the node that received a COMMIT work as a
> coordinator would be better and more straightforward.
>
> There is no noticeable TODO in the functionality so far covered by
> this patch set. This patch set adds new FDW APIs to support 2PC,
> introduces the global transaction manager, and implements those FDW
> APIs in postgres_fdw. Also, it has regression tests and documentation.
> Transactions on foreign servers involved with the distributed
> transaction are committed using 2PC. Committing using 2PC is performed
> asynchronously and transparently to the user. Therefore, it doesn’t
> guarantee that transactions on the foreign server are also committed
> when the client gets an acknowledgment of COMMIT. Synchronous foreign
> transaction commit via 2PC is not covered by this patch, as we still
> need a discussion on the design.
>
I've attached the rebased patches to make cfbot happy.
Regards,
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
Attachment
- v32-0010-Documentation-update.patch
- v32-0004-Add-PrepareForeignTransaction-API.patch
- v32-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v32-0008-Prepare-foreign-transactions-at-commit-time.patch
- v32-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v32-0003-Recreate-RemoveForeignServerById.patch
- v32-0006-Add-GetPrepareId-API.patch
- v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v32-0005-postgres_fdw-supports-prepare-API.patch
- v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v32-0001-Introduce-transaction-manager-for-foreign-transa.patch
On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
Thank you for reviewing the patch!
> For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>
> However these functions are not neither committed nor aborted at
>
> I think the double negation was not intentional. Should be 'are neither ...'
Fixed.
>
> For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?
Hmm, you mean that we need MAXALIGN(size) after adding the size of
FdwXactData structs?
Size
FdwXactShmemSize(void)
{
    Size    size;

    /* Size for foreign transaction information array */
    size = offsetof(FdwXactCtlData, fdwxacts);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXact)));
    size = MAXALIGN(size);
    size = add_size(size, mul_size(max_prepared_foreign_xacts,
                                   sizeof(FdwXactData)));

    return size;
}
I don't think we need to do that. Other similar code, such as
TwoPhaseShmemSize(), doesn't do that. Why do you think we need that?
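For what it's worth, the alignment question can be checked with a tiny stand-alone sketch. The macros below mimic MAXALIGN from PostgreSQL's c.h (assuming 8-byte maximum alignment), and shmem_size() mirrors the shape of the computation above with caller-supplied sizes; it is not the patch's code, just an illustration of why one MAXALIGN between the two arrays suffices.

```c
#include <assert.h>
#include <stddef.h>

#define MAXIMUM_ALIGNOF 8
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((size_t) (LEN) + ((ALIGNVAL) - 1)) & ~((size_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))

/*
 * Shape of FdwXactShmemSize(): header plus pointer array, then one
 * MAXALIGN so the record array starts on an aligned boundary, then the
 * records themselves.  A trailing MAXALIGN would be pointless because
 * the shmem allocator aligns the start of each allocation, not its end.
 */
static size_t
shmem_size(size_t header, size_t nxacts, size_t ptr_sz, size_t rec_sz)
{
    size_t size = header;

    size += nxacts * ptr_sz;   /* array of FdwXact pointers */
    size = MAXALIGN(size);     /* align the start of the record array */
    size += nxacts * rec_sz;   /* the FdwXactData records */
    return size;
}
```

With, say, a 20-byte header, 3 transactions, 8-byte pointers, and 40-byte records, only the boundary between the pointer array and the record array needs rounding up.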
>
> + fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>
> For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?
Agreed. Changed to FdwXactInsertEntry().
>
> + * we don't need to anything for this participant because all foreign
>
> 'need to' -> 'need to do'
Fixed.
>
> + else if (TransactionIdDidAbort(xid))
> + return FDWXACT_STATUS_ABORTING;
> +
> the 'else' can be omitted since the preceding if would return.
Fixed.
>
> + if (max_prepared_foreign_xacts <= 0)
>
> I wonder when the value for max_prepared_foreign_xacts would be negative (and whether that should be considered an error).
>
Fixed to (max_prepared_foreign_xacts == 0)
Attached the updated version patch set.
Regards,
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
Hi,
For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

+ bool have_notwophase = false;

Maybe name the variable have_no_twophase so that it is easier to read.

+ * Two-phase commit is not required if the number of servers performed

performed -> performing

+ errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
+ errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));

The lines are really long. Please wrap into more lines.

On Wed, Jan 13, 2021 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
Thank you for reviewing the patch!
> For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>
> However these functions are not neither committed nor aborted at
>
> I think the double negation was not intentional. Should be 'are neither ...'
Fixed.
>
> For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the return statement ?
Hmm, you mean that we need MAXALIGN(size) after adding the size of
FdwXactData structs?
Size
FdwXactShmemSize(void)
{
Size size;
/* Size for foreign transaction information array */
size = offsetof(FdwXactCtlData, fdwxacts);
size = add_size(size, mul_size(max_prepared_foreign_xacts,
sizeof(FdwXact)));
size = MAXALIGN(size);
size = add_size(size, mul_size(max_prepared_foreign_xacts,
sizeof(FdwXactData)));
return size;
}
I don't think we need to do that. Looking at other similar code such
as TwoPhaseShmemSize() doesn't do that. Why do you think we need that?
>
> + fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>
> For the function name, Fdw and Xact appear twice, each. Maybe one of them can be dropped ?
Agreed. Changed to FdwXactInsertEntry().
>
> + * we don't need to anything for this participant because all foreign
>
> 'need to' -> 'need to do'
Fixed.
>
> + else if (TransactionIdDidAbort(xid))
> + return FDWXACT_STATUS_ABORTING;
> +
> the 'else' can be omitted since the preceding if would return.
Fixed.
>
> + if (max_prepared_foreign_xacts <= 0)
>
> I wonder when the value for max_prepared_foreign_xacts would be negative (and whether that should be considered an error).
>
Fixed to (max_prepared_foreign_xacts == 0)
Attached the updated version patch set.
Regards,
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Fri, Jan 15, 2021 at 4:03 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :
Thank you for reviewing the patch!
>
> + bool have_notwophase = false;
>
> Maybe name the variable have_no_twophase so that it is easier to read.
Fixed.
>
> + * Two-phase commit is not required if the number of servers performed
>
> performed -> performing
Fixed.
>
> + errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol"),
> + errdetail("foreign_twophase_commit is \'required\' but the transaction has some foreign servers which are not capable of two-phase commit")));
>
> The lines are really long. Please wrap into more lines.
Hmm, we could do that, but it would make grepping for the error
message hard. Please refer to the documentation about the formatting
guidelines[1]:
Limit line lengths so that the code is readable in an 80-column
window. (This doesn't mean that you must never go past 80 columns. For
instance, breaking a long error message string in arbitrary places
just to keep the code within 80 columns is probably not a net gain in
readability.)
These changes have been made in the local branch. I'll post the
updated patch set after incorporating all the comments.
Regards,
[1] https://www.postgresql.org/docs/devel/source-format.html
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
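The check behind the error message discussed above can be pictured with a small sketch: count the participants the transaction modified and note whether any of them cannot prepare, then fail when foreign_twophase_commit is 'required' but 2PC is impossible. The enum, struct, and function names below are invented for illustration; they are not the patch's actual identifiers.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { FTC_DISABLED, FTC_REQUIRED } ftc_setting;

typedef struct
{
    bool modified;      /* did the transaction write to this server? */
    bool twophase_ok;   /* does the server support PREPARE TRANSACTION? */
} participant;

/* Returns true if the commit may proceed (using 2PC when required). */
static bool
check_twophase(ftc_setting setting, const participant *parts, int n)
{
    int  nmodified = 0;
    bool have_no_twophase = false;

    for (int i = 0; i < n; i++)
    {
        if (!parts[i].modified)
            continue;          /* read-only participants don't need 2PC */
        nmodified++;
        if (!parts[i].twophase_ok)
            have_no_twophase = true;
    }

    /* Two-phase commit is not needed for zero or one modified server. */
    if (nmodified <= 1)
        return true;

    if (setting == FTC_REQUIRED && have_no_twophase)
        return false;          /* would raise the "not capable" error */

    return true;
}
```

The two-argument shape makes the review point easy to see: only servers that were actually modified count toward the 2PC requirement.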
On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>
> + entry->changing_xact_state = true;
> ...
> + entry->changing_xact_state = abort_cleanup_failure;
>
> I don't see return statement in between the two assignments. I wonder why entry->changing_xact_state is set to true, and later being assigned again.
Because postgresRollbackForeignTransaction() can get called again in
the case where an error occurred while aborting and cleaning up the
transaction. For example, if an error occurred when executing ABORT
TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
postgresRollbackForeignTransaction() will get called again while
entry->changing_xact_state is still true. Then the entry will be
caught by the following condition and cleaned up:
/*
 * If connection is before starting transaction or is already unsalvageable,
 * do only the cleanup and don't touch it further.
 */
if (entry->changing_xact_state)
{
    pgfdw_cleanup_after_transaction(entry);
    return;
}
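The re-entry behavior described here can be sketched as a little state machine: the first rollback attempt sets the flag before touching the connection, and a later call that finds the flag still set does only the cleanup. This is a simplified stand-in for illustration, not postgres_fdw's actual code (send_abort and the struct fields are invented).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    bool changing_xact_state;
    bool cleaned_up;
} conn_entry;

/* Simulated "ABORT TRANSACTION" that may error out (return false). */
static bool
send_abort(bool fail)
{
    return !fail;
}

/*
 * Rollback handler.  If a previous attempt died mid-way (flag still
 * set), do only the cleanup; otherwise set the flag, try the abort,
 * and clear the flag only once the abort fully succeeded.
 */
static void
rollback_foreign_xact(conn_entry *entry, bool abort_fails)
{
    if (entry->changing_xact_state)
    {
        entry->cleaned_up = true;   /* unsalvageable: cleanup only */
        return;
    }

    entry->changing_xact_state = true;
    if (send_abort(abort_fails))
        entry->changing_xact_state = false;  /* clean abort */
    /* on failure the flag stays true, so a retry falls into cleanup */
}
```

That is why the flag is assigned twice with no return in between: the second assignment only happens on the success path, while an ERROR thrown in between leaves the flag set for the re-entrant call.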
>
> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>
> bq. This commits introduces to new background processes: foreign
>
> commits introduces to new -> commit introduces two new
Fixed.
>
> +FdwXactExistsXid(TransactionId xid)
>
> Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.
But there is already a function named FdwXactExists()?
bool
FdwXactExists(Oid dbid, Oid serverid, Oid userid)
As far as I read other code, we already have such functions that have
the same functionality but have different arguments. For instance,
SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
we can leave as it is but is it better to have like
FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?
>
> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>
> Please correct year in the next patch set.
Fixed.
>
> +FdwXactLauncherRequestToLaunch(void)
>
> Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.
Agreed. How about FdwXactRequestToLaunchResolver()?
>
> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
> +Size
> +FdwXactRslvShmemSize(void)
>
> Are both Rsover and Rslv referring to resolver ? It would be better to use whole word which reduces confusion.
> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit)
Agreed. I realized that these functions are the launcher's functions,
not the resolver's. So I'd change them to FdwXactLauncherShmemSize()
and FdwXactLauncherShmemInit() respectively.
>
> +fdwxact_launch_resolver(Oid dbid)
>
> The above method is not in camel case. It would be better if method names are consistent (in casing).
Fixed.
>
> + errmsg("out of foreign transaction resolver slots"),
> + errhint("You might need to increase max_foreign_transaction_resolvers.")));
>
> It would be nice to include the value of max_foreign_xact_resolvers
I agree it would be nice, but looking at other code we don't include
the value in this kind of message.
>
> For fdwxact_resolver_onexit():
>
> + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
> + fdwxact->locking_backend = InvalidBackendId;
> + LWLockRelease(FdwXactLock);
>
> There is no time-consuming method call inside the for loop. I wonder if the lock can be obtained prior to the for loop and released after coming out of the for loop.
Agreed.
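The suggestion above, taking FdwXactLock once around the whole loop rather than per entry, can be sketched with a standalone toy model. All names and structures here are illustrative, not the actual PostgreSQL code, and a plain counter stands in for the LWLock so the sketch is self-contained:

```c
#include <assert.h>

/* Illustrative stand-ins for the shared FdwXact slots and FdwXactLock. */
#define MAX_FDWXACTS 8
#define InvalidBackendId (-1)

static int fdwxact_locking_backend[MAX_FDWXACTS];
static int lock_acquisitions;        /* counts how often the lock is taken */

static void lock_acquire(void) { lock_acquisitions++; }
static void lock_release(void) { }

/*
 * Clear every slot held by this backend under a single lock acquisition.
 * Holding the lock across the loop is safe precisely because the loop
 * body does no slow work. Returns the number of slots released.
 */
int release_my_slots(int my_backend_id)
{
    int released = 0;

    lock_acquire();                  /* once, before the loop */
    for (int i = 0; i < MAX_FDWXACTS; i++)
    {
        if (fdwxact_locking_backend[i] == my_backend_id)
        {
            fdwxact_locking_backend[i] = InvalidBackendId;
            released++;
        }
    }
    lock_release();                  /* once, after the loop */
    return released;
}
```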
>
> +FXRslvLoop(void)
>
> Please use Resolver instead of Rslv
Fixed.
>
> + FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>
> Fdw and Xact are repeated twice each in the method name. Probably the method name can be made shorter.
Fixed.
Regards,
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On 2021/01/18 14:54, Masahiko Sawada wrote:
> On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>>
>> For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
>>
>> + entry->changing_xact_state = true;
>> ...
>> + entry->changing_xact_state = abort_cleanup_failure;
>>
>> I don't see a return statement in between the two assignments. I wonder why entry->changing_xact_state is set to true, and later being assigned again.
>
> Because postgresRollbackForeignTransaction() can get called again in
> case where an error occurred during aborting and cleaning up the
> transaction. For example, if an error occurred when executing ABORT
> TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
> postgresRollbackForeignTransaction() will get called again while
> entry->changing_xact_state is still true. Then the entry will be
> caught by the following condition and cleaned up:
>
> /*
> * If connection is before starting transaction or is already unsalvageable,
> * do only the cleanup and don't touch it further.
> */
> if (entry->changing_xact_state)
> {
> pgfdw_cleanup_after_transaction(entry);
> return;
> }
>
>> For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
>>
>> bq. This commits introduces to new background processes: foreign
>>
>> commits introduces to new -> commit introduces two new
>
> Fixed.
>
>> +FdwXactExistsXid(TransactionId xid)
>>
>> Since Xid is the parameter to this method, I think the Xid suffix can be dropped from the method name.
>
> But there is already a function named FdwXactExists()?
>
> bool
> FdwXactExists(Oid dbid, Oid serverid, Oid userid)
>
> As far as I read other code, we already have such functions that have
> the same functionality but have different arguments. For instance,
> SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
> we can leave it as it is, but would it be better to have something like
> FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?
>
>> + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
>>
>> Please correct the year in the next patch set.
>
> Fixed.
>
>> +FdwXactLauncherRequestToLaunch(void)
>>
>> Since the launcher's job is to 'launch', I think the Launcher can be omitted from the method name.
>
> Agreed. How about FdwXactRequestToLaunchResolver()?
>
>> +/* Report shared memory space needed by FdwXactRsoverShmemInit */
>> +Size
>> +FdwXactRslvShmemSize(void)
>>
>> Are both Rsover and Rslv referring to resolver? It would be better to use the whole word, which reduces confusion.
>> Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or FdwXactResolveShmemInit)
>
> Agreed. I realized that these functions are the launcher's functions,
> not the resolver's. So I'd change them to FdwXactLauncherShmemSize() and
> FdwXactLauncherShmemInit() respectively.
>
>> +fdwxact_launch_resolver(Oid dbid)
>>
>> The above method is not in camel case. It would be better if method names are consistent (in casing).
>
> Fixed.
>
>> + errmsg("out of foreign transaction resolver slots"),
>> + errhint("You might need to increase max_foreign_transaction_resolvers.")));
>>
>> It would be nice to include the value of max_foreign_xact_resolvers
>
> I agree it would be nice, but looking at other code, we don't include
> the value in this kind of message.
>
>> For fdwxact_resolver_onexit():
>>
>> + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
>> + fdwxact->locking_backend = InvalidBackendId;
>> + LWLockRelease(FdwXactLock);
>>
>> There is no time-consuming method call inside the for loop. I wonder if the lock can be obtained prior to the for loop and released after coming out of the for loop.
>
> Agreed.
>
>> +FXRslvLoop(void)
>>
>> Please use Resolver instead of Rslv
>
> Fixed.
>
>> + FdwXactResolveFdwXacts(held_fdwxacts, nheld);
>>
>> Fdw and Xact are repeated twice each in the method name. Probably the method name can be made shorter.
>
> Fixed.

You fixed some issues. But maybe you forgot to attach the latest patches?

I'm reading the 0001 and 0002 patches to pick up the changes for postgres_fdw that are worth applying independently from the 2PC feature. If there are such changes, IMO we can apply them in advance, which would make the patches simpler.

+ if (PQresultStatus(res) != PGRES_COMMAND_OK)
+ ereport(ERROR, (errmsg("could not commit transaction on server %s",
+ frstate->server->servername)));

You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of the call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.

Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.

Of course, if this change is not essential, we can skip doing this in the first version.

- /*
- * Regardless of the event type, we can now mark ourselves as out of the
- * transaction. (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
- * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
- */
- xact_got_connection = false;

With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback(), which uses xact_got_connection?

+ /* Also reset cursor numbering for next transaction */
+ cursor_number = 0;

Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time a foreign transaction ends at each connection. This change would fortunately be harmless in practice, but seems not right theoretically.

This makes me wonder whether the new FDW API is not good at handling the case where some operations need to be performed once per transaction end.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> You fixed some issues. But maybe you forgot to attach the latest patches?

Yes, I've attached the updated patches.

> I'm reading the 0001 and 0002 patches to pick up the changes for postgres_fdw that are worth applying independently from the 2PC feature. If there are such changes, IMO we can apply them in advance, which would make the patches simpler.

Thank you for reviewing the patches!

> + if (PQresultStatus(res) != PGRES_COMMAND_OK)
> + ereport(ERROR, (errmsg("could not commit transaction on server %s",
> + frstate->server->servername)));
>
> You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of the call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.

Right. It's better to use do_sql_command() instead.

> Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.
>
> Of course, if this change is not essential, we can skip doing this in the first version.

Yes, I think it's not essential for now. We can improve it later if we want.

> - /*
> - * Regardless of the event type, we can now mark ourselves as out of the
> - * transaction. (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
> - * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
> - */
> - xact_got_connection = false;
>
> With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback(), which uses xact_got_connection?

I think xact_got_connection is set to false in pgfdw_cleanup_after_transaction(), which is called at the end of each foreign transaction (i.e., in postgresCommitForeignTransaction() and postgresRollbackForeignTransaction()).

But as you're concerned below, it's reset at each foreign transaction's end rather than at the parent transaction's end.

> + /* Also reset cursor numbering for next transaction */
> + cursor_number = 0;
>
> Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time a foreign transaction ends at each connection. This change would fortunately be harmless in practice, but seems not right theoretically.
>
> This makes me wonder whether the new FDW API is not good at handling the case where some operations need to be performed once per transaction end.

I think that the problem comes from the fact that the FDW needs to use both SubXactCallback and the new FDW API.

If we want to perform some operations at the end of the top transaction per FDW, not per foreign transaction, we will either still need to use XactCallback or need to rethink the FDW API design. But given that we call the commit and rollback FDW APIs only for foreign servers that actually started a transaction, I'm not sure such operations exist in practice. IIUC there are none, at least from the normal (non-sub) transaction termination perspective.

IIUC xact_got_connection is used to skip iterating over all cached connections to find open remote (sub)transactions. This is not necessary anymore, at least from the normal transaction termination perspective. So maybe we can improve it so that it tracks whether any of the cached connections opened a subtransaction. That is, we set it to true when we create a savepoint on any connection, and set it to false at the end of pgfdw_subxact_callback() if we see that the xact_depth of every cached entry is less than or equal to 1 after iterating over all entries.

Regarding cursor_number, it essentially needs to be unique at least within a transaction, so we can manage it per transaction or per connection. But the current postgres_fdw rather ensures uniqueness across all connections. So it seems to me that this can be fixed by making each connection have its own cursor_number and resetting it in pgfdw_cleanup_after_transaction(). I think this can be done in a separate patch.

Alternatively, terminating subtransactions via an FDW API could also solve this problem, but I don't think that's a good idea.

What do you think?

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
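The per-connection cursor_number idea described above could look roughly like this. This is a minimal standalone model; the struct and function names are hypothetical, not postgres_fdw's actual code:

```c
#include <assert.h>

/*
 * Sketch: give each connection cache entry its own cursor_number instead
 * of one counter global to all connections, and reset it when the foreign
 * transaction on that connection ends.
 */
typedef struct ConnCacheEntry
{
    unsigned int cursor_number;  /* per-connection cursor numbering */
} ConnCacheEntry;

/* Hand out the next cursor number for this connection. */
unsigned int get_cursor_number(ConnCacheEntry *entry)
{
    return ++entry->cursor_number;
}

/* Called at foreign-transaction end for this connection only. */
void cleanup_after_transaction(ConnCacheEntry *entry)
{
    entry->cursor_number = 0;
}
```

Resetting in the per-connection cleanup keeps the numbering unique within each connection's transaction, which is all the uniqueness the discussion says is required.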
Attachment
- v34-0011-Add-regression-tests-for-foreign-twophase-commit.patch
- v34-0005-postgres_fdw-supports-prepare-API.patch
- v34-0009-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v34-0008-Prepare-foreign-transactions-at-commit-time.patch
- v34-0010-Documentation-update.patch
- v34-0007-Introduce-foreign-transaction-launcher-and-resol.patch
- v34-0006-Add-GetPrepareId-API.patch
- v34-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v34-0003-Recreate-RemoveForeignServerById.patch
- v34-0004-Add-PrepareForeignTransaction-API.patch
- v34-0001-Introduce-transaction-manager-for-foreign-transa.patch
On Sat, Jan 16, 2021 at 1:39 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,

Thank you for reviewing the patch!

> For v32-0004-Add-PrepareForeignTransaction-API.patch :
>
> + * Whenever a foreign transaction is processed, the corresponding FdwXact
> + * entry is update. To avoid holding the lock during transaction processing
> + * which may take an unpredicatable time the in-memory data of foreign
>
> entry is update -> entry is updated
>
> unpredicatable -> unpredictable

Fixed.

> + int nlefts = 0;
>
> nlefts -> nremaining
>
> + elog(DEBUG1, "left %u foreign transactions", nlefts);
>
> The message can be phrased as "%u foreign transactions remaining"

Fixed.

> +FdwXactResolveFdwXacts(int *fdwxact_idxs, int nfdwxacts)
>
> Fdw and Xact are repeated. Seems one should suffice. How about naming the method FdwXactResolveTransactions() ?
> Similar comment for FdwXactResolveOneFdwXact(FdwXact fdwxact)

Agreed. I changed them to ResolveFdwXacts() and ResolveOneFdwXact() respectively to avoid a long function name.

> For get_fdwxact():
>
> + /* This entry matches the condition */
> + found = true;
> + break;
>
> Instead of breaking and returning, you can return within the loop directly.

Fixed.

Those changes are incorporated into the latest version patches[1] I submitted today.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBYyA5O%2BFPN4Cs9YWiKjq319BvF5fYmKNsFTZfwTcWjQw%40mail.gmail.com

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
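The get_fdwxact() suggestion above, returning from inside the loop instead of setting a found flag and breaking, looks like this in a generic form. The entry type and match condition are illustrative, not the patch's actual structures:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative entry type; the real code matches on FdwXact fields. */
typedef struct Entry
{
    int dbid;
    int serverid;
} Entry;

/* Return the matching entry directly from inside the loop; NULL if none. */
const Entry *find_entry(const Entry *entries, int n, int dbid, int serverid)
{
    for (int i = 0; i < n; i++)
    {
        if (entries[i].dbid == dbid && entries[i].serverid == serverid)
            return &entries[i];   /* no flag + break needed */
    }
    return NULL;                  /* no entry matched */
}
```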
On 2021/01/27 14:08, Masahiko Sawada wrote:
> On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> <masao.fujii@oss.nttdata.com> wrote:
>>
>> You fixed some issues. But maybe you forgot to attach the latest patches?
>
> Yes, I've attached the updated patches.

Thanks for updating the patch! I tried to review 0001 and 0002 as a self-contained change.

+ * An FDW that implements both commit and rollback APIs can request to register
+ * the foreign transaction by FdwXactRegisterXact() to participate it to a
+ * group of distributed tranasction. The registered foreign transactions are
+ * identified by OIDs of server and user.

I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction can have the same combination of OIDs of server and user. For example, the following two SELECT queries start different foreign transactions, but their user OID is the same. Shouldn't the OID of the user mapping be used instead of the OID of the user?

CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
CREATE TABLE t(i int);
CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
BEGIN;
SELECT * FROM ft;
DROP USER MAPPING FOR postgres SERVER loopback ;
SELECT * FROM ft;
COMMIT;

+ /* Commit foreign transactions if any */
+ AtEOXact_FdwXact(true);

Don't we need to pass the XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT flag? Probably we don't need to do this if postgres_fdw is the only user of this new API. But if we make this new API a generic one, such flags seem necessary so that some foreign data wrappers might have different behaviors for those flags.

Because of the same reason as above, AtEOXact_FdwXact() should also be called after CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT : XACT_EVENT_COMMIT)?

+ /*
+ * Abort foreign transactions if any. This needs to be done before marking
+ * this transaction as not running since FDW's transaction callbacks might
+ * assume this transaction is still in progress.
+ */
+ AtEOXact_FdwXact(false);

Same as above.

+/*
+ * This function is called at PREPARE TRANSACTION. Since we don't support
+ * preparing foreign transactions yet, raise an error if the local transaction
+ * has any foreign transaction.
+ */
+void
+AtPrepare_FdwXact(void)
+{
+ if (FdwXactParticipants != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has operated on foreign tables")));
+}

This means that some foreign data wrappers supporting prepared transactions (though I'm not sure whether such wrappers actually exist or not) cannot use the new API? If we want to allow those wrappers to use the new API, AtPrepare_FdwXact() should call the prepare callback, and each wrapper should emit an error within the callback if necessary.

+ foreach(lc, FdwXactParticipants)
+ {
+ FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc);
+
+ if (fdw_part->server->serverid == serverid &&
+ fdw_part->usermapping->userid == userid)

Isn't this inefficient when starting lots of foreign transactions, because we need to scan all the entries in the list every time?

+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+ bool found;
+ ConnCacheEntry *entry;
+ ConnCacheKey key;
+
+ /* First time through, initialize connection cache hashtable */
+ if (ConnectionHash == NULL)
+ {
+ HASHCTL ctl;
+
+ ctl.keysize = sizeof(ConnCacheKey);
+ ctl.entrysize = sizeof(ConnCacheEntry);
+ ConnectionHash = hash_create("postgres_fdw connections", 8,
+ &ctl,
+ HASH_ELEM | HASH_BLOBS);

Currently ConnectionHash is created under TopMemoryContext. With the patch, since GetConnectionCacheEntry() can be called in other places, ConnectionHash may be created under a memory context other than TopMemoryContext? If so, is that safe?

- if (PQstatus(entry->conn) != CONNECTION_OK ||
- PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
- entry->changing_xact_state ||
- entry->invalidated)
...
+ if (PQstatus(entry->conn) != CONNECTION_OK ||
+ PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+ entry->changing_xact_state)

Why did you get rid of the condition "entry->invalidated"?

>
>> I'm reading the 0001 and 0002 patches to pick up the changes for postgres_fdw that are worth applying independently from the 2PC feature. If there are such changes, IMO we can apply them in advance, which would make the patches simpler.
>
> Thank you for reviewing the patches!
>
>> + if (PQresultStatus(res) != PGRES_COMMAND_OK)
>> + ereport(ERROR, (errmsg("could not commit transaction on server %s",
>> + frstate->server->servername)));
>>
>> You changed the code this way because you want to include the server name in the error message? I agree that it's helpful to report also the server name that caused an error. OTOH, since this change gets rid of the call to pgfdw_report_error() for the returned PGresult, the reported error message contains less information. If this understanding is right, I don't think that this change is an improvement.
>
> Right. It's better to use do_sql_command() instead.
>
>> Instead, if the server name should be included in the error message, pgfdw_report_error() should be changed so that it also reports the server name? If we do that, the server name is reported not only when COMMIT fails but also when other commands fail.
>>
>> Of course, if this change is not essential, we can skip doing this in the first version.
>
> Yes, I think it's not essential for now. We can improve it later if we want.
>
>> - /*
>> - * Regardless of the event type, we can now mark ourselves as out of the
>> - * transaction. (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
>> - * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
>> - */
>> - xact_got_connection = false;
>>
>> With this change, xact_got_connection seems to never be set to false. Doesn't this break pgfdw_subxact_callback(), which uses xact_got_connection?
>
> I think xact_got_connection is set to false in
> pgfdw_cleanup_after_transaction(), which is called at the end of each
> foreign transaction (i.e., in postgresCommitForeignTransaction() and
> postgresRollbackForeignTransaction()).
>
> But as you're concerned below, it's reset at each foreign transaction's
> end rather than at the parent transaction's end.
>
>> + /* Also reset cursor numbering for next transaction */
>> + cursor_number = 0;
>>
>> Originally this variable is reset to 0 once per transaction end. But with the patch, it's reset to 0 every time a foreign transaction ends at each connection. This change would fortunately be harmless in practice, but seems not right theoretically.
>>
>> This makes me wonder whether the new FDW API is not good at handling the case where some operations need to be performed once per transaction end.
>
> I think that the problem comes from the fact that the FDW needs to use
> both SubXactCallback and the new FDW API.
>
> If we want to perform some operations at the end of the top
> transaction per FDW, not per foreign transaction, we will either still
> need to use XactCallback or need to rethink the FDW API design. But
> given that we call the commit and rollback FDW APIs only for foreign
> servers that actually started a transaction, I'm not sure such
> operations exist in practice. IIUC there are none, at least from the
> normal (non-sub) transaction termination perspective.

One feature in my mind that may not match this new API is performing transaction commits on multiple servers in parallel. That's something like the following. As far as I can recall, another proposed version of the 2PC-on-postgres_fdw patch included that feature. If we want to implement this to increase the performance of transaction commit in the future, I'm afraid that the new API will prevent that.

foreach(foreign transactions)
    send commit command

foreach(foreign transactions)
    wait for reply of commit

On second thought, the new per-transaction commit/rollback callback is essential when users or the resolver process want to resolve the specified foreign transaction, but not essential when backends commit/rollback foreign transactions. That is, even if we add the new per-transaction API for users and the resolver process, backends can still use CallXactCallbacks() when they commit/rollback foreign transactions. Is this understanding right?

> IIUC xact_got_connection is used to skip iterating over all cached
> connections to find open remote (sub)transactions. This is not
> necessary anymore, at least from the normal transaction termination
> perspective. So maybe we can improve it so that it tracks whether any
> of the cached connections opened a subtransaction. That is, we set it
> to true when we create a savepoint on any connection, and set it to
> false at the end of pgfdw_subxact_callback() if we see that the
> xact_depth of every cached entry is less than or equal to 1 after
> iterating over all entries.

OK.

> Regarding cursor_number, it essentially needs to be unique at least
> within a transaction, so we can manage it per transaction or per
> connection. But the current postgres_fdw rather ensures uniqueness
> across all connections. So it seems to me that this can be fixed by
> making each connection have its own cursor_number and resetting it in
> pgfdw_cleanup_after_transaction(). I think this can be done in a
> separate patch.

Maybe, so let's work on this later, at least after we confirm that this change is really necessary.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
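The parallel-commit pattern in the pseudocode above (send every commit first, then collect the replies) can be modeled as a tiny standalone simulation. The connection states are invented for illustration; in real code the send would be an asynchronous COMMIT and the wait would read the reply from each socket:

```c
#include <assert.h>

/* Toy model of "send all commits first, then wait for all replies". */
typedef enum { CONN_IDLE, CONN_COMMIT_SENT, CONN_COMMITTED } ConnState;

static void send_commit(ConnState *c) { *c = CONN_COMMIT_SENT; }
static void wait_commit(ConnState *c) { if (*c == CONN_COMMIT_SENT) *c = CONN_COMMITTED; }

/*
 * Commit n foreign transactions in parallel: two passes over the
 * connections instead of a serial send+wait per connection, so all
 * commits are in flight at once. Returns how many committed.
 */
int parallel_commit(ConnState *conns, int n)
{
    int done = 0;

    for (int i = 0; i < n; i++)
        send_commit(&conns[i]);      /* pass 1: fire off every commit */

    for (int i = 0; i < n; i++)
    {
        wait_commit(&conns[i]);      /* pass 2: collect the replies */
        if (conns[i] == CONN_COMMITTED)
            done++;
    }
    return done;
}
```

The point of the two-pass structure is that the waits overlap: the total latency approaches the slowest server's commit time rather than the sum over all servers.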
On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2021/01/27 14:08, Masahiko Sawada wrote: > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao > > <masao.fujii@oss.nttdata.com> wrote: > >> > >> > >> You fixed some issues. But maybe you forgot to attach the latest patches? > > > > Yes, I've attached the updated patches. > > Thanks for updating the patch! I tried to review 0001 and 0002 as the self-contained change. > > + * An FDW that implements both commit and rollback APIs can request to register > + * the foreign transaction by FdwXactRegisterXact() to participate it to a > + * group of distributed tranasction. The registered foreign transactions are > + * identified by OIDs of server and user. > > I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transactions can havethe same combination of OIDs of server and user. For example, the following two SELECT queries start the different foreigntransactions but their user OID is the same. OID of user mapping should be used instead of OID of user? > > CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw; > CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres'); > CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres'); > CREATE TABLE t(i int); > CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't'); > BEGIN; > SELECT * FROM ft; > DROP USER MAPPING FOR postgres SERVER loopback ; > SELECT * FROM ft; > COMMIT; Good catch. I've considered using user mapping OID or a pair of user mapping OID and server OID as a key of foreign transactions but I think it also has a problem if an FDW caches the connection by pair of server OID and user OID whereas the core identifies them by user mapping OID. For instance, mysql_fdw manages connections by pair of server OID and user OID. 
For example, let's consider the following execution: BEGIN; SET ROLE user_A; INSERT INTO ft1 VALUES (1); SET ROLE user_B; INSERT INTO ft1 VALUES (1); COMMIT; Suppose that an FDW identifies the connections by {server OID, user OID} and the core GTM identifies the transactions by user mapping OID, and user_A and user_B use the public user mapping to connect server_X. In the FDW, there are two connections identified by {user_A, sever_X} and {user_B, server_X} respectively, and therefore opens two transactions on each connection, while GTM has only one FdwXact entry because the two connections refer to the same user mapping OID. As a result, at the end of the transaction, GTM ends only one foreign transaction, leaving another one. Using user mapping OID seems natural to me but I'm concerned that changing role in the middle of transaction is likely to happen than dropping the public user mapping but not sure. We would need to find more better way. > > + /* Commit foreign transactions if any */ > + AtEOXact_FdwXact(true); > > Don't we need to pass XACT_EVENT_PARALLEL_PRE_COMMIT or XACT_EVENT_PRE_COMMIT flag? Probably we don't need to do this ifpostgres_fdw is only user of this new API. But if we make this new API generic one, such flags seem necessary so that someforeign data wrappers might have different behaviors for those flags. > > Because of the same reason as above, AtEOXact_FdwXact() should also be called after CallXactCallbacks(is_parallel_worker? XACT_EVENT_PARALLEL_COMMIT : XACT_EVENT_COMMIT)? Agreed. In AtEOXact_FdwXact() we call either CommitForeignTransaction() or RollbackForeignTransaction() with FDWXACT_FLAG_ONEPHASE flag for each foreign transaction. So for example in commit case, we will call new FDW APIs in the following order: 1. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction. 2. Commit locally. 3. 
Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT flag and FDWXACT_FLAG_ONEPHASE flag for each foreign transaction. In the future when we have a new FDW API to prepare foreign transaction, the sequence will be: 1. Call PrepareForeignTransaction() for each foreign transaction. 2. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_PRE_COMMIT flag for each foreign transaction. 3. Commit locally. 4. Call CommitForeignTransaction() with XACT_EVENT_PARALLEL_COMMIT flag for each foreign transaction. So we expect FDW that wants to support 2PC not to commit foreign transaction if CommitForeignTransaction() is called with XACT_EVENT_PARALLEL_PRE_COMMIT flag and no FDWXACT_FLAG_ONEPHASE flag. > > + /* > + * Abort foreign transactions if any. This needs to be done before marking > + * this transaction as not running since FDW's transaction callbacks might > + * assume this transaction is still in progress. > + */ > + AtEOXact_FdwXact(false); > > Same as above. > > +/* > + * This function is called at PREPARE TRANSACTION. Since we don't support > + * preparing foreign transactions yet, raise an error if the local transaction > + * has any foreign transaction. > + */ > +void > +AtPrepare_FdwXact(void) > +{ > + if (FdwXactParticipants != NIL) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot PREPARE a transaction that has operated on foreign tables"))); > +} > > This means that some foreign data wrappers suppporting the prepare transaction (though I'm not sure if such wappers actuallyexist or not) cannot use the new API? If we want to allow those wrappers to use new API, AtPrepare_FdwXact() shouldcall the prepare callback and each wrapper should emit an error within the callback if necessary. I think if we support the prepare callback and allow FDWs to prepare foreign transactions, we have to call CommitForeignTransaction() on COMMIT PREPARED for foreign transactions that are associated with the local prepared transaction. 
But how can we know which foreign transactions are? Even a client who didn’t do PREPARE TRANSACTION could do COMMIT PREPARED. We would need to store the information of which foreign transactions are associated with the local transaction somewhere. The 0004 patch introduces WAL logging along with prepare API and we store that information to a WAL record. I think it’s better at this time to disallow PREPARE TRANSACTION when at least one foreign transaction is registered via FDW API. > > + foreach(lc, FdwXactParticipants) > + { > + FdwXactParticipant *fdw_part = (FdwXactParticipant *) lfirst(lc); > + > + if (fdw_part->server->serverid == serverid && > + fdw_part->usermapping->userid == userid) > > Isn't this ineffecient when starting lots of foreign transactions because we need to scan all the entries in the list everytime? Agreed. I'll change it to a hash map. > > +static ConnCacheEntry * > +GetConnectionCacheEntry(Oid umid) > +{ > + bool found; > + ConnCacheEntry *entry; > + ConnCacheKey key; > + > + /* First time through, initialize connection cache hashtable */ > + if (ConnectionHash == NULL) > + { > + HASHCTL ctl; > + > + ctl.keysize = sizeof(ConnCacheKey); > + ctl.entrysize = sizeof(ConnCacheEntry); > + ConnectionHash = hash_create("postgres_fdw connections", 8, > + &ctl, > + HASH_ELEM | HASH_BLOBS); > > Currently ConnectionHash is created under TopMemoryContext. With the patch, since GetConnectionCacheEntry() can be calledin other places, ConnectionHash may be created under the memory context other than TopMemoryContext? If so, that'ssafe? hash_create() creates a hash map under TopMemoryContext unless HASH_CONTEXT is specified. So I think ConnectionHash is still created in the same memory context. > > - if (PQstatus(entry->conn) != CONNECTION_OK || > - PQtransactionStatus(entry->conn) != PQTRANS_IDLE || > - entry->changing_xact_state || > - entry->invalidated) > ... 
> + if (PQstatus(entry->conn) != CONNECTION_OK || > + PQtransactionStatus(entry->conn) != PQTRANS_IDLE || > + entry->changing_xact_state) > > Why did you get rid of the condition "entry->invalidated"? My bad. I'll fix it. > > > > If we want to perform some operations at the end of the top > > transaction per FDW, not per foreign transaction, we will either still > > need to use XactCallback or need to rethink the FDW API design. But > > given that we call commit and rollback FDW API for only foreign > > servers that actually started a transaction, I’m not sure if there are > > such operations in practice. IIUC there is not at least from the > > normal (not-sub) transaction termination perspective. > > One feature in my mind that may not match with this new API is to perform transaction commits on multiple servers in parallel.That's something like the following. As far as I can recall, another proposed version of 2pc on postgres_fdw patchincluded that feature. If we want to implement this to increase the performance of transaction commit in the future,I'm afraid that new API will prevent that. > > foreach(foreign transactions) > send commit command > > foreach(foreign transactions) > wait for reply of commit What I'm thinking is to pass a flag, say FDWXACT_ASYNC, to Commit/RollbackForeignTransaction() and add a new API to wait for the operation to complete, say CompleteForeignTransaction(). If commit/rollback callback in an FDW is called with FDWXACT_ASYNC flag, it should send the command and immediately return the handler (e.g., PQsocket() in postgres_fdw). The GTM gathers the handlers and poll events on them. To complete the command, the GTM calls CompleteForeignTransaction() to wait for the command to complete. Please refer to XA specification for details (especially xa_complete() and TMASYNC flag). A pseudo-code is something like the followings: foreach (foreign transactions) call CommitForeignTransaction(FDWXACT_ASYNC); append the returned fd to the array. 
while (true)
{
    poll for events on fds;
    call CompleteForeignTransaction() for the fd's owner;
    if (success)
        remove fd from the array;
    if (array is empty)
        break;
}

> On second thought, the new per-transaction commit/rollback callback is essential when users or the resolver process want to resolve the specified foreign transaction, but not essential when backends commit/rollback foreign transactions. That is, even if we add the new per-transaction API for users and the resolver process, backends can still use CallXactCallbacks() when they commit/rollback foreign transactions. Is this understanding right?

I haven't tried that, but I think that's possible if we can know whether the commit/rollback callback (e.g., postgresCommitForeignTransaction() etc. in postgres_fdw) is called via SQL function (the pg_resolve_foreign_xact() SQL function) or called by the resolver process. That is, we register the foreign transaction via FdwXactRegisterXact(), do nothing in postgresCommit/RollbackForeignTransaction() if these are called by the backend, and perform COMMIT/ROLLBACK in pgfdw_xact_callback() in an asynchronous manner. On the other hand, if postgresCommit/RollbackForeignTransaction() is called via SQL function or by the resolver, these functions commit/rollback the transaction.

> > Regarding cursor_number, it essentially needs to be unique at least
> > within a transaction, so we can manage it per transaction or per
> > connection. But the current postgres_fdw rather ensures uniqueness
> > across all connections. So it seems to me that this can be fixed by
> > making each connection have its own cursor_number and resetting it in
> > pgfdw_cleanup_after_transaction(). I think this can be in a separate
> > patch.
>
> Maybe, so let's work on this later, at least after we confirm that
> this change is really necessary.

Okay.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >
> > On 2021/01/27 14:08, Masahiko Sawada wrote:
> > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao
> > > <masao.fujii@oss.nttdata.com> wrote:
> > >>
> > >> You fixed some issues. But maybe you forgot to attach the latest patches?
> > >
> > > Yes, I've attached the updated patches.
> >
> > Thanks for updating the patch! I tried to review 0001 and 0002 as a self-contained change.
> >
> > + * An FDW that implements both commit and rollback APIs can request to register
> > + * the foreign transaction by FdwXactRegisterXact() to participate it to a
> > + * group of distributed tranasction. The registered foreign transactions are
> > + * identified by OIDs of server and user.
> >
> > I'm afraid that the combination of OIDs of server and user is not unique. IOW, more than one foreign transaction can have the same combination of OIDs of server and user. For example, the following two SELECT queries start different foreign transactions, but their user OID is the same. Should the OID of the user mapping be used instead of the OID of the user?
> >
> > CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw;
> > CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user 'postgres');
> > CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user 'postgres');
> > CREATE TABLE t(i int);
> > CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS (table_name 't');
> > BEGIN;
> > SELECT * FROM ft;
> > DROP USER MAPPING FOR postgres SERVER loopback;
> > SELECT * FROM ft;
> > COMMIT;
>
> Good catch. I've considered using the user mapping OID, or a pair of
> user mapping OID and server OID, as the key for foreign transactions,
> but I think that also has a problem if an FDW caches the connection by
> the pair of server OID and user OID whereas the core identifies them
> by user mapping OID.
> For instance, mysql_fdw manages connections by the pair of server OID
> and user OID.
>
> For example, let's consider the following execution:
>
> BEGIN;
> SET ROLE user_A;
> INSERT INTO ft1 VALUES (1);
> SET ROLE user_B;
> INSERT INTO ft1 VALUES (1);
> COMMIT;
>
> Suppose that an FDW identifies the connections by {server OID, user
> OID}, the core GTM identifies the transactions by user mapping OID,
> and user_A and user_B use the public user mapping to connect to
> server_X. In the FDW, there are two connections identified by
> {user_A, server_X} and {user_B, server_X} respectively, and it
> therefore opens two transactions, one on each connection, while the
> GTM has only one FdwXact entry because the two connections refer to
> the same user mapping OID. As a result, at the end of the transaction,
> the GTM ends only one foreign transaction, leaving the other one.
>
> Using the user mapping OID seems natural to me, but I'm concerned that
> changing the role in the middle of a transaction is more likely to
> happen than dropping the public user mapping, though I'm not sure. We
> would need to find a better way.

After more thought, I'm inclined to think it's better to identify foreign transactions by user mapping OID. The main reason is that I think FDWs that manage connection caches by the pair of user OID and server OID potentially have a problem with the scenario Fujii-san mentioned. If an FDW has to use another user mapping (i.e., connection information) because the currently used user mapping was removed, it would have to disconnect the previous connection because it has to use the same connection cache. But at that time it doesn't know whether the transaction will be committed or aborted. Also, such an FDW has the same problem that postgres_fdw used to have: a backend establishes multiple connections with the same connection information if multiple local users use the public user mapping.
Even from the perspective of foreign transaction management, it makes more sense that foreign transactions correspond to the connections to foreign servers, not to the local connection information.

I can see that some FDW implementations such as mysql_fdw and firebird_fdw identify connections by the pair of server OID and user OID, but I think this is because they followed the old postgres_fdw code. I suspect there is no use case where an FDW needs to identify connections in that way. If the core GTM identifies them by user mapping OID, we would force those FDWs to change their approach, but I think that change would be the right improvement.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On 2021/03/17 12:03, Masahiko Sawada wrote:
> I've attached the updated version patch set.

Thanks for updating the patches! I'm now restarting my review of 2PC because I'd like to use this feature in PG15.

I think the following logic, by which the transaction resolver resolves and removes the fdwxact entries, needs to be fixed.

1. Check if pending fdwxact entries exist.

HoldInDoubtFdwXacts() checks if there are entries whose condition is InvalidBackendId and so on. After that it collects the indexes of those entries in the fdwxacts array. fdwXactLock is released at the end of this phase.

2. Resolve and remove the entries held in the 1st phase.

ResolveFdwXacts() resolves the status of each fdwxact entry using the indexes. At the end of resolving, the transaction resolver removes the entry from the fdwxacts array via remove_fdwxact().

The entry is removed as follows. Since the array is managed by index, the indexes collected in the 1st phase are meaningless afterwards:

/* Remove the entry from active array */
FdwXactCtl->num_fdwxacts--;
FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];

This seems to lead to resolving unexpected fdwxacts, and it can trigger the following assertion failure. That's how I noticed. For example, there is the case where a backend inserts a new fdwxact entry into the free slot from which the resolver removed an entry right before, and the resolver, because it uses the indexes collected in the 1st phase, accesses the new entry even though it doesn't need to be resolved yet.

Assert(fdwxact->locking_backend == MyBackendId);

The simple solution is to hold fdwXactLock exclusively the whole time, from the beginning of the 1st phase to the end of the 2nd phase. But I worry that the performance impact would be too big...

I came up with two solutions, although there may be better ones.

A. Remove resolved entries at once, after resolution for all held entries is finished.

If so, we don't need to take the exclusive lock for a long time.
But this has other problems: pg_remove_foreign_xact() can still remove entries, and we need to handle resolution failures.

I wondered whether we can solve the first problem by introducing a new lock, like a "removing lock", so that only processes which hold the lock can remove the entries. The performance impact is limited since insertion of fdwxact entries is not blocked by this lock. And the second problem can be solved using a try-catch block.

B. Merge the 1st and 2nd phases.

Now, the resolver resolves the entries together. That's the reason why it's difficult to remove the entries. So it seems to solve the problem to execute checking, resolving, and removing per entry. I think this is better since it is simpler than A.

If I'm missing something, please let me know.

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION
On Tue, Apr 27, 2021 at 10:03 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> On 2021/03/17 12:03, Masahiko Sawada wrote:
> > I've attached the updated version patch set.
>
> Thanks for updating the patches! I'm now restarting my review of 2PC
> because I'd like to use this feature in PG15.

Thank you for reviewing the patch! Much appreciated.

> I think the following logic of resolving and removing the fdwxact
> entries by the transaction resolver needs to be fixed.
>
> 1. Check if pending fdwxact entries exist.
>
> HoldInDoubtFdwXacts() checks if there are entries whose condition is
> InvalidBackendId and so on. After that it collects the indexes of those
> entries in the fdwxacts array. fdwXactLock is released at the end of
> this phase.
>
> 2. Resolve and remove the entries held in the 1st phase.
>
> ResolveFdwXacts() resolves the status of each fdwxact entry using the
> indexes. At the end of resolving, the transaction resolver removes the
> entry from the fdwxacts array via remove_fdwxact().
>
> The entry is removed as follows. Since the array is managed by index,
> the indexes collected in the 1st phase are meaningless afterwards:
>
> /* Remove the entry from active array */
> FdwXactCtl->num_fdwxacts--;
> FdwXactCtl->fdwxacts[i] = FdwXactCtl->fdwxacts[FdwXactCtl->num_fdwxacts];
>
> This seems to lead to resolving unexpected fdwxacts, and it can trigger
> the following assertion failure. That's how I noticed. For example,
> there is the case where a backend inserts a new fdwxact entry into the
> free slot from which the resolver removed an entry right before, and
> the resolver accesses the new entry which doesn't need to be resolved
> yet, because it uses the indexes collected in the 1st phase.
>
> Assert(fdwxact->locking_backend == MyBackendId);

Good point. I agree with your analysis.

> The simple solution is to hold fdwXactLock exclusively the whole time,
> from the beginning of the 1st phase to the end of the 2nd phase.
> But I worry that the performance impact would be too big...
>
> I came up with two solutions, although there may be better ones.
>
> A. Remove resolved entries at once, after resolution for all held
> entries is finished.
>
> If so, we don't need to take the exclusive lock for a long time. But
> this has other problems: pg_remove_foreign_xact() can still remove
> entries, and we need to handle resolution failures.
>
> I wondered whether we can solve the first problem by introducing a new
> lock, like a "removing lock", so that only processes which hold the
> lock can remove the entries. The performance impact is limited since
> insertion of fdwxact entries is not blocked by this lock. And the
> second problem can be solved using a try-catch block.
>
> B. Merge the 1st and 2nd phases.
>
> Now, the resolver resolves the entries together. That's the reason why
> it's difficult to remove the entries. So it seems to solve the problem
> to execute checking, resolving, and removing per entry. I think this is
> better since it is simpler than A. If I'm missing something, please let
> me know.

It seems to me that solution B would be simpler and better. I'll try to fix this issue by using solution B and rebase the patch.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> Hi,
> For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
Thank you for reviewing the patch!
>
> With this commit, the foreign server modified within the transaction marked as 'modified'.
>
> transaction marked -> transaction is marked
Will fix.
>
> +#define IsForeignTwophaseCommitRequested() \
> + (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>
> Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
But even if foreign_twophase_commit is
FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
there is only one modified server, right? It seems the name
IsForeignTwophaseCommitRequested is fine.
>
> +static bool
> +checkForeignTwophaseCommitRequired(bool local_modified)
>
> + if (!ServerSupportTwophaseCommit(fdw_part))
> + have_no_twophase = true;
> ...
> + if (have_no_twophase)
> + ereport(ERROR,
>
> It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
> Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
> have_no_twophase is no longer needed.
Hmm, I think if we process one 2pc-non-capable server first and then
process another 2pc-capable server, we should raise an error but
cannot detect that.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> > Hi,
>> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>>
>> Thank you for reviewing the patch!
>>
>> >
>> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >
>> > transaction marked -> transaction is marked
>>
>> Will fix.
>>
>> >
>> > +#define IsForeignTwophaseCommitRequested() \
>> > + (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >
>> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>>
>> But even if foreign_twophase_commit is
>> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> there is only one modified server, right? It seems the name
>> IsForeignTwophaseCommitRequested is fine.
>>
>> >
>> > +static bool
>> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >
>> > + if (!ServerSupportTwophaseCommit(fdw_part))
>> > + have_no_twophase = true;
>> > ...
>> > + if (have_no_twophase)
>> > + ereport(ERROR,
>> >
>> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> > Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> > have_no_twophase is no longer needed.
>>
>> Hmm, I think if we process one 2pc-non-capable server first and then
>> process another 2pc-capable server, we should raise an error but
>> cannot detect that.
>
>
> Then the check would stay as what you have in the patch:
>
> if (!ServerSupportTwophaseCommit(fdw_part))
>
> When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
> have_no_twophase can be dropped.
But if we processed only one non-2pc-capable server, we would raise an
error but should not in that case.
On second thought, I think we can track how many servers are modified
or not capable of 2PC during registration and unregistration. Then we
can check both whether 2PC is required and whether a non-2pc-capable
server is involved, without looking through all participants. Thoughts?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, May 3, 2021 at 11:11 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>
> On Mon, May 3, 2021 at 5:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Sun, May 2, 2021 at 1:23 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >
>> > On Fri, Apr 30, 2021 at 9:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> On Wed, Mar 17, 2021 at 6:03 PM Zhihong Yu <zyu@yugabyte.com> wrote:
>> >> >
>> >> > Hi,
>> >> > For v35-0007-Prepare-foreign-transactions-at-commit-time.patch :
>> >>
>> >> Thank you for reviewing the patch!
>> >>
>> >> > With this commit, the foreign server modified within the transaction marked as 'modified'.
>> >> >
>> >> > transaction marked -> transaction is marked
>> >>
>> >> Will fix.
>> >>
>> >> > +#define IsForeignTwophaseCommitRequested() \
>> >> > + (foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED)
>> >> >
>> >> > Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired.
>> >>
>> >> But even if foreign_twophase_commit is
>> >> FOREIGN_TWOPHASE_COMMIT_REQUIRED, the two-phase commit is not used if
>> >> there is only one modified server, right? It seems the name
>> >> IsForeignTwophaseCommitRequested is fine.
>> >>
>> >> > +static bool
>> >> > +checkForeignTwophaseCommitRequired(bool local_modified)
>> >> >
>> >> > + if (!ServerSupportTwophaseCommit(fdw_part))
>> >> > + have_no_twophase = true;
>> >> > ...
>> >> > + if (have_no_twophase)
>> >> > + ereport(ERROR,
>> >> >
>> >> > It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s).
>> >> > Accordingly, nserverswritten should be incremented for the local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1).
>> >> > have_no_twophase is no longer needed.
>> >>
>> >> Hmm, I think if we process one 2pc-non-capable server first and then
>> >> process another 2pc-capable server, we should raise an error but
>> >> cannot detect that.
>> >
>> > Then the check would stay as what you have in the patch:
>> >
>> > if (!ServerSupportTwophaseCommit(fdw_part))
>> >
>> > When the non-2pc-capable server is encountered, we would report the error in place (following the ServerSupportTwophaseCommit check) and come out of the loop.
>> > have_no_twophase can be dropped.
>>
>> But if we processed only one non-2pc-capable server, we would raise an
>> error but should not in that case.
>>
>> On second thought, I think we can track how many servers are modified
>> or not capable of 2PC during registration and unregistration. Then we
>> can check both whether 2PC is required and whether a non-2pc-capable
>> server is involved, without looking through all participants. Thoughts?
>
> That is something worth trying.

I've attached the updated patches that incorporated comments from
Zhihong and Ikeda-san.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachment
- v36-0009-Add-regression-tests-for-foreign-twophase-commit.patch
- v36-0007-Add-GetPrepareId-API.patch
- v36-0008-Documentation-update.patch
- v36-0006-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v36-0005-Prepare-foreign-transactions-at-commit-time.patch
- v36-0004-postgres_fdw-supports-prepare-API.patch
- v36-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v36-0003-Support-two-phase-commit-for-foreign-transaction.patch
- v36-0001-Introduce-transaction-manager-for-foreign-transa.patch
On 2021/05/11 13:37, Masahiko Sawada wrote:
> I've attached the updated patches that incorporated comments from
> Zhihong and Ikeda-san.

Thanks for updating the patches!

I have other comments, including trivial things.


a. about the "foreign_transaction_resolver_timeout" parameter

Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs. Is there any reason? Although the following is a minor case, it may confuse some users.

An example case:

1. a client executes a transaction with 2PC while the resolver is processing FdwXactResolverProcessInDoubtXacts().

2. the resolution of the 1st transaction must wait until the other 2PC transactions are executed, or until the timeout.

3. if the client checks the 1st result value, it should wait until resolution is finished for atomic visibility (although this depends on the way atomic visibility is realized). The clients may be kept waiting up to "foreign_transaction_resolver_timeout". Users may think the transaction is stale.

A situation like this can be observed after testing with pgbench: some unresolved transactions remain after benchmarking.

I assume that this default value follows wal_sender, archiver, and so on. But I think this parameter is more like "commit_delay". If so, 60 seconds seems to be a big value.


b. about a performance bottleneck (just sharing my simple benchmark results)

The resolver process can easily become a performance bottleneck, although I think some users want this feature even if the performance is not so good.

I tested with a very simple workload on my laptop. The test conditions are

* two remote foreign partitions, and one transaction inserts an entry into each partition.
* local connections only. If NW latency became higher, the performance became worse.
* pgbench with 8 clients.

The test results are the following. The performance with 2PC is only 10% of the performance without 2PC.
* with foreign_twophase_commit = required
-> If loaded with more than 10 TPS, the number of unresolved foreign transactions keeps increasing and the run stops with the warning "Increase max_prepared_foreign_transactions".

* with foreign_twophase_commit = disabled
-> 122 TPS in my environment.


c. v36-0001-Introduce-transaction-manager-for-foreign-transa.patch

* typo: s/tranasction/transaction/

* Is it better to move AtEOXact_FdwXact() in AbortTransaction() to before "if (IsInParallelMode())", to make the calls run in the same order as in CommitTransaction()?

* function names in fdwxact.c

Although this is just my feeling, "xact" means transaction. If you feel the same, the function names FdwXactRegisterXact and so on seem odd to me. Would FdwXactRegisterEntry or FdwXactRegisterParticipant be better?

* Are the following better?

- s/to register the foreign transaction by/to register the foreign transaction participant by/

- s/The registered foreign transactions/The registered participants/

- s/given foreign transaction/given foreign transaction participant/

- s/Foreign transactions involved in the current transaction/Foreign transaction participants involved in the current transaction/

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION
On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> On 2021/05/11 13:37, Masahiko Sawada wrote:
> > I've attached the updated patches that incorporated comments from
> > Zhihong and Ikeda-san.
>
> Thanks for updating the patches!
>
> I have other comments, including trivial things.
>
> a. about the "foreign_transaction_resolver_timeout" parameter
>
> Now, the default value of "foreign_transaction_resolver_timeout" is 60
> secs. Is there any reason? Although the following is a minor case, it
> may confuse some users.
>
> An example case:
>
> 1. a client executes a transaction with 2PC while the resolver is
> processing FdwXactResolverProcessInDoubtXacts().
>
> 2. the resolution of the 1st transaction must wait until the other 2PC
> transactions are executed, or until the timeout.
>
> 3. if the client checks the 1st result value, it should wait until
> resolution is finished for atomic visibility (although this depends on
> the way atomic visibility is realized). The clients may be kept waiting
> up to "foreign_transaction_resolver_timeout". Users may think the
> transaction is stale.
>
> A situation like this can be observed after testing with pgbench. Some
> unresolved transactions remain after benchmarking.
>
> I assume that this default value follows wal_sender, archiver, and so
> on. But I think this parameter is more like "commit_delay". If so, 60
> seconds seems to be a big value.

IIUC this situation sounds like the foreign transaction resolution is the bottleneck and doesn't catch up with incoming resolution requests. But how does foreign_transaction_resolver_timeout relate to this situation? foreign_transaction_resolver_timeout controls when to terminate a resolver process that doesn't have any foreign transactions to resolve. So if we set it to several milliseconds, resolver processes are terminated immediately after each resolution, imposing the cost of launching resolver processes on the next resolution.

> b.
about performance bottleneck (just share my simple benchmark results) > > The resolver process can be performance bottleneck easily although I think > some users want this feature even if the performance is not so good. > > I tested with very simple workload in my laptop. > > The test condition is > * two remote foreign partitions and one transaction inserts an entry in each > partitions. > * local connection only. If NW latency became higher, the performance became > worse. > * pgbench with 8 clients. > > The test results is the following. The performance of 2PC is only 10% > performance of the one of without 2PC. > > * with foreign_twophase_commit = requried > -> If load with more than 10TPS, the number of unresolved foreign transactions > is increasing and stop with the warning "Increase > max_prepared_foreign_transactions". What was the value of max_prepared_foreign_transactions? To speed up the foreign transaction resolution, some ideas have been discussed. As another idea, how about launching resolvers for each foreign server? That way, we resolve foreign transactions on each foreign server in parallel. If foreign transactions are concentrated on the particular server, we can have multiple resolvers for the one foreign server. It doesn’t change the fact that all foreign transaction resolutions are processed by resolver processes. Apart from that, we also might want to improve foreign transaction management so that transaction doesn’t end up with an error if the foreign transaction resolution doesn’t catch up with incoming transactions that require 2PC. Maybe we can evict and serialize a state file when FdwXactCtl->xacts[] is full. I’d like to leave it as a future improvement. > * with foreign_twophase_commit = disabled > -> 122TPS in my environments. How much is the performance without those 2PC patches and with the same workload? i.e., how fast is the current postgres_fdw that uses XactCallback? > > > c. 
v36-0001-Introduce-transaction-manager-for-foreign-transa.patch > > * typo: s/tranasction/transaction/ > > * Is it better to move AtEOXact_FdwXact() in AbortTransaction() to before "if > (IsInParallelMode())" because make them in the same order as CommitTransaction()? I'd prefer to move AtEOXact_FdwXact() in CommitTransaction after "if (IsInParallelMode())" since other pre-commit works are done after cleaning parallel contexts. What do you think? > > * functions name of fdwxact.c > > Although this depends on my feeling, xact means transaction. If this feeling > same as you, the function names of FdwXactRegisterXact and so on are odd to > me. FdwXactRegisterEntry or FdwXactRegisterParticipant is better? > FdwXactRegisterEntry sounds good to me. Thanks. > * Are the following better? > > - s/to register the foreign transaction by/to register the foreign transaction > participant by/ > > - s/The registered foreign transactions/The registered participants/ > > - s/given foreign transaction/given foreign transaction participant/ > > - s/Foreign transactions involved in the current transaction/Foreign > transaction participants involved in the current transaction/ Agreed with the above suggestions. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
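[To make the flow under discussion concrete for readers of the thread archive: on each 2PC-capable foreign server, the patch set replaces the single remote COMMIT with the standard PostgreSQL two-phase commands, roughly as follows. The transaction identifier format here is illustrative only, not the patch's actual naming scheme.

```sql
-- Phase 1: issued by the backend during local pre-commit,
-- once per participating foreign server.
PREPARE TRANSACTION 'fx_1_100_200';

-- Phase 2: issued later, asynchronously, by a resolver process
-- (or ROLLBACK PREPARED if the local transaction aborted).
COMMIT PREPARED 'fx_1_100_200';
```

This is why each distributed commit costs at least one extra round trip and one extra WAL flush per foreign server, and why the remote side must have max_prepared_transactions set above zero.]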
On 2021/05/21 10:39, Masahiko Sawada wrote: > On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >> >> >> On 2021/05/11 13:37, Masahiko Sawada wrote: >>> I've attached the updated patches that incorporated comments from >>> Zhihong and Ikeda-san. >> >> Thanks for updating the patches! >> >> >> I have other comments including trivial things. >> >> >> a. about "foreign_transaction_resolver_timeout" parameter >> >> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs. >> Is there any reason? Although the following is minor case, it may confuse some >> users. >> >> Example case is that >> >> 1. a client executes transaction with 2PC when the resolver is processing >> FdwXactResolverProcessInDoubtXacts(). >> >> 2. the resolution of 1st transaction must be waited until the other >> transactions for 2pc are executed or timeout. >> >> 3. if the client check the 1st result value, it should wait until resolution >> is finished for atomic visibility (although it depends on the way how to >> realize atomic visibility.) The clients may be waited >> foreign_transaction_resolver_timeout". Users may think it's stale. >> >> Like this situation can be observed after testing with pgbench. Some >> unresolved transaction remains after benchmarking. >> >> I assume that this default value refers to wal_sender, archiver, and so on. >> But, I think this parameter is more like "commit_delay". If so, 60 seconds >> seems to be big value. > > IIUC this situation seems like the foreign transaction resolution is > bottle-neck and doesn’t catch up to incoming resolution requests. But > how foreignt_transaction_resolver_timeout relates to this situation? > foreign_transaction_resolver_timeout controls when to terminate the > resolver process that doesn't have any foreign transactions to > resolve. 
So if we set it several milliseconds, resolver processes are > terminated immediately after each resolution, imposing the cost of > launching resolver processes on the next resolution. Thanks for your comments! No, this situation is not related to whether foreign transaction resolution is the bottleneck. This issue may happen when the workload has very few foreign transactions. If a new foreign transaction comes in while the transaction resolver is processing resolutions via FdwXactResolverProcessInDoubtXacts(), that foreign transaction has to wait for the next resolution cycle to start. If no further foreign transaction comes in, it must wait for resolution until the timeout expires. That is the situation I mentioned. Thanks for letting me know the side effect of setting the resolution timeout to several milliseconds. I agree. But why is termination needed? Is there a possibility of going stale, like a walsender? >> >> >> b. about performance bottleneck (just share my simple benchmark results) >> >> The resolver process can be performance bottleneck easily although I think >> some users want this feature even if the performance is not so good. >> >> I tested with very simple workload in my laptop. >> >> The test condition is >> * two remote foreign partitions and one transaction inserts an entry in each >> partitions. >> * local connection only. If NW latency became higher, the performance became >> worse. >> * pgbench with 8 clients. >> >> The test results is the following. The performance of 2PC is only 10% >> performance of the one of without 2PC. >> >> * with foreign_twophase_commit = requried >> -> If load with more than 10TPS, the number of unresolved foreign transactions >> is increasing and stop with the warning "Increase >> max_prepared_foreign_transactions". > > What was the value of max_prepared_foreign_transactions? I tested with 200. If each resolution finishes very quickly, I thought that would be enough because 8 clients x 2 partitions = 16, though... 
But it's difficult to know what a stable value is. > To speed up the foreign transaction resolution, some ideas have been > discussed. As another idea, how about launching resolvers for each > foreign server? That way, we resolve foreign transactions on each > foreign server in parallel. If foreign transactions are concentrated > on the particular server, we can have multiple resolvers for the one > foreign server. It doesn’t change the fact that all foreign > transaction resolutions are processed by resolver processes. Awesome! There seems to be another advantage: even if a foreign server is temporarily busy or stopped due to failover, other foreign servers' transactions can still be resolved. > Apart from that, we also might want to improve foreign transaction > management so that transaction doesn’t end up with an error if the > foreign transaction resolution doesn’t catch up with incoming > transactions that require 2PC. Maybe we can evict and serialize a > state file when FdwXactCtl->xacts[] is full. I’d like to leave it as a > future improvement. Oh, great! I hadn't come up with that idea. Although I thought this feature would make it difficult to know whether foreign transactions are being resolved steadily, DBAs can check the "pg_foreign_xacts" view now, and it would be enough to log it when foreign transactions are spilled out. >> * with foreign_twophase_commit = disabled >> -> 122TPS in my environments. > > How much is the performance without those 2PC patches and with the > same workload? i.e., how fast is the current postgres_fdw that uses > XactCallback? OK, I'll test. >> c. 
> > I'd prefer to move AtEOXact_FdwXact() in CommitTransaction after "if > (IsInParallelMode())" since other pre-commit works are done after > cleaning parallel contexts. What do you think? OK, I agree. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
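[For reference while reading the thread, the knobs discussed above would be configured on the coordinator roughly like this. The parameter names are the ones used in this thread; the values shown are the defaults or test values mentioned above, not recommendations.

```
foreign_twophase_commit = disabled           # 'required' turns on 2PC for FDW writes
max_prepared_foreign_transactions = 200      # value used in Ikeda-san's benchmark
foreign_transaction_resolver_timeout = 60s   # terminate resolvers idle this long
```

The remote servers additionally need max_prepared_transactions set above zero, since PREPARE TRANSACTION is rejected otherwise.]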
On Fri, May 21, 2021 at 12:45 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > > On 2021/05/21 10:39, Masahiko Sawada wrote: > > On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > >> > >> > >> On 2021/05/11 13:37, Masahiko Sawada wrote: > >>> I've attached the updated patches that incorporated comments from > >>> Zhihong and Ikeda-san. > >> > >> Thanks for updating the patches! > >> > >> > >> I have other comments including trivial things. > >> > >> > >> a. about "foreign_transaction_resolver_timeout" parameter > >> > >> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs. > >> Is there any reason? Although the following is minor case, it may confuse some > >> users. > >> > >> Example case is that > >> > >> 1. a client executes transaction with 2PC when the resolver is processing > >> FdwXactResolverProcessInDoubtXacts(). > >> > >> 2. the resolution of 1st transaction must be waited until the other > >> transactions for 2pc are executed or timeout. > >> > >> 3. if the client check the 1st result value, it should wait until resolution > >> is finished for atomic visibility (although it depends on the way how to > >> realize atomic visibility.) The clients may be waited > >> foreign_transaction_resolver_timeout". Users may think it's stale. > >> > >> Like this situation can be observed after testing with pgbench. Some > >> unresolved transaction remains after benchmarking. > >> > >> I assume that this default value refers to wal_sender, archiver, and so on. > >> But, I think this parameter is more like "commit_delay". If so, 60 seconds > >> seems to be big value. > > > > IIUC this situation seems like the foreign transaction resolution is > > bottle-neck and doesn’t catch up to incoming resolution requests. But > > how foreignt_transaction_resolver_timeout relates to this situation? 
> > foreign_transaction_resolver_timeout controls when to terminate the > > resolver process that doesn't have any foreign transactions to > > resolve. So if we set it several milliseconds, resolver processes are > > terminated immediately after each resolution, imposing the cost of > > launching resolver processes on the next resolution. > > Thanks for your comments! > > No, this situation is not related to the foreign transaction resolution is > bottle-neck or not. This issue may happen when the workload has very few > foreign transactions. > > If new foreign transaction comes while the transaction resolver is processing > resolutions via FdwXactResolverProcessInDoubtXacts(), the foreign transaction > waits until starting next transaction resolution. If next foreign transaction > doesn't come, the foreign transaction must wait starting resolution until > timeout. I mentioned this situation. Thanks for your explanation. I think that in this case we should set the latch of the resolver after preparing all foreign transactions so that the resolver processes those transactions without sleeping. > > Thanks for letting me know the side effect if setting resolution timeout to > several milliseconds. I agree. But, why termination is needed? Is there a > possibility to stale like walsender? The purpose of this timeout is to terminate resolvers that are idle for a long time. The resolver processes don't necessarily need to keep running all the time for every database. On the other hand, launching a resolver process per commit would be costly. So we keep resolver processes running for at least foreign_transaction_resolver_timeout. > > > >> > >> > >> b. about performance bottleneck (just share my simple benchmark results) > >> > >> The resolver process can be performance bottleneck easily although I think > >> some users want this feature even if the performance is not so good. > >> > >> I tested with very simple workload in my laptop. 
> >> > >> The test condition is > >> * two remote foreign partitions and one transaction inserts an entry in each > >> partitions. > >> * local connection only. If NW latency became higher, the performance became > >> worse. > >> * pgbench with 8 clients. > >> > >> The test results is the following. The performance of 2PC is only 10% > >> performance of the one of without 2PC. > >> > >> * with foreign_twophase_commit = requried > >> -> If load with more than 10TPS, the number of unresolved foreign transactions > >> is increasing and stop with the warning "Increase > >> max_prepared_foreign_transactions". > > > > What was the value of max_prepared_foreign_transactions? > > Now, I tested with 200. > > If each resolution is finished very soon, I thought it's enough because > 8clients x 2partitions = 16, though... But, it's difficult how to know the > stable values. To resolve one distributed transaction, the resolver needs both one round trip and an fsync of a WAL record for each foreign transaction. Since the client doesn’t wait for the distributed transaction to be resolved, the resolver process can easily become the bottleneck given there are 8 clients. If foreign transactions were resolved synchronously, 16 would suffice. > > > > To speed up the foreign transaction resolution, some ideas have been > > discussed. As another idea, how about launching resolvers for each > > foreign server? That way, we resolve foreign transactions on each > > foreign server in parallel. If foreign transactions are concentrated > > on the particular server, we can have multiple resolvers for the one > > foreign server. > > Awesome! There seems to be another pros that even if a foreign server is > temporarily busy or stopped due to fail over, other foreign server's > transactions can be resolved. Yes. 
We also might need to be careful about the order of foreign transaction resolution. I think we need to resolve foreign transactions in arrival order at least within a foreign server. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2021/05/21 13:45, Masahiko Sawada wrote: > On Fri, May 21, 2021 at 12:45 PM Masahiro Ikeda > <ikedamsh@oss.nttdata.com> wrote: >> >> >> >> On 2021/05/21 10:39, Masahiko Sawada wrote: >>> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >>>> >>>> >>>> On 2021/05/11 13:37, Masahiko Sawada wrote: >>>>> I've attached the updated patches that incorporated comments from >>>>> Zhihong and Ikeda-san. >>>> >>>> Thanks for updating the patches! >>>> >>>> >>>> I have other comments including trivial things. >>>> >>>> >>>> a. about "foreign_transaction_resolver_timeout" parameter >>>> >>>> Now, the default value of "foreign_transaction_resolver_timeout" is 60 secs. >>>> Is there any reason? Although the following is minor case, it may confuse some >>>> users. >>>> >>>> Example case is that >>>> >>>> 1. a client executes transaction with 2PC when the resolver is processing >>>> FdwXactResolverProcessInDoubtXacts(). >>>> >>>> 2. the resolution of 1st transaction must be waited until the other >>>> transactions for 2pc are executed or timeout. >>>> >>>> 3. if the client check the 1st result value, it should wait until resolution >>>> is finished for atomic visibility (although it depends on the way how to >>>> realize atomic visibility.) The clients may be waited >>>> foreign_transaction_resolver_timeout". Users may think it's stale. >>>> >>>> Like this situation can be observed after testing with pgbench. Some >>>> unresolved transaction remains after benchmarking. >>>> >>>> I assume that this default value refers to wal_sender, archiver, and so on. >>>> But, I think this parameter is more like "commit_delay". If so, 60 seconds >>>> seems to be big value. >>> >>> IIUC this situation seems like the foreign transaction resolution is >>> bottle-neck and doesn’t catch up to incoming resolution requests. But >>> how foreignt_transaction_resolver_timeout relates to this situation? 
>>> foreign_transaction_resolver_timeout controls when to terminate the >>> resolver process that doesn't have any foreign transactions to >>> resolve. So if we set it several milliseconds, resolver processes are >>> terminated immediately after each resolution, imposing the cost of >>> launching resolver processes on the next resolution. >> >> Thanks for your comments! >> >> No, this situation is not related to the foreign transaction resolution is >> bottle-neck or not. This issue may happen when the workload has very few >> foreign transactions. >> >> If new foreign transaction comes while the transaction resolver is processing >> resolutions via FdwXactResolverProcessInDoubtXacts(), the foreign transaction >> waits until starting next transaction resolution. If next foreign transaction >> doesn't come, the foreign transaction must wait starting resolution until >> timeout. I mentioned this situation. > > Thanks for your explanation. I think that in this case we should set > the latch of the resolver after preparing all foreign transactions so > that the resolver process those transactions without sleep. Yes, your idea is much better. Thanks! >> >> Thanks for letting me know the side effect if setting resolution timeout to >> several milliseconds. I agree. But, why termination is needed? Is there a >> possibility to stale like walsender? > > The purpose of this timeout is to terminate resolvers that are idle > for a long time. The resolver processes don't necessarily need to keep > running all the time for every database. On the other hand, launching > a resolver process per commit would be a high cost. So we have > resolver processes keep running at least for > foreign_transaction_resolver_timeout. Understood. I think it's reasonable. >>>> >>>> >>>> b. 
about performance bottleneck (just share my simple benchmark results) >>>> >>>> The resolver process can be performance bottleneck easily although I think >>>> some users want this feature even if the performance is not so good. >>>> >>>> I tested with very simple workload in my laptop. >>>> >>>> The test condition is >>>> * two remote foreign partitions and one transaction inserts an entry in each >>>> partitions. >>>> * local connection only. If NW latency became higher, the performance became >>>> worse. >>>> * pgbench with 8 clients. >>>> >>>> The test results is the following. The performance of 2PC is only 10% >>>> performance of the one of without 2PC. >>>> >>>> * with foreign_twophase_commit = requried >>>> -> If load with more than 10TPS, the number of unresolved foreign transactions >>>> is increasing and stop with the warning "Increase >>>> max_prepared_foreign_transactions". >>> >>> What was the value of max_prepared_foreign_transactions? >> >> Now, I tested with 200. >> >> If each resolution is finished very soon, I thought it's enough because >> 8clients x 2partitions = 16, though... But, it's difficult how to know the >> stable values. > > During resolving one distributed transaction, the resolver needs both > one round trip and fsync-ing WAL record for each foreign transaction. > Since the client doesn’t wait for the distributed transaction to be > resolved, the resolver process can be easily bottle-neck given there > are 8 clients. > > If foreign transaction resolution was resolved synchronously, 16 would suffice. OK, thanks. >> >> >>> To speed up the foreign transaction resolution, some ideas have been >>> discussed. As another idea, how about launching resolvers for each >>> foreign server? That way, we resolve foreign transactions on each >>> foreign server in parallel. If foreign transactions are concentrated >>> on the particular server, we can have multiple resolvers for the one >>> foreign server. 
It doesn’t change the fact that all foreign >>> transaction resolutions are processed by resolver processes. >> >> Awesome! There seems to be another pros that even if a foreign server is >> temporarily busy or stopped due to fail over, other foreign server's >> transactions can be resolved. > > Yes. We also might need to be careful about the order of foreign > transaction resolution. I think we need to resolve foreign > transactions in arrival order at least within a foreign server. I agree that's better. (Though this is just my curiosity...) Is it necessary? This idea seems to be aimed at atomic visibility, but 2PC can't realize that, as you know. So I wondered about it. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
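[As a concrete example of the monitoring point above: with the patch applied, a DBA could check for lingering unresolved entries with something like the following. The view name pg_foreign_xacts comes from this thread; the exact column list is not shown here because it depends on the patch version.

```sql
-- How many foreign transaction entries are currently outstanding?
SELECT count(*) FROM pg_foreign_xacts;
```
]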
On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > > On 2021/05/21 13:45, Masahiko Sawada wrote: > > > > Yes. We also might need to be careful about the order of foreign > > transaction resolution. I think we need to resolve foreign > > transactions in arrival order at least within a foreign server. > > I agree it's better. > > (Although this is my interest...) > Is it necessary? Although this idea seems to be for atomic visibility, > 2PC can't realize that as you know. So, I wondered that. I think it's for fairness. If a foreign transaction that arrived earlier often gets put off in favor of foreign transactions that arrived later, due to its index in FdwXactCtl->xacts, that is neither understandable for users nor fair. I think it’s better to handle foreign transactions in a FIFO manner (although this problem exists even in the current code). Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2021/05/25 21:59, Masahiko Sawada wrote: > On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >> >> On 2021/05/21 13:45, Masahiko Sawada wrote: >>> >>> Yes. We also might need to be careful about the order of foreign >>> transaction resolution. I think we need to resolve foreign >>> transactions in arrival order at least within a foreign server. >> >> I agree it's better. >> >> (Although this is my interest...) >> Is it necessary? Although this idea seems to be for atomic visibility, >> 2PC can't realize that as you know. So, I wondered that. > > I think it's for fairness. If a foreign transaction arrived earlier > gets put off so often for other foreign transactions arrived later due > to its index in FdwXactCtl->xacts, it’s not understandable for users > and not fair. I think it’s better to handle foreign transactions in > FIFO manner (although this problem exists even in the current code). OK, thanks. On 2021/05/21 12:45, Masahiro Ikeda wrote: > On 2021/05/21 10:39, Masahiko Sawada wrote: >> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >> How much is the performance without those 2PC patches and with the >> same workload? i.e., how fast is the current postgres_fdw that uses >> XactCallback? > > OK, I'll test. The test results are as follows. But I couldn't confirm any performance difference from the 2PC patches, though I may need to change the test conditions. [condition] * 1 coordinator and 3 foreign servers * There are two custom scripts, each of which accesses two different foreign servers per transaction ``` fxact_select.pgbench BEGIN; SELECT * FROM part:p1 WHERE id = :id; SELECT * FROM part:p2 WHERE id = :id; COMMIT; ``` ``` fxact_update.pgbench BEGIN; UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id; UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id; COMMIT; ``` [results] I have tested three times. Performance difference seems to be within the range of errors. 
# 6d0eb38557 with 2PC patches (v36) and foreign_twophase_commit = disabled - fxact_update.pgbench 72.3, 74.9, 77.5 TPS => avg 74.9 TPS 110.5, 106.8, 103.2 ms => avg 106.8 ms - fxact_select.pgbench 1767.6, 1737.1, 1717.4 TPS => avg 1740.7 TPS 4.5, 4.6, 4.7 ms => avg 4.6 ms # 6d0eb38557 without 2PC patches - fxact_update.pgbench 76.5, 70.6, 69.5 TPS => avg 72.2 TPS 104.534, 113.244, 115.097 ms => avg 111.0 ms - fxact_select.pgbench 1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS 4.2, 4.6, 4.6 ms => avg 4.5 ms # About the bottleneck of the resolver process I investigated the performance bottleneck of the resolver process using perf. The main bottlenecks are the following functions. 1st. 42.8% routine->CommitForeignTransaction() 2nd. 31.5% remove_fdwxact() 3rd. 10.16% CommitTransaction() The 1st and 3rd problems can be solved by parallelizing resolver processes per remote server. But I wondered whether the idea that backends also issue "COMMIT/ABORT PREPARED" themselves, with the resolver process only taking charge of resolving in-doubt foreign transactions, is better. In many cases, I think that the number of connections is much greater than the number of remote servers. If so, the parallelization is not enough. So I think the idea that backends execute "COMMIT PREPARED" synchronously is better. Citus has a 2PC feature and its backends send "COMMIT PREPARED" in the extension, so this idea is not bad. Although resolving asynchronously has a performance benefit, we can't take advantage of it because the resolver process can easily become the bottleneck now. For the 2nd, remove_fdwxact() syncs the WAL record which indicates that the foreign transaction entry is removed. Is it necessary to sync immediately? Removing the sync means the recovery phase may take longer because "COMMIT/ABORT PREPARED" needs to be issued again for some fdwxact entries, but I think the effect is limited. # About other trivial comments. * Is it better to call pgstat_send_wal() in the resolver process? 
* Is it better to specify that only one resolver process can be launched in one database in the description of "max_foreign_transaction_resolvers"? * Is it intentional that blank lines are removed and inserted in foreigncmds.c? * Is it better that "max_prepared_foreign_transactions=%d" is after "max_prepared_xacts=%d" in xlogdesc.c? * Is "fdwxact_queue" unnecessary now? * Is the following " + sizeof(FdwXactResolver)" unnecessary? #define SizeOfFdwXactResolverCtlData \ (offsetof(FdwXactResolverCtlData, resolvers) + sizeof(FdwXactResolver)) Although MultiXactStateData treats the backendIds as 1-indexed, the resolvers are 0-indexed. Sorry if my understanding is wrong. * s/transaciton/transaction/ * s/foreign_xact_resolution_retry_interval since last resolver/foreign_xact_resolution_retry_interval since last resolver was/ * Don't we need a debug log in the following code in postgres.c, like for the logical replication launcher shutdown? else if (IsFdwXactLauncher()) { /* * The foreign transaction launcher can be stopped at any time. * Use exit status 1 so the background worker is restarted. */ proc_exit(1); } * Is pg_stop_foreign_xact_resolver(PG_FUNCTION_ARGS) not documented? * Is it better to change "when arrived a requested by backend process." to "when a request from a backend process arrives."? Regards, -- Masahiro Ikeda NTT DATA CORPORATION
On Thu, Jun 3, 2021 at 1:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > > > On 2021/05/25 21:59, Masahiko Sawada wrote: > > On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > >> > >> On 2021/05/21 13:45, Masahiko Sawada wrote: > >>> > >>> Yes. We also might need to be careful about the order of foreign > >>> transaction resolution. I think we need to resolve foreign> transactions in arrival order at least within a foreignserver. > >> > >> I agree it's better. > >> > >> (Although this is my interest...) > >> Is it necessary? Although this idea seems to be for atomic visibility, > >> 2PC can't realize that as you know. So, I wondered that. > > > > I think it's for fairness. If a foreign transaction arrived earlier > > gets put off so often for other foreign transactions arrived later due > > to its index in FdwXactCtl->xacts, it’s not understandable for users > > and not fair. I think it’s better to handle foreign transactions in > > FIFO manner (although this problem exists even in the current code). > > OK, thanks. > > > On 2021/05/21 12:45, Masahiro Ikeda wrote: > > On 2021/05/21 10:39, Masahiko Sawada wrote: > >> On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> > wrote: > >> How much is the performance without those 2PC patches and with the > >> same workload? i.e., how fast is the current postgres_fdw that uses > >> XactCallback? > > > > OK, I'll test. > > The test results are followings. But, I couldn't confirm the performance > improvements of 2PC patches though I may need to be changed the test condition. 
> > [condition] > * 1 coordinator and 3 foreign servers > * There are two custom scripts which access different two foreign servers per > transaction > > ``` fxact_select.pgbench > BEGIN; > SELECT * FROM part:p1 WHERE id = :id; > SELECT * FROM part:p2 WHERE id = :id; > COMMIT; > ``` > > ``` fxact_update.pgbench > BEGIN; > UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id; > UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id; > COMMIT; > ``` > > [results] > > I have tested three times. > Performance difference seems to be within the range of errors. > > # 6d0eb38557 with 2pc patches(v36) and foreign_twophase_commit = disable > - fxact_update.pgbench > 72.3, 74.9, 77.5 TPS => avg 74.9 TPS > 110.5, 106.8, 103.2 ms => avg 106.8 ms > > - fxact_select.pgbench > 1767.6, 1737.1, 1717.4 TPS => avg 1740.7 TPS > 4.5, 4.6, 4.7 ms => avg 4.6ms > > # 6d0eb38557 without 2pc patches > - fxact_update.pgbench > 76.5, 70.6, 69.5 TPS => avg 72.2 TPS > 104.534 + 113.244 + 115.097 => avg 111.0 ms > > -fxact_select.pgbench > 1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS > 4.2, 4.6, 4.6 ms=> 4.5 ms > Thank you for testing! I think the result shows that managing foreign transactions on the core side would not be a problem in terms of performance. > > > > > # About the bottleneck of the resolver process > > I investigated the performance bottleneck of the resolver process using perf. > The main bottleneck is the following functions. > > 1st. 42.8% routine->CommitForeignTransaction() > 2nd. 31.5% remove_fdwxact() > 3rd. 10.16% CommitTransaction() > > 1st and 3rd problems can be solved by parallelizing resolver processes per > remote servers. But, I wondered that the idea, which backends call also > "COMMIT/ABORT PREPARED" and the resolver process only takes changes of > resolving in-doubt foreign transactions, is better. In many cases, I think > that the number of connections is much greater than the number of remote > servers. 
If so, the parallelization is not enough. > > So, I think the idea which backends execute "PREPARED COMMIT" synchronously is > better. The citus has the 2PC feature and backends send "PREPARED COMMIT" in > the extension. So, this idea is not bad. Thank you for pointing it out. This idea has been proposed several times and discussed. I'd like to summarize the proposed ideas and their pros and cons before replying to your other comments. There are 3 ideas. After the backend prepares all foreign transactions and commits the local transaction, 1. the backend continues attempting to commit all prepared foreign transactions until all of them are committed. 2. the backend attempts to commit all prepared foreign transactions once. If an error happens, leave them for the resolver. 3. the backend asks the resolver launched per foreign server to commit the prepared foreign transactions (and the backend waits or doesn't wait for the commit completion depending on the setting). With ideas 1 and 2, since the backend itself commits all foreign transactions, the resolver process cannot be a bottleneck, and probably the code can get simpler as backends don't need to communicate with resolver processes. However, those have two problems we need to deal with: First, users could get an error if something goes wrong while the backend is committing prepared foreign transactions, even though the local transaction is already committed and some foreign transactions could also already be committed, which is confusing. There were two opinions on this problem: FDW developers should be responsible for writing FDW code such that any error doesn't happen during committing foreign transactions, and users can accept that confusion since an error could happen after writing the commit WAL even today without this 2PC feature. 
For the former point, I'm not sure it's always doable since even palloc() could raise an error and it seems hard to require all FDW developers to understand all possible paths of raising an error. And for the latter point, that's true but I think those cases are should-not-happen cases (i.g., rare cases) whereas the likelihood of an error during committing prepared transactions is not low (e.g., by network connectivity problem). I think we need to assume that that is not a rare case. The second problem is whether we can cancel committing foreign transactions by pg_cancel_backend() (or pressing Ctl-c). If the backend process commits prepared foreign transactions, it's FDW developers' responsibility to write code that is interruptible. I’m not sure it’s feasible for drivers for other databases. Idea 3 is proposed to deal with those problems. By having separate processes, resolver processes, committing prepared foreign transactions, we and FDW developers don't need to worry about those two problems. However as Ikeda-san shared the performance results, idea 3 is likely to have a performance problem since resolver processes can easily be bottle-neck. Moreover, with the current patch, since we asynchronously commit foreign prepared transactions, if many concurrent clients use 2PC, reaching max_foreign_prepared_transactions, transactions end up with an error. Through the long discussion on this thread, I've been thought we got a consensus on idea 3 but sometimes ideas 1 and 2 are proposed again for dealing with the performance problem. Idea 1 and 2 are also good and attractive, but I think we need to deal with the two problems first if we go with one of those ideas. To be honest, I'm really not sure it's good if we make those things FDW developers responsibility. As long as we commit foreign prepared transactions asynchronously and there is max_foreign_prepared_transactions limit, it's possible that committing those transactions could not keep up. 
Maybe the same is true for a case where the client heavily uses 2PC and asynchronously commits prepared transactions. If committing prepared transactions doesn't keep up with preparing transactions, the system reaches max_prepared_transactions. With the current patch, we commit prepared foreign transactions asynchronously. But maybe we need to compare the performance of ideas 1 (and 2) to idea 3 with synchronous foreign transaction resolution. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Transactions involving multiple postgres foreign servers, take 2
Mail from Masahiko Sawada <sawada.mshk@gmail.com>, 2021/06/04 12:28:
On Thu, Jun 3, 2021 at 1:56 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
On 2021/05/25 21:59, Masahiko Sawada wrote:
On Fri, May 21, 2021 at 5:48 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
On 2021/05/21 13:45, Masahiko Sawada wrote:
Yes. We also might need to be careful about the order of foreign
transaction resolution. I think we need to resolve foreign
transactions in arrival order, at least within a foreign server.
I agree it's better.
(This is just my curiosity...)
Is it necessary? This idea seems to be for atomic visibility, but 2PC
can't realize that, as you know. So I wondered about that.
I think it's for fairness. If a foreign transaction that arrived
earlier keeps getting put off in favor of foreign transactions that
arrived later, due to its index in FdwXactCtl->xacts, that is hard for
users to understand and not fair. I think it's better to handle foreign
transactions in a FIFO manner (although this problem exists even in the
current code).
OK, thanks.
On 2021/05/21 12:45, Masahiro Ikeda wrote:
On 2021/05/21 10:39, Masahiko Sawada wrote:
On Thu, May 20, 2021 at 1:26 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
How much is the performance without those 2PC patches and with the
same workload? i.e., how fast is the current postgres_fdw that uses
XactCallback?
OK, I'll test.
The test results are as follows. But I couldn't confirm a performance
difference with the 2PC patches, though I may need to change the test
conditions.
[condition]
* 1 coordinator and 3 foreign servers
* There are two custom scripts which access different two foreign servers per
transaction
``` fxact_select.pgbench
BEGIN;
SELECT * FROM part:p1 WHERE id = :id;
SELECT * FROM part:p2 WHERE id = :id;
COMMIT;
```
``` fxact_update.pgbench
BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```
[results]
I tested three times.
The performance differences seem to be within the margin of error.
# 6d0eb38557 with 2pc patches(v36) and foreign_twophase_commit = disable
- fxact_update.pgbench
72.3, 74.9, 77.5 TPS => avg 74.9 TPS
110.5, 106.8, 103.2 ms => avg 106.8 ms
- fxact_select.pgbench
1767.6, 1737.1, 1717.4 TPS => avg 1740.7 TPS
4.5, 4.6, 4.7 ms => avg 4.6 ms
# 6d0eb38557 without 2pc patches
- fxact_update.pgbench
76.5, 70.6, 69.5 TPS => avg 72.2 TPS
104.534, 113.244, 115.097 ms => avg 111.0 ms
- fxact_select.pgbench
1810.2, 1748.3, 1737.2 TPS => avg 1765.2 TPS
4.2, 4.6, 4.6 ms => avg 4.5 ms
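As a quick sanity check (a throwaway Python snippet, not part of the patch set), the averaged figures quoted above are consistent with the raw runs:

```python
# Verify the averaged TPS/latency figures quoted in the results above.
def avg(xs):
    return sum(xs) / len(xs)

assert round(avg([72.3, 74.9, 77.5]), 1) == 74.9           # update TPS, with 2PC patches
assert round(avg([1767.6, 1737.1, 1717.4]), 1) == 1740.7   # select TPS, with 2PC patches
assert round(avg([76.5, 70.6, 69.5]), 1) == 72.2           # update TPS, without patches
assert round(avg([104.534, 113.244, 115.097]), 1) == 111.0 # update latency (ms), without
```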
Thank you for testing!
I think the result shows that managing foreign transactions on the
core side would not be a problem in terms of performance.
# About the bottleneck of the resolver process
I investigated the performance bottleneck of the resolver process using perf.
The main bottlenecks are the following functions.
1st. 42.8% routine->CommitForeignTransaction()
2nd. 31.5% remove_fdwxact()
3rd. 10.16% CommitTransaction()
The 1st and 3rd bottlenecks can be addressed by parallelizing resolver
processes per remote server. But I wondered whether the idea in which
backends also call "COMMIT/ABORT PREPARED" themselves, while the
resolver process only takes charge of resolving in-doubt foreign
transactions, is better. In many cases, I think the number of
connections is much greater than the number of remote servers. If so,
parallelization is not enough.
So, I think the idea in which backends execute "COMMIT PREPARED"
synchronously is better. Citus has a 2PC feature, and its backends send
"COMMIT PREPARED" from the extension, so this idea is not bad.
Thank you for pointing it out. This idea has been proposed several
times and there have been discussions about it. I'd like to summarize
the proposed ideas and their pros and cons before replying to your
other comments.
There are 3 ideas. After the backend prepares all foreign transactions
and commits the local transaction,
1. the backend continues attempting to commit all prepared foreign
transactions until all of them are committed.
2. the backend attempts to commit all prepared foreign transactions
once. If an error happens, leave them for the resolver.
3. the backend asks a resolver process, launched per foreign server, to
commit the prepared foreign transactions (and the backend waits or
doesn't wait for the commit completion, depending on the setting).
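To make the three options concrete, here is a minimal Python sketch (illustrative only; the function names and the resolver-queue abstraction are invented, not the patch's actual API). `server_commit(xact)` stands in for sending COMMIT PREPARED to a foreign server:

```python
# Sketch of the three proposed strategies for committing foreign
# transactions that have already been prepared.

def idea1_retry_until_done(server_commit, prepared):
    """Idea 1: the backend keeps retrying until every commit succeeds."""
    for xact in prepared:
        while True:
            try:
                server_commit(xact)
                break
            except Exception:
                continue  # retry; note this loop must stay interruptible

def idea2_try_once_then_handoff(server_commit, prepared, resolver_queue):
    """Idea 2: the backend tries each commit once; failures go to the resolver."""
    for xact in prepared:
        try:
            server_commit(xact)
        except Exception:
            resolver_queue.append(xact)  # left in-doubt for background resolution

def idea3_handoff_to_resolver(prepared, resolver_queue):
    """Idea 3: the backend only enqueues; per-server resolvers do the commits."""
    resolver_queue.extend(prepared)
```

With ideas 1 and 2 the backend itself sends the commits, so the resolver never sits on the critical path; with idea 3 every commit funnels through the resolver queue.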
With ideas 1 and 2, since the backend itself commits all foreign
transactions, the resolver process cannot be a bottleneck, and the code
can probably be simpler since backends don't need to communicate with
resolver processes.
However, those have two problems we need to deal with:
First, users could get an error if an error happens while the backend
is committing a prepared foreign transaction, even though the local
transaction is already committed and some foreign transactions could
also be committed, which confuses users. There were two opinions on
this problem: FDW developers should be responsible for writing FDW code
such that no error happens while committing foreign transactions, and
users can accept that confusion since an error could happen after
writing the commit WAL even today, without this 2PC feature. On the
former point, I'm not sure it's always doable, since even palloc()
could raise an error, and it seems hard to require all FDW developers
to understand all possible paths that raise an error. On the latter
point, that's true, but I think those cases are should-not-happen cases
(i.e., rare cases), whereas the likelihood of an error while committing
prepared transactions is not low (e.g., due to a network connectivity
problem). I think we need to assume that it is not a rare case.
The second problem is whether we can cancel committing foreign
transactions with pg_cancel_backend() (or by pressing Ctrl-C). If the
backend process commits prepared foreign transactions, it's the FDW
developers' responsibility to write code that is interruptible. I'm
not sure that's feasible for drivers for other databases.
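A sketch of what "interruptible" means here (a Python stand-in, with invented names; `cancel_requested` plays roughly the role of PostgreSQL's CHECK_FOR_INTERRUPTS(), and this is not the actual backend code):

```python
class Cancelled(Exception):
    """Raised when the user cancels while commits are being retried."""

def interruptible_commit(server_commit, xact, cancel_requested):
    """Retry COMMIT PREPARED for one foreign transaction, but check for a
    pending cancel request between attempts instead of looping blindly."""
    while True:
        if cancel_requested():
            # Leave the prepared transaction in-doubt; a resolver (or
            # manual resolution) must finish it later.
            raise Cancelled(f"commit of {xact} interrupted; left in-doubt")
        try:
            server_commit(xact)
            return
        except ConnectionError:
            continue  # transient failure: retry on the next iteration
```

An FDW whose driver blocks inside a single opaque call has nowhere to put that cancel check, which is the concern raised above.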
Idea 3 is proposed to deal with those problems. By having separate
processes (resolver processes) commit prepared foreign transactions,
neither we nor FDW developers need to worry about those two problems.
However, as Ikeda-san's performance results showed, idea 3 is likely
to have a performance problem, since resolver processes can easily
become a bottleneck. Moreover, with the current patch, since we
asynchronously commit foreign prepared transactions, if many concurrent
clients use 2PC and max_foreign_prepared_transactions is reached,
transactions end up with an error.
Through the long discussion on this thread, I had thought we reached a
consensus on idea 3, but ideas 1 and 2 are sometimes proposed again to
deal with the performance problem. Ideas 1 and 2 are also good and
attractive, but I think we need to deal with the two problems above
first if we go with one of them. To be honest, I'm really not sure it's
good to make those things the FDW developers' responsibility.
As long as we commit foreign prepared transactions asynchronously and
there is a max_foreign_prepared_transactions limit, it's possible that
committing those transactions cannot keep up. The same may be true for
a case where a client heavily uses 2PC and asynchronously commits
prepared transactions: if committing prepared transactions doesn't keep
up with preparing them, the system reaches max_prepared_transactions.
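A back-of-the-envelope model of that backlog (the numbers are assumed, purely for illustration):

```python
def ticks_until_limit(prepare_rate, resolve_rate, limit):
    """If transactions are prepared faster than they are resolved, the
    number of pending prepared transactions grows by the rate difference
    each tick; return how many ticks until `limit` is hit, or None if
    resolution keeps up."""
    if resolve_rate >= prepare_rate:
        return None
    growth = prepare_rate - resolve_rate
    return -(-limit // growth)  # ceiling division

# e.g. 100 prepares/sec against 80 resolutions/sec, with room for 200
# pending entries, exhausts the slots in 10 seconds.
```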
With the current patch, we commit prepared foreign transactions
asynchronously. But maybe we need to compare the performance of ideas
1 (and 2) to idea 3 with synchronous foreign transaction resolution.
Masahiro Ikeda
NTT DATA CORPORATION
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > 1. the backend continues attempting to commit all prepared foreign > transactions until all of them are committed. > 2. the backend attempts to commit all prepared foreign transactions > once. If an error happens, leave them for the resolver. > 3. the backend asks the resolver that launched per foreign server to > commit the prepared foreign transactions (and backend waits or doesn't > wait for the commit completion depending on the setting). > > With ideas 1 and 2, since the backend itself commits all foreign > transactions the resolver process cannot be a bottleneck, and probably > the code can get more simple as backends don't need to communicate > with resolver processes. > > However, those have two problems we need to deal with: > > First, users could get an error if an error happens during the backend > committing prepared foreign transaction but the local transaction is > already committed and some foreign transactions could also be > committed, confusing users. There were two opinions to this problem: > FDW developers should be responsible for writing FDW code such that > any error doesn't happen during committing foreign transactions, and > users can accept that confusion since an error could happen after > writing the commit WAL even today without this 2PC feature. Why does the user have to get an error? Once the local transaction has been prepared, which means all remote ones also have been prepared, the whole transaction is determined to commit. So, the user doesn't have to receive an error as long as the local node is alive. > For the > former point, I'm not sure it's always doable since even palloc() > could raise an error and it seems hard to require all FDW developers > to understand all possible paths of raising an error. No, this is a matter of discipline to ensure consistency, just in case we really have to return an error to the user.
> And for the > latter point, that's true but I think those cases are > should-not-happen cases (i.g., rare cases) whereas the likelihood of > an error during committing prepared transactions is not low (e.g., by > network connectivity problem). I think we need to assume that that is > not a rare case. How do non-2PC and 2PC cases differ in the rarity of the error? > The second problem is whether we can cancel committing foreign > transactions by pg_cancel_backend() (or pressing Ctl-c). If the > backend process commits prepared foreign transactions, it's FDW > developers' responsibility to write code that is interruptible. I’m > not sure it’s feasible for drivers for other databases. That's true not only for prepare and commit but also for other queries. Why do we have to treat prepare and commit specially? > Through the long discussion on this thread, I've been thought we got a > consensus on idea 3 but sometimes ideas 1 and 2 are proposed again for I don't remember seeing any consensus yet? > With the current patch, we commit prepared foreign transactions > asynchronously. But maybe we need to compare the performance of ideas > 1 (and 2) to idea 3 with synchronous foreign transaction resolution. +1 Regards Takayuki Tsunakawa
On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com <ikedamsh@oss.nttdata.com> wrote: > > > > 2021/06/04 12:28、Masahiko Sawada <sawada.mshk@gmail.com>のメール: > > > Thank you for pointing it out. This idea has been proposed several > times and there were discussions. I'd like to summarize the proposed > ideas and those pros and cons before replying to your other comments. > > There are 3 ideas. After backend both prepares all foreign transaction > and commit the local transaction, > > 1. the backend continues attempting to commit all prepared foreign > transactions until all of them are committed. > 2. the backend attempts to commit all prepared foreign transactions > once. If an error happens, leave them for the resolver. > 3. the backend asks the resolver that launched per foreign server to > commit the prepared foreign transactions (and backend waits or doesn't > wait for the commit completion depending on the setting). > > With ideas 1 and 2, since the backend itself commits all foreign > transactions the resolver process cannot be a bottleneck, and probably > the code can get more simple as backends don't need to communicate > with resolver processes. > > However, those have two problems we need to deal with: > > > Thanks for sharing the summarize. I understood there are problems related to > FDW implementation. > > First, users could get an error if an error happens during the backend > committing prepared foreign transaction but the local transaction is > already committed and some foreign transactions could also be > committed, confusing users. There were two opinions to this problem: > FDW developers should be responsible for writing FDW code such that > any error doesn't happen during committing foreign transactions, and > users can accept that confusion since an error could happen after > writing the commit WAL even today without this 2PC feature. 
For the > former point, I'm not sure it's always doable since even palloc() > could raise an error and it seems hard to require all FDW developers > to understand all possible paths of raising an error. And for the > latter point, that's true but I think those cases are > should-not-happen cases (i.g., rare cases) whereas the likelihood of > an error during committing prepared transactions is not low (e.g., by > network connectivity problem). I think we need to assume that that is > not a rare case. > > > Hmm… Sorry, I don’t have any good ideas now. > > If anything, I’m on second side which users accept the confusion though > let users know a error happens before local commit is done or not is necessary > because if the former case, users will execute the same query again. Yeah, users will need to remember the XID of the last executed transaction and check if it has been committed by pg_xact_status(). > > > The second problem is whether we can cancel committing foreign > transactions by pg_cancel_backend() (or pressing Ctl-c). If the > backend process commits prepared foreign transactions, it's FDW > developers' responsibility to write code that is interruptible. I’m > not sure it’s feasible for drivers for other databases. > > > Sorry, my understanding is not clear. > > After all prepares are done, the foreign transactions will be committed. > So, does this mean that FDW must leave the unresolved transaction to the transaction > resolver and show some messages like “Since the transaction is already committed, > the transaction will be resolved in background" ? I think this would happen after the backend cancels COMMIT PREPARED. To be able to cancel an in-progress query the backend needs to accept the interruption and send the cancel request. postgres_fdw can do that since libpq supports sending a query and waiting for the result but I’m not sure about other drivers. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <sawada.mshk@gmail.com> > 1. the backend continues attempting to commit all prepared foreign > > transactions until all of them are committed. > > 2. the backend attempts to commit all prepared foreign transactions > > once. If an error happens, leave them for the resolver. > > 3. the backend asks the resolver that launched per foreign server to > > commit the prepared foreign transactions (and backend waits or doesn't > > wait for the commit completion depending on the setting). > > > > With ideas 1 and 2, since the backend itself commits all foreign > > transactions the resolver process cannot be a bottleneck, and probably > > the code can get more simple as backends don't need to communicate > > with resolver processes. > > > > However, those have two problems we need to deal with: > > > > > First, users could get an error if an error happens during the backend > > committing prepared foreign transaction but the local transaction is > > already committed and some foreign transactions could also be > > committed, confusing users. There were two opinions to this problem: > > FDW developers should be responsible for writing FDW code such that > > any error doesn't happen during committing foreign transactions, and > > users can accept that confusion since an error could happen after > > writing the commit WAL even today without this 2PC feature. > > Why does the user have to get an error? Once the local transaction has been prepared, which means all remote ones alsohave been prepared, the whole transaction is determined to commit. So, the user doesn't have to receive an error aslong as the local node is alive. I think we should neither ignore the error thrown by FDW code nor lower the error level (e.g., ERROR to WARNING). 
> > > And for the > > latter point, that's true but I think those cases are > > should-not-happen cases (i.g., rare cases) whereas the likelihood of > > an error during committing prepared transactions is not low (e.g., by > > network connectivity problem). I think we need to assume that that is > > not a rare case. > > How do non-2PC and 2PC cases differ in the rarity of the error? I think the main difference would be that in 2PC case there will be network communications possibly with multiple servers after the local commit. > > > > The second problem is whether we can cancel committing foreign > > transactions by pg_cancel_backend() (or pressing Ctl-c). If the > > backend process commits prepared foreign transactions, it's FDW > > developers' responsibility to write code that is interruptible. I’m > > not sure it’s feasible for drivers for other databases. > > That's true not only for prepare and commit but also for other queries. Why do we have to treat prepare and commit specially? Good point. This would not be a blocker for ideas 1 and 2 but is a side benefit of idea 3. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > Why does the user have to get an error? Once the local transaction has been > prepared, which means all remote ones also have been prepared, the whole > transaction is determined to commit. So, the user doesn't have to receive an > error as long as the local node is alive. > > I think we should neither ignore the error thrown by FDW code nor > lower the error level (e.g., ERROR to WARNING). Why? (Forgive me for asking relentlessly... by imagining me as a cute 7-year-old boy/girl asking "Why Dad?") > > How do non-2PC and 2PC cases differ in the rarity of the error? > > I think the main difference would be that in 2PC case there will be > network communications possibly with multiple servers after the local > commit. Then, it's the same failure mode. That is, the same failure could occur for both cases. That doesn't require us to differentiate between them. Let's ignore this point from now on. Regards Takayuki Tsunakawa
On Fri, Jun 4, 2021 at 5:59 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > On Fri, Jun 4, 2021 at 5:04 PM tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > Why does the user have to get an error? Once the local transaction has been > > prepared, which means all remote ones also have been prepared, the whole > > transaction is determined to commit. So, the user doesn't have to receive an > > error as long as the local node is alive. > > > > I think we should neither ignore the error thrown by FDW code nor > > lower the error level (e.g., ERROR to WARNING). > > Why? (Forgive me for asking relentlessly... by imagining me as a cute 7-year-old boy/girl asking "Why Dad?") I think we should not reinterpret the severity of the error and lower it. Especially, in this case, any kind of error can be thrown. It could be such a serious error that the FDW developer wants to report it to the client. Do we lower even PANIC to a lower severity such as WARNING? That's definitely a bad idea. If we don't lower PANIC while lowering ERROR (and FATAL) to WARNING, why would we regard only the latter as non-errors? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Fri, Jun 4, 2021 at 5:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com > <ikedamsh@oss.nttdata.com> wrote: > > > > > > > > 2021/06/04 12:28、Masahiko Sawada <sawada.mshk@gmail.com>のメール: > > > > > > Thank you for pointing it out. This idea has been proposed several > > times and there were discussions. I'd like to summarize the proposed > > ideas and those pros and cons before replying to your other comments. > > > > There are 3 ideas. After backend both prepares all foreign transaction > > and commit the local transaction, > > > > 1. the backend continues attempting to commit all prepared foreign > > transactions until all of them are committed. > > 2. the backend attempts to commit all prepared foreign transactions > > once. If an error happens, leave them for the resolver. > > 3. the backend asks the resolver that launched per foreign server to > > commit the prepared foreign transactions (and backend waits or doesn't > > wait for the commit completion depending on the setting). > > > > With ideas 1 and 2, since the backend itself commits all foreign > > transactions the resolver process cannot be a bottleneck, and probably > > the code can get more simple as backends don't need to communicate > > with resolver processes. > > > > However, those have two problems we need to deal with: > > > > > > Thanks for sharing the summarize. I understood there are problems related to > > FDW implementation. > > > > First, users could get an error if an error happens during the backend > > committing prepared foreign transaction but the local transaction is > > already committed and some foreign transactions could also be > > committed, confusing users. 
There were two opinions to this problem: > > FDW developers should be responsible for writing FDW code such that > > any error doesn't happen during committing foreign transactions, and > > users can accept that confusion since an error could happen after > > writing the commit WAL even today without this 2PC feature. For the > > former point, I'm not sure it's always doable since even palloc() > > could raise an error and it seems hard to require all FDW developers > > to understand all possible paths of raising an error. And for the > > latter point, that's true but I think those cases are > > should-not-happen cases (i.g., rare cases) whereas the likelihood of > > an error during committing prepared transactions is not low (e.g., by > > network connectivity problem). I think we need to assume that that is > > not a rare case. > > > > > > Hmm… Sorry, I don’t have any good ideas now. > > > > If anything, I’m on second side which users accept the confusion though > > let users know a error happens before local commit is done or not is necessary > > because if the former case, users will execute the same query again. > > Yeah, users will need to remember the XID of the last executed > transaction and check if it has been committed by pg_xact_status(). As the second idea, can we send something like a hint along with the error (or send a new type of error) that indicates the error happened after the transaction commit so that the client can decide whether or not to ignore the error? That way, we can deal with the confusion led by an error raised after the local commit by the existing post-commit cleanup routines (and post-commit xact callbacks) as well as by FDW’s commit prepared routine. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
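The "new type of error" idea could look roughly like this on the client side (a hypothetical shape; PostgreSQL has no such error field today, and the class and function names are invented for illustration):

```python
class PostCommitError(Exception):
    """An error reported after the local transaction already committed."""
    def __init__(self, message, committed_locally):
        super().__init__(message)
        self.committed_locally = committed_locally

def client_should_retry(err):
    """With the flag, a client can tell a pre-commit failure (safe to
    retry the statement) from a post-commit one (the work is durable and
    the foreign transactions will be resolved in the background)."""
    if isinstance(err, PostCommitError) and err.committed_locally:
        return False
    return True
```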
Re: Transactions involving multiple postgres foreign servers, take 2
> 2021/06/04 17:16、Masahiko Sawada <sawada.mshk@gmail.com>のメール: > > On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com > <ikedamsh@oss.nttdata.com> wrote: >> >> >> >> 2021/06/04 12:28、Masahiko Sawada <sawada.mshk@gmail.com>のメール: >> >> >> Thank you for pointing it out. This idea has been proposed several >> times and there were discussions. I'd like to summarize the proposed >> ideas and those pros and cons before replying to your other comments. >> >> There are 3 ideas. After backend both prepares all foreign transaction >> and commit the local transaction, >> >> 1. the backend continues attempting to commit all prepared foreign >> transactions until all of them are committed. >> 2. the backend attempts to commit all prepared foreign transactions >> once. If an error happens, leave them for the resolver. >> 3. the backend asks the resolver that launched per foreign server to >> commit the prepared foreign transactions (and backend waits or doesn't >> wait for the commit completion depending on the setting). >> >> With ideas 1 and 2, since the backend itself commits all foreign >> transactions the resolver process cannot be a bottleneck, and probably >> the code can get more simple as backends don't need to communicate >> with resolver processes. >> >> However, those have two problems we need to deal with: >> >> >> Thanks for sharing the summarize. I understood there are problems related to >> FDW implementation. >> >> First, users could get an error if an error happens during the backend >> committing prepared foreign transaction but the local transaction is >> already committed and some foreign transactions could also be >> committed, confusing users. 
There were two opinions to this problem: >> FDW developers should be responsible for writing FDW code such that >> any error doesn't happen during committing foreign transactions, and >> users can accept that confusion since an error could happen after >> writing the commit WAL even today without this 2PC feature. For the >> former point, I'm not sure it's always doable since even palloc() >> could raise an error and it seems hard to require all FDW developers >> to understand all possible paths of raising an error. And for the >> latter point, that's true but I think those cases are >> should-not-happen cases (i.g., rare cases) whereas the likelihood of >> an error during committing prepared transactions is not low (e.g., by >> network connectivity problem). I think we need to assume that that is >> not a rare case. >> >> >> Hmm… Sorry, I don’t have any good ideas now. >> >> If anything, I’m on second side which users accept the confusion though >> let users know a error happens before local commit is done or not is necessary >> because if the former case, users will execute the same query again. > > Yeah, users will need to remember the XID of the last executed > transaction and check if it has been committed by pg_xact_status(). > >> >> >> The second problem is whether we can cancel committing foreign >> transactions by pg_cancel_backend() (or pressing Ctl-c). If the >> backend process commits prepared foreign transactions, it's FDW >> developers' responsibility to write code that is interruptible. I’m >> not sure it’s feasible for drivers for other databases. >> >> >> Sorry, my understanding is not clear. >> >> After all prepares are done, the foreign transactions will be committed. >> So, does this mean that FDW must leave the unresolved transaction to the transaction >> resolver and show some messages like “Since the transaction is already committed, >> the transaction will be resolved in background" ? 
> > I think this would happen after the backend cancels COMMIT PREPARED. > To be able to cancel an in-progress query the backend needs to accept > the interruption and send the cancel request. postgres_fdw can do that > since libpq supports sending a query and waiting for the result but > I'm not sure about other drivers. Thanks, I understood that handling this issue is not in the scope of the 2PC feature, as you and Tsunakawa-san said. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: Transactions involving multiple postgres foreign servers, take 2
> 2021/06/04 21:38、Masahiko Sawada <sawada.mshk@gmail.com>のメール: > > On Fri, Jun 4, 2021 at 5:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >> On Fri, Jun 4, 2021 at 3:58 PM ikedamsh@oss.nttdata.com >> <ikedamsh@oss.nttdata.com> wrote: >>> >>> >>> >>> 2021/06/04 12:28、Masahiko Sawada <sawada.mshk@gmail.com>のメール: >>> >>> >>> Thank you for pointing it out. This idea has been proposed several >>> times and there were discussions. I'd like to summarize the proposed >>> ideas and those pros and cons before replying to your other comments. >>> >>> There are 3 ideas. After backend both prepares all foreign transaction >>> and commit the local transaction, >>> >>> 1. the backend continues attempting to commit all prepared foreign >>> transactions until all of them are committed. >>> 2. the backend attempts to commit all prepared foreign transactions >>> once. If an error happens, leave them for the resolver. >>> 3. the backend asks the resolver that launched per foreign server to >>> commit the prepared foreign transactions (and backend waits or doesn't >>> wait for the commit completion depending on the setting). >>> >>> With ideas 1 and 2, since the backend itself commits all foreign >>> transactions the resolver process cannot be a bottleneck, and probably >>> the code can get more simple as backends don't need to communicate >>> with resolver processes. >>> >>> However, those have two problems we need to deal with: >>> >>> >>> Thanks for sharing the summarize. I understood there are problems related to >>> FDW implementation. >>> >>> First, users could get an error if an error happens during the backend >>> committing prepared foreign transaction but the local transaction is >>> already committed and some foreign transactions could also be >>> committed, confusing users. 
There were two opinions on this problem: >>> FDW developers should be responsible for writing FDW code such that >>> no error happens during committing foreign transactions, and >>> users can accept that confusion since an error could happen after >>> writing the commit WAL even today without this 2PC feature. For the >>> former point, I'm not sure it's always doable since even palloc() >>> could raise an error and it seems hard to require all FDW developers >>> to understand all possible paths of raising an error. And for the >>> latter point, that's true but I think those cases are >>> should-not-happen cases (i.e., rare cases) whereas the likelihood of >>> an error during committing prepared transactions is not low (e.g., a >>> network connectivity problem). I think we need to assume that it is >>> not a rare case. >>> >>> >>> Hmm… Sorry, I don’t have any good ideas now. >>> >>> If anything, I’m on the second side, where users accept the confusion, though >>> it is necessary to let users know whether the error happened before the local >>> commit was done or not, because in the former case users will execute the same query again. >> >> Yeah, users will need to remember the XID of the last executed >> transaction and check if it has been committed by pg_xact_status(). > > As the second idea, can we send something like a hint along with the > error (or send a new type of error) that indicates the error happened > after the transaction commit so that the client can decide whether or > not to ignore the error? That way, we can deal with the confusion led > by an error raised after the local commit by the existing post-commit > cleanup routines (and post-commit xact callbacks) as well as by FDW’s > commit prepared routine. I think your second idea is better because it’s easier for users to know what error happened and there is nothing users should do. Since the focus of a "hint” is how to fix the problem, is it appropriate to use "context”? 
FWIW, I took a quick look at elog.c and found there is “error_context_stack”. So, why don’t you add a context message like "the transaction’s fate has been decided as COMMIT (or ROLLBACK), so even if an error happens, the transaction will be resolved in the background” after the local commit? Regards, -- Masahiro Ikeda NTT DATA CORPORATION
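As a side note for readers following along, idea 2 from the summary upthread (the backend attempts each prepared foreign commit once and leaves failures for the resolver) can be sketched roughly as follows. This is Python pseudocode with invented names, not the actual patch; the real logic would live in the core transaction manager.

```python
# Sketch of idea 2: after the local commit record is written, the global
# outcome is fixed, so a failed remote COMMIT PREPARED is merely deferred
# to a background resolver rather than rolled back.

from queue import Queue

resolver_queue = Queue()  # stands in for the shared queue a resolver worker drains

def commit_prepared(server, xact_id):
    """Stand-in for the FDW's COMMIT PREPARED call; may raise on network errors."""
    if server.get("down"):
        raise ConnectionError(f"cannot reach {server['name']}")

def end_distributed_transaction(prepared_xacts):
    """Called after the local commit: try each foreign commit exactly once."""
    for server, xact_id in prepared_xacts:
        try:
            commit_prepared(server, xact_id)               # idea 2: one attempt only
        except ConnectionError:
            resolver_queue.put((server["name"], xact_id))  # resolver retries later

servers = [({"name": "fdw1", "down": False}, "fx_1"),
           ({"name": "fdw2", "down": True},  "fx_2")]
end_distributed_transaction(servers)
print(list(resolver_queue.queue))  # → [('fdw2', 'fx_2')]
```

The open question in the thread is what, if anything, the client should be told when the queue is non-empty.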
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > I think we should not reinterpret the severity of the error and lower > it. Especially, in this case, any kind of errors can be thrown. It > could be such a serious error that FDW developer wants to report to > the client. Do we lower even PANIC to a lower severity such as > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas > lowering ERROR (and FATAL) to WARNING, why do we regard only them as > non-error? Why does the client have to know the error on a remote server, when the global transaction itself is destined to commit? FYI, the tx_commit() in the X/Open TX interface and the UserTransaction.commit() in JTA don't return such an error, IIRC. Do TX_FAIL and SystemException serve such a purpose? I don't think they do. [Tuxedo manual (Japanese)] https://docs.oracle.com/cd/F25597_01/document/products/tuxedo/tux80j/atmi/rf3c91.htm [JTA] public interface javax.transaction.UserTransaction public void commit() throws RollbackException, HeuristicMixedException, HeuristicRollbackException, SecurityException, IllegalStateException, SystemException Throws: RollbackException Thrown to indicate that the transaction has been rolled back rather than committed. Throws: HeuristicMixedException Thrown to indicate that a heuristic decision was made and that some relevant updates have been committed while others have been rolled back. Throws: HeuristicRollbackException Thrown to indicate that a heuristic decision was made and that all relevant updates have been rolled back. Throws: SecurityException Thrown to indicate that the thread is not allowed to commit the transaction. Throws: IllegalStateException Thrown if the current thread is not associated with a transaction. Throws: SystemException Thrown if the transaction manager encounters an unexpected error condition. Regards Takayuki Tsunakawa
On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > I think we should not reinterpret the severity of the error and lower > > it. Especially, in this case, any kind of errors can be thrown. It > > could be such a serious error that FDW developer wants to report to > > the client. Do we lower even PANIC to a lower severity such as > > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas > > lowering ERROR (and FATAL) to WARNING, why do we regard only them as > > non-error? > > Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to commit? It's not necessarily on a remote server. It could be a problem with the local server. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
(I have caught up here. Sorry in advance for possibly pointless discussion by me..) At Tue, 8 Jun 2021 00:47:08 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Masahiko Sawada <sawada.mshk@gmail.com> > > I think we should not reinterpret the severity of the error and lower > > it. Especially, in this case, any kind of errors can be thrown. It > > could be such a serious error that FDW developer wants to report to > > the client. Do we lower even PANIC to a lower severity such as > > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas > > lowering ERROR (and FATAL) to WARNING, why do we regard only them as > > non-error? > > Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to commit? I think the discussion is based on the behavior that any process that is responsible for finishing the 2pc-commit continues retrying remote commits until all of the remote commits succeed. Maybe in most cases the errors during remote-prepared-commit could be retryable, but as Sawada-san says I'm also not sure it's always the case. On the other hand, it could be said that we have no other way than retrying the remote commits if we want to get over, say, instant network failures automatically. It is somewhat similar to WAL restoration, which keeps complaining about restore_command failures without exiting. > FYI, the tx_commit() in the X/Open TX interface and the UserTransaction.commit() in JTA don't return such an error, IIRC. Do TX_FAIL and SystemException serve such a purpose? I don't feel like that. I'm not sure about how JTA works in detail, but doesn't UserTransaction.commit() throw HeuristicMixedException when some of the relevant updates have been committed but others have not? Isn't it the same state as the case where some of the remote servers failed on remote-commit while others succeeded? 
(I guess that UserTransaction.commit() would throw RollbackException if remote-prepare has failed for any of the remotes.) > [Tuxedo manual (Japanese)] > https://docs.oracle.com/cd/F25597_01/document/products/tuxedo/tux80j/atmi/rf3c91.htm > > > [JTA] > public interface javax.transaction.UserTransaction > public void commit() > throws RollbackException, HeuristicMixedException, > HeuristicRollbackException, SecurityException, > IllegalStateException, SystemException > > Throws: RollbackException > Thrown to indicate that the transaction has been rolled back rather than committed. > > Throws: HeuristicMixedException > Thrown to indicate that a heuristic decision was made and that some relevant updates have been > committed while others have been rolled back. > > Throws: HeuristicRollbackException > Thrown to indicate that a heuristic decision was made and that all relevant updates have been rolled > back. > > Throws: SecurityException > Thrown to indicate that the thread is not allowed to commit the transaction. > > Throws: IllegalStateException > Thrown if the current thread is not associated with a transaction. > > Throws: SystemException > Thrown if the transaction manager encounters an unexpected error condition. > > > Regards > Takayuki Tsunakawa -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 8 Jun 2021 16:32:14 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in > On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > > I think we should not reinterpret the severity of the error and lower > > > it. Especially, in this case, any kind of errors can be thrown. It > > > could be such a serious error that FDW developer wants to report to > > > the client. Do we lower even PANIC to a lower severity such as > > > WARNING? That's definitely a bad idea. If we don’t lower PANIC whereas > > > lowering ERROR (and FATAL) to WARNING, why do we regard only them as > > > non-error? > > > > Why does the client have to know the error on a remote server, whereas the global transaction itself is destined to commit? > > It's not necessarily on a remote server. It could be a problem with > the local server. Isn't it a discussion about the errors from postgres_fdw? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > Why does the client have to know the error on a remote server, whereas the > global transaction itself is destined to commit? > > It's not necessarily on a remote server. It could be a problem with > the local server. Then, in what kind of scenario are we talking about the difficulty, and how is it difficult to handle, when we adopt either method 1 or 2? (I'd just like to have the same clear picture.) For example, 1. All FDWs prepared successfully. 2. The local transaction prepared successfully, too. 3. Some FDWs committed successfully. 4. One FDW failed to send the commit request because the remote server went down. Regards Takayuki Tsunakawa
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > I think the discussion is based on the behavior that any process that is > responsible for finishing the 2pc-commit continues retrying remote > commits until all of the remote-commits succeed. Thank you for coming back. We're talking about the first attempt to prepare and commit in each transaction, not the retry case. > > Throws: HeuristicMixedException > > Thrown to indicate that a heuristic decision was made and that some > relevant updates have been > > committed while others have been rolled back. > I'm not sure about how JTA works in detail, but doesn't > UserTransaction.commit() throw HeuristicMixedException when some of the > relevant updates have been committed but others have not? Isn't it the same > state as the case where some of the remote servers failed on > remote-commit while others succeeded? No. Taking the description literally and considering the relevant XA specification, it's not about the remote commit failure. The remote server is not allowed to fail the commit once it has reported successful prepare, which is the contract of 2PC. HeuristicMixedException is about the manual resolution, typically by the DBA, using the DBMS-specific tool or the standard commit()/rollback() API. > (I guess that > UserTransaction.commit() would throw RollbackException if > remote-prepare has failed for any of the remotes.) Correct. Regards Takayuki Tsunakawa
On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > On Tue, Jun 8, 2021 at 9:47 AM tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > Why does the client have to know the error on a remote server, whereas the > > global transaction itself is destined to commit? > > > > It's not necessarily on a remote server. It could be a problem with > > the local server. > > Then, in what kind of scenario are we talking about the difficulty, and how is it difficult to handle, when we adopt either method 1 or 2? (I'd just like to have the same clear picture.) IMO, even though FDW's commit/rollback transaction code could be simple in some cases, I think we need to think that any kind of errors (or even FATAL or PANIC) could be thrown from the FDW code. It could be an error due to a temporary network problem, remote server down, driver’s unexpected error, or out of memory etc. Errors that happen after the local transaction commit don't affect the global transaction decision, as you mentioned. But the process or system could be in a bad state. Also, users might expect the process to exit on error by setting exit_on_error = on. Your idea sounds like we have to ignore any errors happening after the local commit if they don’t affect the transaction outcome. It’s too scary to me and I think that it's a bad idea to blindly ignore all possible errors under such conditions. That could make things worse and will likely be a foot-gun. It would be good if we could prove that it’s safe to ignore those errors, but I'm not sure how we can, at least for me. This situation is true even today; an error could happen after committing the transaction. But I personally don’t want to add code that increases the likelihood. Just to be clear, with your idea, will we ignore only ERROR, or also FATAL and PANIC? 
And if an error happens during committing one of the prepared transactions on the foreign server, will we proceed with committing other transactions or return OK to the client? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > Then, in what kind of scenario are we talking about the difficulty, and how is > it difficult to handle, when we adopt either method 1 or 2? (I'd just like to > have the same clear picture.) > > IMO, even though FDW's commit/rollback transaction code could be > simple in some cases, I think we need to think that any kind of errors > (or even FATAL or PANIC) could be thrown from the FDW code. It could > be an error due to a temporary network problem, remote server down, > driver’s unexpected error, or out of memory etc. Errors that happened > after the local transaction commit doesn't affect the global > transaction decision, as you mentioned. But the process or system > could be in a bad state. Also, users might expect the process to exit > on error by setting exit_on_error = on. Your idea sounds like that we > have to ignore any errors happening after the local commit if they > don’t affect the transaction outcome. It’s too scary to me and I think > that it's a bad idea to blindly ignore all possible errors under such > conditions. That could make the thing worse and will likely be > foot-gun. It would be good if we can prove that it’s safe to ignore > those errors but not sure how we can at least for me. > > This situation is true even today; an error could happen after > committing the transaction. But I personally don’t want to add the > code that increases the likelihood. I'm not talking about the code simplicity here (actually, I haven't reviewed the code around prepare and commit in the patch yet...) Also, I don't understand well what you're trying to insist and what realistic situations you have in mind by citing exit_on_error, FATAL, PANIC and so on. I just asked (in a different part) why the client has to know the error. 
Just to be clear, I'm not saying that we should hide the error completely behind the scenes. For example, you can allow the FDW to emit a WARNING if the DBMS-specific client driver returns an error when committing. Further, if you want to allow the FDW to throw an ERROR when committing, the transaction manager in core can catch it by PG_TRY(), so that it can report back the successful commit of the global transaction to the client while it leaves the handling of the failed commit of the FDW to the resolver. (I don't think we like to use PG_TRY() during transaction commit for performance reasons, though.) Granting a hundred steps for the sake of argument, let's say we want to report the error of the committing FDW to the client. If that's the case, we can use SQLSTATE 02xxx (Warning) and attach the error message. > Just to be clear, with your idea, will we ignore only ERROR or also > FATAL and PANIC? And if an error happens during committing one of the > prepared transactions on the foreign server, will we proceed with > committing other transactions or return OK to the client? Neither FATAL nor PANIC can be ignored. When FATAL, which means the termination of a particular session, the committing of the remote transaction should be taken over by the resolver. Not to mention PANIC; we can't do anything. Otherwise, we proceed with committing other FDWs, hand off the task of committing the failed FDW to the resolver, and report success to the client. If you're not convinced, I'd like to ask you to investigate the code of some Java EE app server, say GlassFish, and share with us how it handles an error during commit. Regards Takayuki Tsunakawa
On Wed, Jun 9, 2021 at 4:10 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Masahiko Sawada <sawada.mshk@gmail.com> > > On Tue, Jun 8, 2021 at 5:28 PM tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > Then, in what kind of scenario are we talking about the difficulty, and how is > > it difficult to handle, when we adopt either method 1 or 2? (I'd just like to > > have the same clear picture.) > > > > IMO, even though FDW's commit/rollback transaction code could be > > simple in some cases, I think we need to think that any kind of errors > > (or even FATAL or PANIC) could be thrown from the FDW code. It could > > be an error due to a temporary network problem, remote server down, > > driver’s unexpected error, or out of memory etc. Errors that happened > > after the local transaction commit doesn't affect the global > > transaction decision, as you mentioned. But the process or system > > could be in a bad state. Also, users might expect the process to exit > > on error by setting exit_on_error = on. Your idea sounds like that we > > have to ignore any errors happening after the local commit if they > > don’t affect the transaction outcome. It’s too scary to me and I think > > that it's a bad idea to blindly ignore all possible errors under such > > conditions. That could make the thing worse and will likely be > > foot-gun. It would be good if we can prove that it’s safe to ignore > > those errors but not sure how we can at least for me. > > > > This situation is true even today; an error could happen after > > committing the transaction. But I personally don’t want to add the > > code that increases the likelihood. > > I'm not talking about the code simplicity here (actually, I haven't reviewed the code around prepare and commit in the patch yet...) Also, I don't understand well what you're trying to insist and what realistic situations you have in mind by citing exit_on_error, FATAL, PANIC and so on. 
> I just asked (in a different part) why the client has to know the error. > > Just to be clear, I'm not saying that we should hide the error completely behind the scenes. For example, you can allow the FDW to emit a WARNING if the DBMS-specific client driver returns an error when committing. Further, if you want to allow the FDW to throw an ERROR when committing, the transaction manager in core can catch it by PG_TRY(), so that it can report back the successful commit of the global transaction to the client while it leaves the handling of the failed commit of the FDW to the resolver. (I don't think we like to use PG_TRY() during transaction commit for performance reasons, though.) > > Let's give it a hundred steps and let's say we want to report the error of the committing FDW to the client. If that's the case, we can use SQLSTATE 02xxx (Warning) and attach the error message. > Maybe it's better to start a new thread to discuss this topic. If your idea is good, we can lower all errors that happen after writing the commit record to warnings, reducing the cases where the client gets confused by receiving an error after the commit. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
From: Masahiko Sawada <sawada.mshk@gmail.com> > Maybe it's better to start a new thread to discuss this topic. If your > idea is good, we can lower all error that happened after writing the > commit record to warning, reducing the cases where the client gets > confusion by receiving an error after the commit. No. It's an important part because it determines the 2PC behavior and performance. This discussion had started from the concern about performance before Ikeda-san reported pathological results. Don't rush forward, hoping someone will commit the current patch. I'm afraid you just don't want to change your design and code. Let's face the real issue. As I said before, and as Ikeda-san's performance benchmark results show, I have to say the design isn't done sufficiently. I talked with Fujii-san the other day about this patch. The patch is already huge and it's difficult to decode how the patch works, e.g., what kind of new WAL it emits, how many disk writes it adds, how errors are handled, whether/how it's different from the textbook or other existing designs, etc. What happened to my request to add such a design description to the following page, so that reviewers can consider the design before spending much time on looking at the code? What's the situation of the new FDW API that should naturally accommodate other FDW implementations? Atomic Commit of Distributed Transactions https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions Design should come first. I don't think it's a sincere attitude to require reviewers to spend a long time to read the design from huge code. Regards Takayuki Tsunakawa
At Tue, 8 Jun 2021 08:45:24 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > I think the discussion is based on the behavior that any process that is > > responsible for finishing the 2pc-commit continues retrying remote > > commits until all of the remote commits succeed. > > Thank you for coming back. We're talking about the first attempt to prepare and commit in each transaction, not the retry case. If we accept that each elementary commit (via an FDW connection) may fail, there's no way the parent (root) 2pc-commit can succeed. How can we ignore the fdw-error in that case? > > > Throws: HeuristicMixedException > > > Thrown to indicate that a heuristic decision was made and that some > relevant updates have been > > > committed while others have been rolled back. > > > I'm not sure about how JTA works in detail, but doesn't > > UserTransaction.commit() throw HeuristicMixedException when some of the > > relevant updates have been committed but others have not? Isn't it the same > > state as the case where some of the remote servers failed on > > remote-commit while others succeeded? > > No. Taking the description literally and considering the relevant XA specification, it's not about the remote commit failure. The remote server is not allowed to fail the commit once it has reported successful prepare, which is the contract of 2PC. HeuristicMixedException is about the manual resolution, typically by the DBA, using the DBMS-specific tool or the standard commit()/rollback() API. Mmm. The above seems as if it is saying that the 2pc-commit does not interact with remotes. The interface contract does not cover everything that happens in the real world. If a remote commit fails, that is just an issue outside of the 2PC world. In reality a remote commit may fail for all sorts of reasons. 
https://www.ibm.com/docs/ja/db2-for-zos/11?topic=support-example-distributed-transaction-that-uses-jta-methods > } catch (javax.transaction.xa.XAException xae) > { // Distributed transaction failed, so roll it back. > // Report XAException on prepare/commit. This suggests that both XAResource.prepare() and commit() can throw an exception. > > (I guess that > > UserTransaction.commit() would throw RollbackException if > > remote-prepare has failed for any of the remotes.) > > Correct. So UserTransaction.commit() does not throw the same exception if a remote commit fails. Isn't HeuristicMixedException the exception thrown in that case? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
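To make the JTA semantics being debated here concrete, the following is a toy model (Python, invented names, not the actual JTA API): a prepare failure surfaces as RollbackException, while divergent branch outcomes after the commit decision surface as HeuristicMixedException.

```python
# Toy model of UserTransaction.commit() outcome reporting:
# - any prepare failure => everything rolled back => RollbackException
# - after the COMMIT decision, a branch may have decided heuristically on
#   its own; mixed final outcomes => HeuristicMixedException

class RollbackException(Exception): pass
class HeuristicMixedException(Exception): pass

def commit(branches):
    # Phase 1: prepare everywhere; one failure dooms the whole transaction.
    if not all(b["prepare_ok"] for b in branches):
        raise RollbackException("a branch failed to prepare; all rolled back")
    # Phase 2: the decision is COMMIT, but a branch may report a heuristic
    # outcome it already took unilaterally (e.g. a DBA rolled it back).
    outcomes = {b["heuristic"] or "committed" for b in branches}
    if len(outcomes) > 1:
        raise HeuristicMixedException(f"mixed outcomes: {sorted(outcomes)}")

mixed = [{"prepare_ok": True, "heuristic": None},
         {"prepare_ok": True, "heuristic": "rolled_back"}]  # branch decided alone
try:
    commit(mixed)
except HeuristicMixedException as e:
    print(e)  # → mixed outcomes: ['committed', 'rolled_back']
```

Note the distinction Tsunakawa-san draws upthread: in this model the mixed outcome comes from a heuristic decision on the branch, not from the branch simply failing to execute the commit request.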
RE: Transactions involving multiple postgres foreign servers, take 2
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > If we accept each elementary-commit (via FDW connection) to fail, the > parent(?) there's no way the root 2pc-commit can succeed. How can we > ignore the fdw-error in that case? No, we don't ignore the error during FDW commit. As mentioned at the end of this mail, the question is how the FDW reports the error to the caller (the transaction manager in Postgres core), and how we should handle it. As below, Glassfish catches the resource manager's error during commit, retries the commit if the error is transient or a communication failure, and hands off the processing of the failed commit to the recovery manager. (I used all of my energy today; I'd be grateful if someone could figure out whether Glassfish reports the error to the application.)

[XATerminatorImpl.java]
public void commit(Xid xid, boolean onePhase) throws XAException {
    ...
    } else {
        coord.commit();
    }

[TopCoordinator.java]
    // Commit all participants. If a fatal error occurs during
    // this method, then the process must be ended with a fatal error.
    ...
    try {
        participants.distributeCommit();
    } catch (Throwable exc) {

[RegisteredResources.java]
void distributeCommit() throws HeuristicMixed, HeuristicHazard, NotPrepared {
    ...
    // Browse through the participants, committing them. The following is
    // intended to be done asynchronously as a group of operations.
    ...
    // Tell the resource to commit.
    // Catch any exceptions here; keep going until
    // no exception is left.
    ...
    // If the exception is neither TRANSIENT or
    // COMM_FAILURE, it is unexpected, so display a
    // message and give up with this Resource.
    ...
    // For TRANSIENT or COMM_FAILURE, wait
    // for a while, then retry the commit.
    ...
    // If the retry limit has been exceeded,
    // end the process with a fatal error.
    ...
    if (!transactionCompleted) {
        if (coord != null)
            RecoveryManager.addToIncompleTx(coord, true);

> > No. 
Taking the description literally and considering the relevant XA specification, it's not about the remote commit failure. The remote server is not allowed to fail the commit once it has reported successful prepare, which is the contract of 2PC. HeuristicMixedException is about the manual resolution, typically by the DBA, using the DBMS-specific tool or the standard commit()/rollback() API. > > Mmm. The above seems as if saying that 2pc-commit does not interact > with remotes. The interface contract does not cover everything that > happens in the real world. If remote-commit fails, that is just an > issue outside of the 2pc world. In reality remote-commit may fail for > all reasons. The following part of the XA specification is relevant. We're considering modeling the FDW 2PC interface on XA, because it seems like the only standard interface and thus other FDWs would naturally take advantage of it, aren't we? Then, we need to take care of such things as this. The interface design is not easy. So, a proper design and its review should come first, before going deeper into the huge code patch. 2.3.3 Heuristic Branch Completion -------------------------------------------------- Some RMs may employ heuristic decision-making: an RM that has prepared to commit a transaction branch may decide to commit or roll back its work independently of the TM. It could then unlock shared resources. This may leave them in an inconsistent state. When the TM ultimately directs an RM to complete the branch, the RM may respond that it has already done so. The RM reports whether it committed the branch, rolled it back, or completed it with mixed results (committed some work and rolled back other work). An RM that reports heuristic completion to the TM must not discard its knowledge of the transaction branch. The TM calls the RM once more to authorise it to forget the branch. 
This requirement means that the RM must notify the TM of all heuristic decisions, even those that match the decision the TM requested. The referenced OSI DTP specifications (model) and (service) define heuristics more precisely. -------------------------------------------------- > https://www.ibm.com/docs/ja/db2-for-zos/11?topic=support-example-distributed-transaction-that-uses-jta-methods > This suggests that both XAResource.prepare() and commit() can throw an > exception. Yes, XAResource.commit() can throw an exception: void commit(Xid xid, boolean onePhase) throws XAException Throws: XAException An error has occurred. Possible XAExceptions are XA_HEURHAZ, XA_HEURCOM, XA_HEURRB, XA_HEURMIX, XAER_RMERR, XAER_RMFAIL, XAER_NOTA, XAER_INVAL, or XAER_PROTO. This is equivalent to xa_commit() in the XA specification. xa_commit() can return error codes that have the same names as above. The question we're trying to answer here is: * How should such an error be handled? Glassfish (and possibly other Java EE servers) catches the error, continues to commit the rest of the participants, and handles the failed resource manager's commit in the background. In Postgres, if we allow FDWs to do ereport(ERROR), how can we do similar things? * Should we report the error to the client? If yes, should it be reported as a failure of commit, or as an informational message (WARNING) of a successful commit? Why does the client want to know the error, when the global transaction's commit has been promised? Regards Takayuki Tsunakawa
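The Glassfish distributeCommit() strategy quoted above boils down to the following loop. This is a condensed Python sketch with invented names (not the actual Glassfish code, which is Java and CORBA-based): retry a branch's commit on TRANSIENT/COMM_FAILURE up to a limit, give up immediately on any other error, and hand every still-uncommitted branch to the recovery manager.

```python
# Condensed model of RegisteredResources.distributeCommit(): the commit
# decision is final, so errors only control retries and recovery handoff.

RETRY_LIMIT = 3
RETRYABLE = {"TRANSIENT", "COMM_FAILURE"}

def distribute_commit(resources, recovery_manager):
    for res in resources:
        for attempt in range(RETRY_LIMIT):
            code = res["commit"]()     # None on success, else an error code
            if code is None:
                break                  # this branch is committed
            if code not in RETRYABLE:
                break                  # unexpected error: stop retrying this branch
        else:
            code = "RETRY_EXHAUSTED"   # retryable failures used up the limit
        if code is not None:
            recovery_manager.append(res["name"])  # recovery finishes it later

# A resource that fails transiently twice, then succeeds, next to one
# that fails with a non-retryable (XA-style) code:
attempts = iter(["COMM_FAILURE", "COMM_FAILURE", None])
flaky = {"name": "rm1", "commit": lambda: next(attempts)}
dead = {"name": "rm2", "commit": lambda: "XAER_RMFAIL"}
incomplete = []
distribute_commit([flaky, dead], incomplete)
print(incomplete)  # → ['rm2']
```

This mirrors the open question in the mail above: the caller completes the global commit regardless, and only the recovery list records which branches still need attention.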
On Fri, Jun 4, 2021 at 4:04 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > Why does the user have to get an error? Once the local transaction has been prepared, which means all remote ones also have been prepared, the whole transaction is determined to commit. So, the user doesn't have to receive an error as long as the local node is alive. That is completely unrealistic. As Sawada-san has pointed out repeatedly, there are tons of things that can go wrong even after the remote side has prepared the transaction. Preparing a transaction only promises that the remote side will let you commit the transaction upon request. It doesn't guarantee that you'll be able to make the request. Like Sawada-san says, network problems, out of memory issues, or many other things could stop that from happening. Someone could come along in another session and run "ROLLBACK PREPARED" on the remote side, and now the "COMMIT PREPARED" will never succeed no matter how many times you try it. At least, not unless someone goes and creates a new prepared transaction with the same 2PC identifier, but then you won't be committing the correct transaction anyway. Or someone could take the remote server and drop it in a volcano. How do you propose that we avoid giving the user an error after the remote server has been dropped into a volcano, even though the local node is still alive? Also, leaving aside theoretical arguments, I think it's not realistically possible for an FDW author to write code to commit a prepared transaction that will be safe in the context of running late in PrepareTransaction(), after we've already done RecordTransactionCommit(). Such code can't avoid throwing errors because it can't avoid performing operations and allocating memory. It's already been mentioned that, if an ERROR is thrown, it would be reported to the user in place of the COMMIT acknowledgement that they are expecting. 
Now, it has also been suggested that we could downgrade the ERROR to a WARNING and still report the COMMIT. That doesn't sound easy to do, because when the ERROR happens, control is going to jump to AbortTransaction(). But even if you could hack it so it works like that, it doesn't really solve the problem. What about all of the other servers where the prepared transaction also needs to be committed? In the design of PostgreSQL, in all circumstances, the way you recover from an error is to abort the transaction. That is what brings the system back to a clean state. You can't simply ignore the requirement to abort the transaction and keep doing more work. It will never be reliable, and Tom will instantaneously demand that any code that works like that be reverted -- and for good reason. I am not sure that it's 100% impossible to find a way to solve this problem without just having the resolver do all the work, but I think it's going to be extremely difficult. We tried to figure out some vaguely similar things while working on undo, and it really didn't go very well. The later stages of CommitTransaction() and AbortTransaction() are places where very few kinds of code are safe to execute, and finding a way to patch around that problem is not simple either. If the resolver performance is poor, perhaps we could try to find a way to improve it. I don't know. But I don't think it does any good to say, well, no errors can occur after the remote transaction is prepared. That's clearly incorrect. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas <robertmhaas@gmail.com> > That is completely unrealistic. As Sawada-san has pointed out > repeatedly, there are tons of things that can go wrong even after the > remote side has prepared the transaction. Preparing a transaction only > promises that the remote side will let you commit the transaction upon > request. It doesn't guarantee that you'll be able to make the request. > Like Sawada-san says, network problems, out of memory issues, or many > other things could stop that from happening. Someone could come along > in another session and run "ROLLBACK PREPARED" on the remote side, and > now the "COMMIT PREPARED" will never succeed no matter how many times > you try it. At least, not unless someone goes and creates a new > prepared transaction with the same 2PC identifier, but then you won't > be committing the correct transaction anyway. Or someone could take > the remote server and drop it in a volcano. How do you propose that we > avoid giving the user an error after the remote server has been > dropped into a volcano, even though the local node is still alive? I understand that. As I cited yesterday and possibly before, that's why xa_commit() returns various return codes. So, I have never suggested that FDWs should not report an error and always report success for the commit request. They should be allowed to report an error. The question I have been asking is how. With that said, we should only have two options; one is the return value of the FDW commit routine, and the other is via ereport(ERROR). I suggested the possibility of the former, because if the FDW does ereport(ERROR), Postgres core (transaction manager) may have difficulty in handling the rest of the participants. 
> Also, leaving aside theoretical arguments, I think it's not > realistically possible for an FDW author to write code to commit a > prepared transaction that will be safe in the context of running late > in PrepareTransaction(), after we've already done > RecordTransactionCommit(). Such code can't avoid throwing errors > because it can't avoid performing operations and allocating memory. I'm not completely sure about this. I thought (and said) that the only thing the FDW does would be to send a commit request through an existing connection. So, I think it's not a severe restriction to require FDWs not to do ereport(ERROR) during commits (of the second phase of 2PC.) > It's already been mentioned that, if an ERROR is thrown, it would be > reported to the user in place of the COMMIT acknowledgement that they > are expecting. Now, it has also been suggested that we could downgrade > the ERROR to a WARNING and still report the COMMIT. That doesn't sound > easy to do, because when the ERROR happens, control is going to jump > to AbortTransaction(). But even if you could hack it so it works like > that, it doesn't really solve the problem. What about all of the other > servers where the prepared transaction also needs to be committed? In > the design of PostgreSQL, in all circumstances, the way you recover > from an error is to abort the transaction. That is what brings the > system back to a clean state. You can't simply ignore the requirement > to abort the transaction and keep doing more work. It will never be > reliable, and Tom will instantaneously demand that any code that works > like that be reverted -- and for good reason. (I took "abort" as the same as "rollback" here.) Once we've sent commit requests to some participants, we can't abort the transaction. If one FDW returned an error halfway, we need to send commit requests to the rest of participants. 
It's a design question, as I repeatedly said, whether and how we should report the error of some participants to the client. For instance, how should we report the errors of multiple participants? Concatenate those error messages? Anyway, we should design the interface first, giving much thought and respecting the ideas of predecessors (TX/XA, MS DTC, JTA/JTS). Otherwise, we may end up like "We implemented like this, so the interface is like this and it can only behave like this, although you may find it strange..." That might be a situation similar to what your comment "the design of PostgreSQL, in all circumstances, the way you recover from an error is to abort the transaction" suggests -- Postgres doesn't have statement-level rollback. > I am not sure that it's 100% impossible to find a way to solve this > problem without just having the resolver do all the work, but I think > it's going to be extremely difficult. We tried to figure out some > vaguely similar things while working on undo, and it really didn't go > very well. The later stages of CommitTransaction() and > AbortTransaction() are places where very few kinds of code are safe to > execute, and finding a way to patch around that problem is not simple > either. If the resolver performance is poor, perhaps we could try to > find a way to improve it. I don't know. But I don't think it does any > good to say, well, no errors can occur after the remote transaction is > prepared. That's clearly incorrect. I don't think the resolver-based approach would bring us far enough. It's fundamentally a bottleneck. Such a background process should only handle commits whose requests failed to be sent due to server down. My requests are only twofold and haven't changed for long: design the FDW interface that implementors can naturally follow, and design to ensure performance. Regards Takayuki Tsunakawa
On Thu, Jun 10, 2021 at 9:58 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > I understand that. As I cited yesterday and possibly before, that's why xa_commit() returns various return codes. So, I have never suggested that FDWs should not report an error and always report success for the commit request. They should be allowed to report an error. In the text to which I was responding it seemed like you were saying the opposite. Perhaps I misunderstood. > The question I have been asking is how. With that said, we should only have two options; one is the return value of the FDW commit routine, and the other is via ereport(ERROR). I suggested the possibility of the former, because if the FDW does ereport(ERROR), Postgres core (transaction manager) may have difficulty in handling the rest of the participants. I don't think that is going to work. It is very difficult to write code that doesn't ever ERROR in PostgreSQL. It is not impossible if the operation is trivial enough, but I think you're greatly underestimating the complexity of committing the remote transaction. If somebody had designed PostgreSQL so that every function returns a return code and every time you call some other function you check that return code and pass any error up to your own caller, then there would be no problem here. But in fact the design was that at the first sign of trouble you throw an ERROR. It's not easy to depart from that programming model in just one place. > > Also, leaving aside theoretical arguments, I think it's not > > realistically possible for an FDW author to write code to commit a > > prepared transaction that will be safe in the context of running late > > in PrepareTransaction(), after we've already done > > RecordTransactionCommit(). Such code can't avoid throwing errors > > because it can't avoid performing operations and allocating memory. > > I'm not completely sure about this. 
I thought (and said) that the only thing the FDW does would be to send a commit request through an existing connection. So, I think it's not a severe restriction to require FDWs not to do ereport(ERROR) during commits (of the second phase of 2PC.) To send a commit request through an existing connection, you have to send some bytes over the network using a send() or write() system call. That can fail. Then you have to read the response back over the network using recv() or read(). That can also fail. You also need to parse the result that you get from the remote side, which can also fail, because you could get back garbage for some reason. And depending on the details, you might first need to construct the message you're going to send, which might be able to fail too. Also, the data might be encrypted using SSL, so you might have to decrypt it, which can also fail, and you might need to encrypt data before sending it, which can fail. In fact, if you're using OpenSSL, trying to call SSL_read() or SSL_write() can both read and write data from the socket, even multiple times, so you have extra opportunities to fail. > (I took "abort" as the same as "rollback" here.) Once we've sent commit requests to some participants, we can't abort the transaction. If one FDW returned an error halfway, we need to send commit requests to the rest of participants. I understand that it's not possible to abort the local transaction after it's been committed, but that doesn't mean that we're going to be able to send the commit requests to the rest of the participants. We want to be able to do that, certainly, but there's no guarantee that it's actually possible. Again, the remote servers may be dropped into a volcano, or less seriously, we may not be able to access them. Also, someone may kill off our session. > It's a design question, as I repeatedly said, whether and how we should report the error of some participants to the client. 
For instance, how should we report the errors of multiple participants? Concatenate those error messages? Sure, I agree that there are some questions about how to report errors. > Anyway, we should design the interface first, giving much thought and respecting the ideas of predecessors (TX/XA, MS DTC, JTA/JTS). Otherwise, we may end up like "We implemented like this, so the interface is like this and it can only behave like this, although you may find it strange..." That might be a situation similar to what your comment "the design of PostgreSQL, in all circumstances, the way you recover from an error is to abort the transaction" suggests -- Postgres doesn't have statement-level rollback. I think that's a valid concern, but we also have to have a plan that is realistic. Some things are indeed not possible in PostgreSQL's design. Also, some of these problems are things everyone has to somehow confront. There's no database doing 2PC that can't have a situation where one of the machines disappears unexpectedly due to some natural disaster or administrator interference. It might be the case that our inability to do certain things safely during transaction commit puts us out of compliance with the spec, but it can't be the case that some other system has no possible failures during transaction commit. The problem of the network potentially being disconnected between one packet and the next exists in every system. > I don't think the resolver-based approach would bring us far enough. It's fundamentally a bottleneck. Such a background process should only handle commits whose requests failed to be sent due to server down. Why is it fundamentally a bottleneck? It seems to me in some cases it could scale better than any other approach. If we have to commit on 100 shards in only one process we can only do those commits one at a time. If we can use resolver processes we could do all 100 at once if the user can afford to run that many resolvers, which should be way faster. 
It is true that if the resolver does not have a connection open and must open one, that might be slow, but presumably after that it can keep the connection open and reuse it for subsequent distributed transactions. I don't really see why that should be particularly slow. -- Robert Haas EDB: http://www.enterprisedb.com
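Robert's enumeration of failure points when "just sending a commit request" (construct the message, send, receive, parse) can be made concrete with a toy model. MockConn, CommitStep, and the fail_at injection are hypothetical stand-ins for a real socket, meant only to show that every stage has its own failure mode an FDW commit routine would have to handle:

```c
#include <string.h>

/* The stages of sending "COMMIT PREPARED" over an existing connection. */
typedef enum CommitStep
{
    STEP_BUILD = 0,             /* construct the message (can fail: no memory,
                                 * buffer too small) */
    STEP_SEND,                  /* send()/write() (can fail: network down, EPIPE) */
    STEP_RECV,                  /* recv()/read() (can fail: timeout, reset) */
    STEP_PARSE,                 /* parse the reply (can fail: garbage response) */
    STEP_DONE                   /* everything succeeded */
} CommitStep;

/* Mock connection: fail_at injects a failure at the given step. */
typedef struct MockConn
{
    CommitStep  fail_at;        /* STEP_DONE means "no injected failure" */
    const char *response;       /* what the "server" answers */
} MockConn;

/* Returns STEP_DONE on success, or the step that failed. */
static CommitStep
send_commit_prepared(MockConn *conn, const char *gid)
{
    char msg[64];

    /* 1. Construct the message: checked, because it can fail. */
    if (conn->fail_at == STEP_BUILD || strlen(gid) >= 32)
        return STEP_BUILD;
    strcpy(msg, "COMMIT PREPARED ");
    strcat(msg, gid);

    /* 2. send()/write() stand-in. */
    if (conn->fail_at == STEP_SEND)
        return STEP_SEND;

    /* 3. recv()/read() stand-in. */
    if (conn->fail_at == STEP_RECV)
        return STEP_RECV;

    /* 4. Parse the response: even a "successful" read can yield garbage. */
    if (conn->fail_at == STEP_PARSE || strcmp(conn->response, "COMMIT") != 0)
        return STEP_PARSE;

    return STEP_DONE;
}
```

With SSL in the picture each send/recv step gains further sub-steps (encrypt, decrypt, renegotiate), which is the point: there is no single place where a commit request "just succeeds."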
On 2021/05/11 13:37, Masahiko Sawada wrote: > I've attached the updated patches that incorporated comments from > Zhihong and Ikeda-san. Thanks for updating the patches! I'm still reading these patches, but I'd like to share some review comments that I found so far.

(1)
+/* Remove the foreign transaction from FdwXactParticipants */
+void
+FdwXactUnregisterXact(UserMapping *usermapping)
+{
+	Assert(IsTransactionState());
+	RemoveFdwXactEntry(usermapping->umid);
+}

Currently there is no user of FdwXactUnregisterXact(). This function should be removed?

(2) When I ran the regression test, I got the following failure.

========= Contents of ./src/test/modules/test_fdwxact/regression.diffs
diff -U3 /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out
--- /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out	2021-06-10 02:19:43.808622747+0000
+++ /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out	2021-06-10 02:29:53.452410462+0000
@@ -174,7 +174,7 @@
 SELECT count(*) FROM pg_foreign_xacts;
  count
 -------
-     1
+     4
 (1 row)

(3)
+				errmsg("could not read foreign transaction state from xlog at %X/%X",
+					   (uint32) (lsn >> 32),
+					   (uint32) lsn)));

LSN_FORMAT_ARGS() should be used?

(4)
+extern void RecreateFdwXactFile(TransactionId xid, Oid umid, void *content,
+								int len);

Since RecreateFdwXactFile() is used only in fdwxact.c, the above "extern" is not necessary?

(5)
+2. Pre-Commit phase (1st phase of two-phase commit)
+we record the corresponding WAL indicating that the foreign server is involved
+with the current transaction before doing PREPARE all foreign transactions.
+Thus, in case we lose connectivity to the foreign server or crash ourselves,
+we will remember that we might have prepared transaction on the foreign
+server, and try to resolve it when connectivity is restored or after crash
+recovery.

So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for the XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there for the WAL record to be replicated to the standby if sync replication is enabled? Otherwise, when a failover happens, the new primary (past standby) might not have enough XLOG_FDWXACT_INSERT WAL records and might fail to find some in-doubt foreign transactions.

(6) XLogFlush() is called for each foreign transaction. So if there are many foreign transactions, XLogFlush() is called too frequently, which might cause unnecessary performance overhead. Instead, for example, we should call XLogFlush() only once in FdwXactPrepareForeignTransactions() after inserting all WAL records for all foreign transactions?

(7)
 /* Open connection; report that we'll create a prepared statement. */
 fmstate->conn = GetConnection(user, true, &fmstate->conn_state);
+ MarkConnectionModified(user);

MarkConnectionModified() should be called also when TRUNCATE on a foreign table is executed?

Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
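Point (6) above -- one WAL flush per foreign transaction versus one flush after inserting all records -- can be sketched with a toy counter. flush_count, xlog_insert_record(), and xlog_flush() are illustrative stand-ins here, not the real PostgreSQL WAL APIs:

```c
/* Counter standing in for the number of (synchronous, fsync-like)
 * flush calls actually issued. */
static int flush_count;

static void
xlog_insert_record(void)
{
    /* append a WAL record to the in-memory buffer; no flush yet */
}

static void
xlog_flush(void)
{
    flush_count++;              /* each call models one fsync-equivalent */
}

/* Current behavior per point (6): flush after every participant's record. */
static int
prepare_flush_each(int nparticipants)
{
    flush_count = 0;
    for (int i = 0; i < nparticipants; i++)
    {
        xlog_insert_record();
        xlog_flush();
    }
    return flush_count;
}

/* Suggested behavior: insert all records first, then flush once. */
static int
prepare_flush_once(int nparticipants)
{
    flush_count = 0;
    for (int i = 0; i < nparticipants; i++)
        xlog_insert_record();
    xlog_flush();
    return flush_count;
}
```

Since the flush is the expensive, latency-bound step, batching turns O(n) fsync-equivalents per distributed transaction into one, at the cost of the records becoming durable slightly later.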
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas <robertmhaas@gmail.com> > On Thu, Jun 10, 2021 at 9:58 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > The question I have been asking is how. With that said, we should only have > two options; one is the return value of the FDW commit routine, and the other is > via ereport(ERROR). I suggested the possibility of the former, because if the > FDW does ereport(ERROR), Postgres core (transaction manager) may have > difficulty in handling the rest of the participants. > > I don't think that is going to work. It is very difficult to write > code that doesn't ever ERROR in PostgreSQL. It is not impossible if > the operation is trivial enough, but I think you're greatly > underestimating the complexity of committing the remote transaction. > If somebody had designed PostgreSQL so that every function returns a > return code and every time you call some other function you check that > return code and pass any error up to your own caller, then there would > be no problem here. But in fact the design was that at the first sign > of trouble you throw an ERROR. It's not easy to depart from that > programming model in just one place. > > I'm not completely sure about this. I thought (and said) that the only thing > the FDW does would be to send a commit request through an existing > connection. So, I think it's not a severe restriction to require FDWs not to do > ereport(ERROR) during commits (of the second phase of 2PC.) > > To send a commit request through an existing connection, you have to > send some bytes over the network using a send() or write() system > call. That can fail. Then you have to read the response back over the > network using recv() or read(). That can also fail. You also need to > parse the result that you get from the remote side, which can also > fail, because you could get back garbage for some reason. 
And > depending on the details, you might first need to construct the > message you're going to send, which might be able to fail too. Also, > the data might be encrypted using SSL, so you might have to decrypt > it, which can also fail, and you might need to encrypt data before > sending it, which can fail. In fact, if you're using OpenSSL, > trying to call SSL_read() or SSL_write() can both read and write data > from the socket, even multiple times, so you have extra opportunities > to fail. I know sending a commit request may get an error from various underlying functions, but we're talking about the client side, not Postgres's server side that could unexpectedly ereport(ERROR) somewhere. So, the new FDW commit routine won't lose control and can return an error code as its return value. For instance, the FDW commit routine for DBMS-X would typically be:

int
DBMSXCommit(...)
{
    int ret;

    /* extract info from the argument to pass to xa_commit() */

    ret = DBMSX_xa_commit(...);
    /* This is the actual commit function which is exposed to the app server
     * (e.g. Tuxedo) through the xa_commit() interface */

    /* map xa_commit() return values to the corresponding return values
     * of the FDW commit routine */
    switch (ret)
    {
        case XA_RMERR:
            ret = ...;
            break;
        ...
    }

    return ret;
}

> I think that's a valid concern, but we also have to have a plan that > is realistic. Some things are indeed not possible in PostgreSQL's > design. Also, some of these problems are things everyone has to > somehow confront. There's no database doing 2PC that can't have a > situation where one of the machines disappears unexpectedly due to > some natural disaster or administrator interference. It might be the > case that our inability to do certain things safely during transaction > commit puts us out of compliance with the spec, but it can't be the > case that some other system has no possible failures during > transaction commit. 
The problem of the network potentially being > disconnected between one packet and the next exists in every system. So, we need to design how commit behaves from the user's perspective. That's the functional design. We should figure out what's the desirable response of commit first, and then see if we can implement it or have to compromise in some way. I think we can reference the X/Open TX standard and/or JTS (Java Transaction Service) specification (I haven't had a chance to read them yet, though.) Just in case we can't find the requested commit behavior in the volcano case from those specifications, ... (I'm hesitant to say this because it may be hard,) it's desirable to follow representative products such as Tuxedo and GlassFish (the reference implementation of Java EE specs.) > > I don't think the resolver-based approach would bring us far enough. It's > fundamentally a bottleneck. Such a background process should only handle > commits whose requests failed to be sent due to server down. > > Why is it fundamentally a bottleneck? It seems to me in some cases it > could scale better than any other approach. If we have to commit on > 100 shards in only one process we can only do those commits one at a > time. If we can use resolver processes we could do all 100 at once if > the user can afford to run that many resolvers, which should be way > faster. It is true that if the resolver does not have a connection > open and must open one, that might be slow, but presumably after that > it can keep the connection open and reuse it for subsequent > distributed transactions. I don't really see why that should be > particularly slow. Concurrent transactions are serialized at the resolver. 
I heard that the current patch handles 2PC like this: the TM (transaction manager in Postgres core) requests prepare to the resolver, the resolver sends prepare to the remote server and waits for the reply, the TM gets back control from the resolver, the TM requests commit to the resolver, the resolver sends commit to the remote server and waits for the reply, and the TM gets back control. The resolver handles one transaction at a time. In regard to the case where one session has to commit on multiple remote servers, we're talking about the asynchronous interface just like what the XA standard provides. Regards Takayuki Tsunakawa
On Sun, Jun 13, 2021 at 10:04 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > I know sending a commit request may get an error from various underlying functions, but we're talking about the client side, not Postgres's server side that could unexpectedly ereport(ERROR) somewhere. So, the new FDW commit routine won't lose control and can return an error code as its return value. For instance, the FDW commit routine for DBMS-X would typically be:
>
> int
> DBMSXCommit(...)
> {
>     int ret;
>
>     /* extract info from the argument to pass to xa_commit() */
>
>     ret = DBMSX_xa_commit(...);
>     /* This is the actual commit function which is exposed to the app server (e.g. Tuxedo) through the xa_commit() interface */
>
>     /* map xa_commit() return values to the corresponding return values of the FDW commit routine */
>     switch (ret)
>     {
>         case XA_RMERR:
>             ret = ...;
>             break;
>         ...
>     }
>
>     return ret;
> }

Well, we're talking about running this commit routine from within CommitTransaction(), right? So I think it is in fact running in the server. And if that's so, then you have to worry about how to make it respond to interrupts. You can't just call some function DBMSX_xa_commit() and wait for infinite time for it to return. Look at pgfdw_get_result() for an example of what real code to do this looks like. > So, we need to design how commit behaves from the user's perspective. That's the functional design. We should figure out what's the desirable response of commit first, and then see if we can implement it or have to compromise in some way. I think we can reference the X/Open TX standard and/or JTS (Java Transaction Service) specification (I haven't had a chance to read them yet, though.) Just in case we can't find the requested commit behavior in the volcano case from those specifications, ... 
(I'm hesitant to say this because it may be hard,) it's desirable to follow representative products such as Tuxedo and GlassFish (the reference implementation of Java EE specs.) Honestly, I am not quite sure what any specification has to say about this. We're talking about what happens when a user does something with a foreign table and then types COMMIT. That's all about providing a set of behaviors that are consistent with how PostgreSQL works in other situations. You can't negotiate away the requirement to handle errors in a way that works with PostgreSQL's infrastructure, or the requirement that any lengthy operation handle interrupts properly, by appealing to a specification. > Concurrent transactions are serialized at the resolver. I heard that the current patch handles 2PC like this: the TM (transaction manager in Postgres core) requests prepare to the resolver, the resolver sends prepare to the remote server and waits for the reply, the TM gets back control from the resolver, the TM requests commit to the resolver, the resolver sends commit to the remote server and waits for the reply, and the TM gets back control. The resolver handles one transaction at a time. That sounds more like a limitation of the present implementation than a fundamental problem. We shouldn't reject the idea of having a resolver process handle this just because the initial implementation might be slow. If there's no fundamental problem with the idea, parallelism and concurrency can be improved in separate patches at a later time. It's much more important at this stage to reject ideas that are not theoretically sound. -- Robert Haas EDB: http://www.enterprisedb.com
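The interruptible-wait pattern of pgfdw_get_result() that Robert points to can be sketched as a loop that re-checks a cancel condition on every iteration instead of blocking indefinitely on the remote side. WaitCtx and the iteration counters are hypothetical stand-ins: cancel_after models CHECK_FOR_INTERRUPTS() firing, and ready_after models a socket-readiness check such as WaitLatchOrSocket() with a timeout:

```c
typedef enum WaitResult
{
    WAIT_OK,                    /* reply arrived */
    WAIT_CANCELED,              /* interrupt/cancel honored promptly */
    WAIT_TIMEOUT                /* gave up after max_iter polls */
} WaitResult;

typedef struct WaitCtx
{
    int ready_after;            /* iteration at which the reply "arrives" */
    int cancel_after;           /* iteration at which cancel is requested;
                                 * -1 means never */
} WaitCtx;

/* Poll for the commit reply, checking for cancellation each iteration
 * rather than blocking forever in a single recv(). */
static WaitResult
wait_for_commit_reply(WaitCtx *ctx, int max_iter)
{
    for (int i = 0; i < max_iter; i++)
    {
        /* Stand-in for CHECK_FOR_INTERRUPTS(): bail out promptly. */
        if (ctx->cancel_after >= 0 && i >= ctx->cancel_after)
            return WAIT_CANCELED;

        /* Stand-in for WaitLatchOrSocket() with a timeout. */
        if (i >= ctx->ready_after)
            return WAIT_OK;
    }
    return WAIT_TIMEOUT;
}
```

The design point is that cancellability comes from the structure of the wait loop, not from the underlying driver, which is why a blocking xa_commit()-style call (as in jdbc_fdw or the ODBC drivers discussed below) cannot be made interruptible from the outside.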
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas <robertmhaas@gmail.com> > Well, we're talking about running this commit routine from within > CommitTransaction(), right? So I think it is in fact running in the > server. And if that's so, then you have to worry about how to make it > respond to interrupts. You can't just call some functions > DBMSX_xa_commit() and wait for infinite time for it to return. Look at > pgfdw_get_result() for an example of what real code to do this looks > like. Postgres can do that, but other implementations cannot necessarily do it, I'm afraid. But before that, the FDW interface documentation doesn't describe anything about how to handle interrupts. Actually, odbc_fdw and possibly other FDWs don't respond to interrupts. > Honestly, I am not quite sure what any specification has to say about > this. We're talking about what happens when a user does something with > a foreign table and then type COMMIT. That's all about providing a set > of behaviors that are consistent with how PostgreSQL works in other > situations. You can't negotiate away the requirement to handle errors > in a way that works with PostgreSQL's infrastructure, or the > requirement that any length operation handle interrupts properly, by > appealing to a specification. What we're talking about here is mainly whether commit should return success or failure when some participants failed to commit in the second phase of 2PC. That's new to Postgres, isn't it? Anyway, we should respect existing relevant specifications and (well-known) implementations before we conclude that we have to devise our own behavior. > That sounds more like a limitation of the present implementation than > a fundamental problem. We shouldn't reject the idea of having a > resolver process handle this just because the initial implementation > might be slow. If there's no fundamental problem with the idea, > parallelism and concurrency can be improved in separate patches at a > later time. 
It's much more important at this stage to reject ideas > that are not theoretically sound. We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that involves the resolver from the beginning of 2PC processing. Honestly, I don't understand why such a "one prepare, one commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue best performance and even try to refrain from adding an if statement in a hot path. As I showed and Ikeda-san said, other implementations have each client session send prepare and commit requests. That's a natural way to achieve reasonable concurrency and performance. Regards Takayuki Tsunakawa
On Tue, Jun 15, 2021 at 5:51 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > Postgres can do that, but other implementations cannot necessarily do it, I'm afraid. But before that, the FDW interface documentation doesn't describe anything about how to handle interrupts. Actually, odbc_fdw and possibly other FDWs don't respond to interrupts. Well, I'd consider that a bug. > What we're talking about here is mainly whether commit should return success or failure when some participants failed to commit in the second phase of 2PC. That's new to Postgres, isn't it? Anyway, we should respect existing relevant specifications and (well-known) implementations before we conclude that we have to devise our own behavior. Sure ... but we can only decide to do things that the implementation can support, and running code that might fail after we've committed locally isn't one of them. > We talked about that, and unfortunately, I haven't seen a good and feasible idea to enhance the current approach that involves the resolver from the beginning of 2PC processing. Honestly, I don't understand why such a "one prepare, one commit in turn" serialization approach can be allowed in PostgreSQL, where developers pursue best performance and even try to refrain from adding an if statement in a hot path. As I showed and Ikeda-san said, other implementations have each client session send prepare and commit requests. That's a natural way to achieve reasonable concurrency and performance. I think your comparison here is quite unfair. We work hard to add overhead in hot paths where it might cost, but the FDW case involves a network round-trip anyway, so the cost of an if-statement would surely be insignificant. I feel like you want to assume without any evidence that a local resolver can never be quick enough, even though the cost of IPC between local processes shouldn't be that high compared to a network round trip. 
But you also want to suppose that we can run code that might fail late in the commit process even though there is lots of evidence that this will cause problems, starting with the code comments that clearly say so. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Transactions involving multiple postgres foreign servers, take 2
From: Robert Haas <robertmhaas@gmail.com> > On Tue, Jun 15, 2021 at 5:51 AM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > Postgres can do that, but other implementations cannot necessarily do it, I'm > afraid. But before that, the FDW interface documentation doesn't describe > anything about how to handle interrupts. Actually, odbc_fdw and possibly > other FDWs don't respond to interrupts. > > Well, I'd consider that a bug. I kind of hesitate to call it a bug... Unlike libpq, JDBC (for jdbc_fdw) doesn't have an asynchronous interface, and the Oracle and PostgreSQL ODBC drivers don't support an asynchronous interface. Even with libpq, COMMIT (and other SQL commands) is not always cancellable, e.g., when the (NFS) storage server gets hung while writing WAL. > > What we're talking about here is mainly whether commit should return success or > failure when some participants failed to commit in the second phase of 2PC. > That's new to Postgres, isn't it? Anyway, we should respect existing relevant > specifications and (well-known) implementations before we conclude that we > have to devise our own behavior. > > Sure ... but we can only decide to do things that the implementation > can support, and running code that might fail after we've committed > locally isn't one of them. Yes, I understand that Postgres may not be able to conform to specifications or well-known implementations in all aspects. I'm just suggesting to take the stance "We carefully considered established industry specifications that we can base on, did our best to design the desirable behavior learned from them, but couldn't implement a few parts", rather than "We did what we like and can do." > I think your comparison here is quite unfair. We work hard to add > overhead in hot paths where it might cost, but the FDW case involves a > network round-trip anyway, so the cost of an if-statement would surely > be insignificant. 
I feel like you want to assume without any evidence > that a local resolver can never be quick enough, even though the cost > of IPC between local processes shouldn't be that high compared to a > network round trip. But you also want to suppose that we can run code > that might fail late in the commit process even though there is lots > of evidence that this will cause problems, starting with the code > comments that clearly say so. There may be better examples. What I wanted to say is just that I believe it's not PG developers' standard to allow serial prepare and commit. Let's make it clear what makes it difficult to do 2PC from each client session in normal operation without going through the resolver. Regards Takayuki Tsunakawa
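The serial-versus-parallel trade-off being debated above can be made concrete with a toy latency model. This is only an illustrative sketch; the function, the cost breakdown (one round trip plus one WAL fsync per foreign server, plus a local commit fsync), and the numbers are assumptions for illustration, not measurements of the patch:

```python
# Toy 2PC latency model: serial vs. parallel PREPARE across N foreign
# servers. All costs are illustrative assumptions, not measurements.

def commit_latency(n_servers, rtt, wal_fsync, parallel):
    """Rough per-transaction latency of the first phase of 2PC plus the
    local commit: each foreign server costs one network round trip and
    one remote WAL fsync; the local commit record costs one more fsync."""
    per_server = rtt + wal_fsync
    phase1 = per_server if parallel else n_servers * per_server
    return phase1 + wal_fsync  # local commit record

# Three foreign servers, 1ms round trip, 2ms fsync (made-up numbers):
serial = commit_latency(3, rtt=1.0, wal_fsync=2.0, parallel=False)
overlapped = commit_latency(3, rtt=1.0, wal_fsync=2.0, parallel=True)
assert serial == 11.0 and overlapped == 5.0
```

Under this toy model the serial form grows linearly with the number of participants while the overlapped form stays flat, which is the shape of the argument both sides are making; the real constants depend on network latency and disk sync cost.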
RE: Transactions involving multiple postgres foreign servers, take 2
Hi Sawada-san, I also tried to play a bit with the latest patches similar to Ikeda-san, and with the foreign 2PC parameter enabled/required. > > >> b. about performance bottleneck (just share my simple benchmark > > >> results) > > >> > > >> The resolver process can be performance bottleneck easily although > > >> I think some users want this feature even if the performance is not so > good. > > >> > > >> I tested with very simple workload in my laptop. > > >> > > >> The test condition is > > >> * two remote foreign partitions and one transaction inserts an > > >> entry in each partitions. > > >> * local connection only. If NW latency became higher, the > > >> performance became worse. > > >> * pgbench with 8 clients. > > >> > > >> The test results is the following. The performance of 2PC is only > > >> 10% performance of the one of without 2PC. > > >> > > >> * with foreign_twophase_commit = required > > >> -> If load with more than 10 TPS, the number of unresolved foreign > > >> -> transactions > > >> is increasing and stop with the warning "Increase > > >> max_prepared_foreign_transactions". > > > > > > What was the value of max_prepared_foreign_transactions? > > > > Now, I tested with 200. > > > > If each resolution is finished very soon, I thought it's enough > > because 8 clients x 2 partitions = 16, though... But, it's difficult > > to know the stable values. > > During resolving one distributed transaction, the resolver needs both one > round trip and fsync-ing WAL record for each foreign transaction. > Since the client doesn’t wait for the distributed transaction to be resolved, > the resolver process can easily become the bottleneck given there are 8 clients. > > If foreign transaction resolution was resolved synchronously, 16 would > suffice. I tested the V36 patches on my 16-core machine. I set up two foreign servers (F1, F2). F1 has the addressbook table. F2 has the pgbench tables (scale factor = 1). 
There is also 1 coordinator (coor) server where I created user mappings to access the foreign servers. I executed the benchmark measurement on the coordinator. My custom scripts are set up in a way that queries from the coordinator would have to access the two foreign servers.

Coordinator:
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 1
foreign_twophase_commit = required

Other external servers 1 & 2 (F1 & F2):
max_prepared_transactions = 100

[select.sql]
\set int random(1, 100000)
BEGIN;
SELECT ad.name, ad.age, ac.abalance
FROM addressbook ad, pgbench_accounts ac
WHERE ad.id = :int AND ad.id = ac.aid;
COMMIT;

I then executed:
pgbench -r -c 2 -j 2 -T 60 -f select.sql coor

While there were no problems with 1-2 clients, I started having problems when running the benchmark with more than 3 clients.

pgbench -r -c 4 -j 4 -T 60 -f select.sql coor

I got the following error on the coordinator:

[95396] ERROR: could not prepare transaction on server F2 with ID fx_151455979_1216200_16422
[95396] STATEMENT: COMMIT;
WARNING: there is no transaction in progress
pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR: could not prepare transaction on server F2 with ID fx_151455979_1216200_16422

Here's the log on foreign server 2 <F2> matching the above error:
<F2> LOG: statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422'
<F2> ERROR: maximum number of prepared transactions reached
<F2> HINT: Increase max_prepared_transactions (currently 100).
<F2> STATEMENT: PREPARE TRANSACTION 'fx_151455979_1216200_16422'

So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200. 
It seems that we always run out of memory in FdwXactState insert_fdwxact with multiple concurrent connections during PREPARE TRANSACTION. I only encountered this for the SELECT benchmark. Although I've got no problems with multiple connections for my custom scripts for UPDATE and INSERT benchmarks when I tested up to 30 clients. Would the following possibly solve this bottleneck problem? > > > To speed up the foreign transaction resolution, some ideas have been > > > discussed. As another idea, how about launching resolvers for each > > > foreign server? That way, we resolve foreign transactions on each > > > foreign server in parallel. If foreign transactions are concentrated > > > on the particular server, we can have multiple resolvers for the one > > > foreign server. It doesn’t change the fact that all foreign > > > transaction resolutions are processed by resolver processes. > > > > Awesome! There seems to be another pro: that even if a foreign server > > is temporarily busy or stopped due to fail over, other foreign > > server's transactions can be resolved. > > Yes. We also might need to be careful about the order of foreign transaction > resolution. I think we need to resolve foreign transactions in arrival order at > least within a foreign server. Regards, Kirk Jamison
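The tuning difficulty described above has a simple queueing explanation: with the second phase of 2PC performed asynchronously, backends enqueue prepared foreign transactions faster than a single resolver can drain them, so no fixed limit is ever "right". A toy model (all rates and names are made-up assumptions, not measurements of the patch):

```python
# Toy queue model of the asynchronous second phase: backends add
# prepared foreign transactions at arrival_rate per tick, the resolver
# removes resolve_rate per tick. If arrival_rate > resolve_rate, any
# fixed max_prepared_foreign_transactions limit is eventually hit.

def steps_until_limit(arrival_rate, resolve_rate, limit, max_steps=100_000):
    """Tick at which the pending queue reaches the limit, or None if
    the resolver keeps up within max_steps ticks."""
    pending = 0.0
    for step in range(1, max_steps + 1):
        pending = max(0.0, pending + arrival_rate - resolve_rate)
        if pending >= limit:
            return step
    return None

# 16 prepares per tick (e.g. 8 clients x 2 servers) vs. 10 resolutions:
assert steps_until_limit(16, 10, 200) == 34   # limit reached
assert steps_until_limit(16, 10, 300) == 50   # a larger limit only delays it
assert steps_until_limit(8, 10, 200) is None  # resolver keeps up
```

This matches the observed behavior: raising max_prepared_foreign_transactions (or max_prepared_transactions on the foreign servers) only postpones the error while the resolver's throughput stays below the prepare rate.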
On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > On 2021/05/11 13:37, Masahiko Sawada wrote: > > I've attached the updated patches that incorporated comments from > > Zhihong and Ikeda-san. > > Thanks for updating the patches! > > I'm still reading these patches, but I'd like to share some review comments > that I found so far. Thank you for the comments! > > (1) > +/* Remove the foreign transaction from FdwXactParticipants */ > +void > +FdwXactUnregisterXact(UserMapping *usermapping) > +{ > + Assert(IsTransactionState()); > + RemoveFdwXactEntry(usermapping->umid); > +} > > Currently there is no user of FdwXactUnregisterXact(). > This function should be removed? I think that this function can be used by other FDW implementations to unregister a foreign transaction entry, although there is no use case in postgres_fdw. This function corresponds to xa_unreg in the XA specification. > > > (2) > When I ran the regression test, I got the following failure. > > ========= Contents of ./src/test/modules/test_fdwxact/regression.diffs > diff -U3 /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out > --- /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/expected/test_fdwxact.out 2021-06-10 02:19:43.808622747+0000 > +++ /home/runner/work/postgresql/postgresql/src/test/modules/test_fdwxact/results/test_fdwxact.out 2021-06-10 02:29:53.452410462+0000 > @@ -174,7 +174,7 @@ > SELECT count(*) FROM pg_foreign_xacts; > count > ------- > - 1 > + 4 > (1 row) Will fix. > > > (3) > + errmsg("could not read foreign transaction state from xlog at %X/%X", > + (uint32) (lsn >> 32), > + (uint32) lsn))); > > LSN_FORMAT_ARGS() should be used? Agreed. 
> > > (4) > +extern void RecreateFdwXactFile(TransactionId xid, Oid umid, void *content, > + int len); > > Since RecreateFdwXactFile() is used only in fdwxact.c, > the above "extern" is not necessary? Right. > > > (5) > +2. Pre-Commit phase (1st phase of two-phase commit) > +we record the corresponding WAL indicating that the foreign server is involved > +with the current transaction before doing PREPARE all foreign transactions. > +Thus, in case we loose connectivity to the foreign server or crash ourselves, > +we will remember that we might have prepared tranascation on the foreign > +server, and try to resolve it when connectivity is restored or after crash > +recovery. > > So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for > XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there > for WAL record to be replicated to the standby if sync replication is enabled? > Otherwise, when the failover happens, new primary (past-standby) > might not have enough XLOG_FDWXACT_INSERT WAL records and > might fail to find some in-doubt foreign transactions. But even if we wait for the record to be replicated, this problem isn't completely resolved, right? If the server crashes before the standby receives the record and the failover happens, then the new master doesn't have the record. I wonder if we need to have another FDW API in order to get the list of prepared transactions from the foreign server (FDW). For example, in the postgres_fdw case, it gets the list of prepared transactions on the foreign server by executing a query. It seems to me that this corresponds to xa_recover in the XA specification. > (6) > XLogFlush() is called for each foreign transaction. So if there are many > foreign transactions, XLogFlush() is called too frequently. Which might > cause unnecessary performance overhead? 
Instead, for example, > we should call XLogFlush() only at once in FdwXactPrepareForeignTransactions() > after inserting all WAL records for all foreign transactions? Agreed. > > > (7) > /* Open connection; report that we'll create a prepared statement. */ > fmstate->conn = GetConnection(user, true, &fmstate->conn_state); > + MarkConnectionModified(user); > > MarkConnectionModified() should be called also when TRUNCATE on > a foreign table is executed? Good catch. Will fix. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
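The batching suggestion agreed to in point (6) above can be sketched in miniature. The Wal class and function names here are stand-ins for illustration, not the real xlog API; the point is only the flush count:

```python
# Sketch of the point-(6) suggestion: instead of flushing WAL once per
# foreign-transaction record, insert all records first and flush once.
# Wal is a toy stand-in for the real WAL machinery, not the actual API.

class Wal:
    def __init__(self):
        self.flushes = 0
        self.records = []
    def insert(self, rec):
        self.records.append(rec)
    def flush(self):
        self.flushes += 1  # each flush models one fsync

def prepare_per_record_flush(wal, fdwxacts):
    for x in fdwxacts:
        wal.insert(("FDWXACT_INSERT", x))
        wal.flush()                # one fsync per participant

def prepare_batched_flush(wal, fdwxacts):
    for x in fdwxacts:
        wal.insert(("FDWXACT_INSERT", x))
    wal.flush()                    # single fsync for the whole transaction

a, b = Wal(), Wal()
prepare_per_record_flush(a, ["srv1", "srv2", "srv3"])
prepare_batched_flush(b, ["srv1", "srv2", "srv3"])
assert (a.flushes, b.flushes) == (3, 1)
assert a.records == b.records      # same WAL contents, fewer fsyncs
```

With N participating foreign servers the per-record form pays N fsyncs where the batched form pays one, while the durability point (everything flushed before any PREPARE is sent) is unchanged.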
On Thu, Jun 24, 2021 at 9:46 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > Hi Sawada-san, > > I also tried to play a bit with the latest patches similar to Ikeda-san, > and with foreign 2PC parameter enabled/required. Thank you for testing the patch! > > > > >> b. about performance bottleneck (just share my simple benchmark > > > >> results) > > > >> > > > >> The resolver process can be performance bottleneck easily although > > > >> I think some users want this feature even if the performance is not so > > good. > > > >> > > > >> I tested with very simple workload in my laptop. > > > >> > > > >> The test condition is > > > >> * two remote foreign partitions and one transaction inserts an > > > >> entry in each partitions. > > > >> * local connection only. If NW latency became higher, the > > > >> performance became worse. > > > >> * pgbench with 8 clients. > > > >> > > > >> The test results is the following. The performance of 2PC is only > > > >> 10% performance of the one of without 2PC. > > > >> > > > >> * with foreign_twophase_commit = requried > > > >> -> If load with more than 10TPS, the number of unresolved foreign > > > >> -> transactions > > > >> is increasing and stop with the warning "Increase > > > >> max_prepared_foreign_transactions". > > > > > > > > What was the value of max_prepared_foreign_transactions? > > > > > > Now, I tested with 200. > > > > > > If each resolution is finished very soon, I thought it's enough > > > because 8clients x 2partitions = 16, though... But, it's difficult how > > > to know the stable values. > > > > During resolving one distributed transaction, the resolver needs both one > > round trip and fsync-ing WAL record for each foreign transaction. > > Since the client doesn’t wait for the distributed transaction to be resolved, > > the resolver process can be easily bottle-neck given there are 8 clients. > > > > If foreign transaction resolution was resolved synchronously, 16 would > > suffice. 
> > > I tested the V36 patches on my 16-core machines. > I setup two foreign servers (F1, F2) . > F1 has addressbook table. > F2 has pgbench tables (scale factor = 1). > There is also 1 coordinator (coor) server where I created user mapping to access the foreign servers. > I executed the benchmark measurement on coordinator. > My custom scripts are setup in a way that queries from coordinator > would have to access the two foreign servers. > > Coordinator: > max_prepared_foreign_transactions = 200 > max_foreign_transaction_resolvers = 1 > foreign_twophase_commit = required > > Other external servers 1 & 2 (F1 & F2): > max_prepared_transactions = 100 > > > [select.sql] > \set int random(1, 100000) > BEGIN; > SELECT ad.name, ad.age, ac.abalance > FROM addressbook ad, pgbench_accounts ac > WHERE ad.id = :int AND ad.id = ac.aid; > COMMIT; > > I then executed: > pgbench -r -c 2 -j 2 -T 60 -f select.sql coor > > While there were no problems with 1-2 clients, I started having problems > when running the benchmark with more than 3 clients. > > pgbench -r -c 4 -j 4 -T 60 -f select.sql coor > > I got the following error on coordinator: > > [95396] ERROR: could not prepare transaction on server F2 with ID fx_151455979_1216200_16422 > [95396] STATEMENT: COMMIT; > WARNING: there is no transaction in progress > pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR: could not prepare transaction on server F2 withID fx_151455979_1216200_16422 > > Here's the log on foreign server 2 <F2> matching the above error: > <F2> LOG: statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422' > <F2> ERROR: maximum number of prepared transactions reached > <F2> HINT: Increase max_prepared_transactions (currently 100). > <F2> STATEMENT: PREPARE TRANSACTION 'fx_151455979_1216200_16422' > > So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200. 
> Then I got the error: > > [146926] ERROR: maximum number of foreign transactions reached > [146926] HINT: Increase max_prepared_foreign_transactions: "200". > > So I increased the max_prepared_foreign_transactions to "300", > and got the same error of need to increase the max_prepared_transactions of foreign servers. > > I just can't find the right tuning values for this. > It seems that we always run out of memory in FdwXactState insert_fdwxact > with multiple concurrent connections during PREPARE TRANSACTION. > This one I only encountered for SELECT benchmark. > Although I've got no problems with multiple connections for my custom scripts for > UPDATE and INSERT benchmarks when I tested up to 30 clients. > > Would the following possibly solve this bottleneck problem? With the following idea, the performance will get better but the problem will not be completely solved. That's because those results shared by you and Ikeda-san come from the fact that with the patch we asynchronously commit the foreign prepared transactions (i.e., asynchronously perform the second phase of 2PC), not from the architecture itself. As I mentioned before, I intentionally removed the part that synchronously commits foreign prepared transactions from the patch set since we still need to have a discussion of that part. Therefore, with this version of the patch, the backend returns OK to the client right after the local transaction commits, with neither committing foreign prepared transactions by itself nor waiting for those to be committed by the resolver process. As long as the backend doesn’t wait for foreign prepared transactions to be committed and there is a limit on the number of foreign prepared transactions to be held, it could reach the upper bound if committing foreign prepared transactions cannot keep up. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Thu, Jun 24, 2021 at 10:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > > > > > > > > (5) > > +2. Pre-Commit phase (1st phase of two-phase commit) > > +we record the corresponding WAL indicating that the foreign server is involved > > +with the current transaction before doing PREPARE all foreign transactions. > > +Thus, in case we loose connectivity to the foreign server or crash ourselves, > > +we will remember that we might have prepared tranascation on the foreign > > +server, and try to resolve it when connectivity is restored or after crash > > +recovery. > > > > So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for > > XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there > > for WAL record to be replicated to the standby if sync replication is enabled? > > Otherwise, when the failover happens, new primary (past-standby) > > might not have enough XLOG_FDWXACT_INSERT WAL records and > > might fail to find some in-doubt foreign transactions. > > But even if we wait for the record to be replicated, this problem > isn't completely resolved, right? Ah, I misunderstood the order of writing WAL records and preparing foreign transactions. You're right. Combining your suggestion below, perhaps we need to write all WAL records, call XLogFlush(), wait for those records to be replicated, and prepare all foreign transactions. Even in cases where the server crashes during preparing a foreign transaction and the failover happens, the new master has all foreign transaction entries. Some of them might not actually be prepared on the foreign servers but it should not be a problem. > > (6) > > XLogFlush() is called for each foreign transaction. So if there are many > > foreign transactions, XLogFlush() is called too frequently. Which might > > cause unnecessary performance overhead? 
Instead, for example, > > we should call XLogFlush() only at once in FdwXactPrepareForeignTransactions() > > after inserting all WAL records for all foreign transactions? Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On 2021/06/24 22:27, Masahiko Sawada wrote: > On Thu, Jun 24, 2021 at 9:46 PM k.jamison@fujitsu.com > <k.jamison@fujitsu.com> wrote: >> >> Hi Sawada-san, >> >> I also tried to play a bit with the latest patches similar to Ikeda-san, >> and with foreign 2PC parameter enabled/required. > > Thank you for testing the patch! > >> >>>>>> b. about performance bottleneck (just share my simple benchmark >>>>>> results) >>>>>> >>>>>> The resolver process can be performance bottleneck easily although >>>>>> I think some users want this feature even if the performance is not so >>> good. >>>>>> >>>>>> I tested with very simple workload in my laptop. >>>>>> >>>>>> The test condition is >>>>>> * two remote foreign partitions and one transaction inserts an >>>>>> entry in each partitions. >>>>>> * local connection only. If NW latency became higher, the >>>>>> performance became worse. >>>>>> * pgbench with 8 clients. >>>>>> >>>>>> The test results is the following. The performance of 2PC is only >>>>>> 10% performance of the one of without 2PC. >>>>>> >>>>>> * with foreign_twophase_commit = requried >>>>>> -> If load with more than 10TPS, the number of unresolved foreign >>>>>> -> transactions >>>>>> is increasing and stop with the warning "Increase >>>>>> max_prepared_foreign_transactions". >>>>> >>>>> What was the value of max_prepared_foreign_transactions? >>>> >>>> Now, I tested with 200. >>>> >>>> If each resolution is finished very soon, I thought it's enough >>>> because 8clients x 2partitions = 16, though... But, it's difficult how >>>> to know the stable values. >>> >>> During resolving one distributed transaction, the resolver needs both one >>> round trip and fsync-ing WAL record for each foreign transaction. >>> Since the client doesn’t wait for the distributed transaction to be resolved, >>> the resolver process can be easily bottle-neck given there are 8 clients. >>> >>> If foreign transaction resolution was resolved synchronously, 16 would >>> suffice. 
>> >> >> I tested the V36 patches on my 16-core machines. >> I setup two foreign servers (F1, F2) . >> F1 has addressbook table. >> F2 has pgbench tables (scale factor = 1). >> There is also 1 coordinator (coor) server where I created user mapping to access the foreign servers. >> I executed the benchmark measurement on coordinator. >> My custom scripts are setup in a way that queries from coordinator >> would have to access the two foreign servers. >> >> Coordinator: >> max_prepared_foreign_transactions = 200 >> max_foreign_transaction_resolvers = 1 >> foreign_twophase_commit = required >> >> Other external servers 1 & 2 (F1 & F2): >> max_prepared_transactions = 100 >> >> >> [select.sql] >> \set int random(1, 100000) >> BEGIN; >> SELECT ad.name, ad.age, ac.abalance >> FROM addressbook ad, pgbench_accounts ac >> WHERE ad.id = :int AND ad.id = ac.aid; >> COMMIT; >> >> I then executed: >> pgbench -r -c 2 -j 2 -T 60 -f select.sql coor >> >> While there were no problems with 1-2 clients, I started having problems >> when running the benchmark with more than 3 clients. >> >> pgbench -r -c 4 -j 4 -T 60 -f select.sql coor >> >> I got the following error on coordinator: >> >> [95396] ERROR: could not prepare transaction on server F2 with ID fx_151455979_1216200_16422 >> [95396] STATEMENT: COMMIT; >> WARNING: there is no transaction in progress >> pgbench: error: client 1 script 0 aborted in command 3 query 0: ERROR: could not prepare transaction on server F2 withID fx_151455979_1216200_16422 >> >> Here's the log on foreign server 2 <F2> matching the above error: >> <F2> LOG: statement: PREPARE TRANSACTION 'fx_151455979_1216200_16422' >> <F2> ERROR: maximum number of prepared transactions reached >> <F2> HINT: Increase max_prepared_transactions (currently 100). >> <F2> STATEMENT: PREPARE TRANSACTION 'fx_151455979_1216200_16422' >> >> So I increased the max_prepared_transactions of <F1> and <F2> from 100 to 200. 
>> Then I got the error: >> >> [146926] ERROR: maximum number of foreign transactions reached >> [146926] HINT: Increase max_prepared_foreign_transactions: "200". >> >> So I increased the max_prepared_foreign_transactions to "300", >> and got the same error of need to increase the max_prepared_transactions of foreign servers. >> >> I just can't find the right tuning values for this. >> It seems that we always run out of memory in FdwXactState insert_fdwxact >> with multiple concurrent connections during PREPARE TRANSACTION. >> This one I only encountered for SELECT benchmark. >> Although I've got no problems with multiple connections for my custom scripts for >> UPDATE and INSERT benchmarks when I tested up to 30 clients. >> >> Would the following possibly solve this bottleneck problem? > > With the following idea, the performance will get better but will not > be completely solved. Because those results shared by you and > Ikeda-san come from the fact that with the patch we asynchronously > commit the foreign prepared transaction (i.g., asynchronously > performing the second phase of 2PC), but not the architecture. As I > mentioned before, I intentionally removed the synchronous committing > foreign prepared transaction part from the patch set since we still > need to have a discussion of that part. Therefore, with this version > patch, the backend returns OK to the client right after the local > transaction commits with neither committing foreign prepared > transactions by itself nor waiting for those to be committed by the > resolver process. As long as the backend doesn’t wait for foreign > prepared transactions to be committed and there is a limit of the > number of foreign prepared transactions to be held, it could reach the > upper bound if committing foreign prepared transactions cannot keep > up. Hi Jamison-san, sawada-san, Thanks for testing! 
FWIW, I tested using pgbench with the "--rate=" option to check whether the server can execute transactions with stable throughput. As Sawada-san said, the latest patch resolves the second phase of 2PC asynchronously, so it's difficult to keep a stable throughput without the "--rate=" option. I also wondered what I should do when the error happens, because increasing "max_prepared_foreign_transactions" doesn't help. Since overloading can also cause the error, is it better to add that case to the HINT message? BTW, if Sawada-san has already developed the code to run the resolver processes in parallel, why not measure the performance improvement? Although Robert-san, Tsunakawa-san and others are discussing what architecture is best, one discussion point is that there is a performance risk in adopting the asynchronous approach. If we have promising solutions, I think we can move the discussion forward. In my understanding, there are three improvement ideas. The first is to make the resolver processes run in parallel. The second is to send "COMMIT/ABORT PREPARED" to remote servers in bulk. The third is to stop syncing the WAL in remove_fdwxact() after resolving is done, which I addressed in the mail sent on June 3rd, 13:56. Since the third idea has not yet been discussed, I may be misunderstanding something. -- Masahiro Ikeda NTT DATA CORPORATION
On 2021/06/24 22:11, Masahiko Sawada wrote: > On Sat, Jun 12, 2021 at 1:25 AM Fujii Masao <masao.fujii@oss.nttdata.com> wrote: >> On 2021/05/11 13:37, Masahiko Sawada wrote: >> So currently FdwXactInsertEntry() calls XLogInsert() and XLogFlush() for >> XLOG_FDWXACT_INSERT WAL record. Additionally we should also wait there >> for WAL record to be replicated to the standby if sync replication is enabled? >> Otherwise, when the failover happens, new primary (past-standby) >> might not have enough XLOG_FDWXACT_INSERT WAL records and >> might fail to find some in-doubt foreign transactions. > > But even if we wait for the record to be replicated, this problem > isn't completely resolved, right? If the server crashes before the > standby receives the record and the failover happens then the new > master doesn't have the record. I wonder if we need to have another > FDW API in order to get the list of prepared transactions from the > foreign server (FDW). For example in postgres_fdw case, it gets the > list of prepared transactions on the foreign server by executing a > query. It seems to me that this corresponds to xa_recover in the XA > specification. FWIW, Citus implements it the way Sawada-san said above [1]. Since each WAL record for PREPARE is flushed in the latest patch, the latency becomes too high, especially under synchronous replication. For example, a transaction involving three foreign servers must wait to sync "three" WAL records for PREPARE and "one" WAL record for the local commit, one by one, sequentially. So, I think that Sawada-san's idea is good for improving the latency, although the FDW developer's work increases. [1] SIGMOD 2021 525 Citus: Distributed PostgreSQL for Data Intensive Applications. From 12:27, it explains how to solve unresolved prepared xacts. https://www.youtube.com/watch?v=AlF4C60FdlQ&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq Regards, -- Masahiro Ikeda NTT DATA CORPORATION
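The xa_recover-style recovery step discussed above (and used by Citus) can be sketched as follows. This is a hedged illustration, not the patch's code: it assumes the coordinator can fetch the remote GID list (postgres_fdw could query pg_prepared_xacts) and that GIDs this coordinator created are recognizable by the "fx_" prefix seen in the IDs earlier in this thread:

```python
# Sketch of an xa_recover-style step after a crash: compare the GIDs
# prepared on a foreign server against the coordinator's local FdwXact
# state. GIDs we created remotely but have no local entry for are the
# in-doubt ones to resolve (commit or roll back) during recovery.
# The "fx_" prefix convention is an assumption for illustration.

def in_doubt_gids(remote_gids, local_fdwxact_gids):
    """Remote prepared GIDs of ours that lack a local FdwXact entry."""
    ours = [g for g in remote_gids if g.startswith("fx_")]
    return sorted(set(ours) - set(local_fdwxact_gids))

# GIDs as reported by the foreign server; one belongs to another app:
remote = ["fx_151455979_1216200_16422", "app_gid_1",
          "fx_151455979_1216201_16422"]
local = ["fx_151455979_1216200_16422"]  # the coordinator's surviving state
assert in_doubt_gids(remote, local) == ["fx_151455979_1216201_16422"]
```

The prefix filter matters because other applications may also hold prepared transactions on the same foreign server, and recovery must not touch those.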
On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: > > Hi Jamison-san, sawada-san, > > Thanks for testing! > > FWIW, I tested using pgbench with "--rate=" option to know the server > can execute transactions with stable throughput. As sawada-san said, > the latest patch resolved second phase of 2PC asynchronously. So, > it's difficult to control the stable throughput without "--rate=" option. > > I also worried what I should do when the error happened because to increase > "max_prepared_foreign_transaction" doesn't work. Since too overloading may > show the error, is it better to add the case to the HINT message? > > BTW, if sawada-san already develop to run the resolver processes in parallel, > why don't you measure performance improvement? Although Robert-san, > Tsunakawa-san and so on are discussing what architecture is best, one > discussion point is that there is a performance risk if adopting asynchronous > approach. If we have promising solutions, I think we can make the discussion > forward. Yeah, if we can asynchronously resolve the distributed transactions without worrying about the max_prepared_foreign_transactions error, it would be good. But we will need synchronous resolution at some point. I think we at least need to discuss it at this point. I've attached the new version patch that incorporates the comments from Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign server, committing prepared foreign transactions on foreign servers in parallel. To get better performance based on the current architecture, we can have multiple resolver processes per foreign server, but it seems not easy to tune in practice. Perhaps would it be better to simply have a pool of resolver processes and assign a resolver process to the resolution of one distributed transaction at a time? That way, we would need to launch as many resolver processes as there are concurrent backends using 2PC. 
> In my understanding, there are three improvement idea. First is that to make > the resolver processes run in parallel. Second is that to send "COMMIT/ABORT > PREPARED" remote servers in bulk. Third is to stop syncing the WAL > remove_fdwxact() after resolving is done, which I addressed in the mail sent > at June 3rd, 13:56. Since third idea is not yet discussed, there may > be my misunderstanding. Yes, those optimizations are promising. On the other hand, they could introduce complexity to the code and APIs. I'd like to keep the first version simple. I think we need to discuss them at this stage but can leave the implementation of both parallel execution and batch execution as future improvements. For the third idea, I think the implementation was wrong; it removes the state file then flushes the WAL record. I think these should be performed in the reverse order. Otherwise, FdwXactState entry could be left on the standby if the server crashes between them. I might be missing something though. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
- v37-0008-Documentation-update.patch
- v37-0006-postgres_fdw-marks-foreign-transaction-as-modifi.patch
- v37-0007-Add-GetPrepareId-API.patch
- v37-0009-Add-regression-tests-for-foreign-twophase-commit.patch
- v37-0005-Prepare-foreign-transactions-at-commit-time.patch
- v37-0004-postgres_fdw-supports-prepare-API.patch
- v37-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch
- v37-0001-Introduce-transaction-manager-for-foreign-transa.patch
- v37-0003-Support-two-phase-commit-for-foreign-transaction.patch
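The per-foreign-server resolver design described for the v37 patches above can be sketched as a simple partitioning of the pending work. This is an illustrative model only (the names and data shapes are assumptions, not the patch's structures): each server gets its own queue, which preserves per-server arrival order while letting different servers' queues drain in parallel.

```python
# Minimal sketch of "one resolver per foreign server": pending
# resolutions are partitioned by server, so each server's resolver
# drains its own queue in arrival order while servers proceed in
# parallel. Names are illustrative, not from the patch.

from collections import defaultdict

def partition_by_server(pending):
    """pending: list of (server, gid) pairs in global arrival order."""
    queues = defaultdict(list)
    for server, gid in pending:
        queues[server].append(gid)  # per-server arrival order preserved
    return dict(queues)

pending = [("F1", "g1"), ("F2", "g2"), ("F1", "g3"), ("F2", "g4")]
queues = partition_by_server(pending)
assert queues == {"F1": ["g1", "g3"], "F2": ["g2", "g4"]}
```

This also captures the property noted earlier in the thread: a busy or failed-over server only stalls its own queue, not the resolutions destined for other servers.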
On 2021/06/30 10:05, Masahiko Sawada wrote: > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote: >> >> Hi Jamison-san, sawada-san, >> >> Thanks for testing! >> >> FWIF, I tested using pgbench with "--rate=" option to know the server >> can execute transactions with stable throughput. As sawada-san said, >> the latest patch resolved second phase of 2PC asynchronously. So, >> it's difficult to control the stable throughput without "--rate=" option. >> >> I also worried what I should do when the error happened because to increase >> "max_prepared_foreign_transaction" doesn't work. Since too overloading may >> show the error, is it better to add the case to the HINT message? >> >> BTW, if sawada-san already develop to run the resolver processes in parallel, >> why don't you measure performance improvement? Although Robert-san, >> Tunakawa-san and so on are discussing what architecture is best, one >> discussion point is that there is a performance risk if adopting asynchronous >> approach. If we have promising solutions, I think we can make the discussion >> forward. > > Yeah, if we can asynchronously resolve the distributed transactions > without worrying about max_prepared_foreign_transaction error, it > would be good. But we will need synchronous resolution at some point. > I think we at least need to discuss it at this point. > > I've attached the new version patch that incorporates the comments > from Fujii-san and Ikeda-san I got so far. We launch a resolver > process per foreign server, committing prepared foreign transactions > on foreign servers in parallel. To get a better performance based on > the current architecture, we can have multiple resolver processes per > foreign server but it seems not easy to tune it in practice. Perhaps > is it better if we simply have a pool of resolver processes and we > assign a resolver process to the resolution of one distributed > transaction one by one? 
> That way, we need to launch as many resolver processes
> as there are concurrent backends using 2PC.

Thanks for updating the patches.

I have tested on my local laptop, and the summary is the following.

(1) The latest patch (v37) improves throughput by 1.5x compared to v36.

Although I expected a 2.0x improvement because the workload is one transaction
accessing two remote servers... I think the reason is that the disk is the
bottleneck and I couldn't prepare a separate disk for each PostgreSQL server.
If I could, I think the performance could improve by 2.0x.

(2) With the latest patch (v37), throughput with foreign_twophase_commit =
required is about 36% of the foreign_twophase_commit = disabled case.

Although the throughput is improved, the absolute performance is not good. It
may be the fate of 2PC. I think the reason is that the number of WAL writes
increases greatly, and the disk writes on my laptop are the bottleneck. I
would like to see results from richer environments if someone can test there.

(3) The latest patch (v37) has no overhead if foreign_twophase_commit =
disabled. On the contrary, performance improved by 3%, which may be within the
margin of error.

The test details are the following.

# condition

* 1 coordinator and 3 foreign servers
* 4 instances share one SSD disk
* one transaction queries two different foreign servers

``` fxact_update.pgbench
\set id random(1, 1000000)

\set partnum 3
\set p1 random(1, :partnum)
\set p2 ((:p1 + 1) % :partnum) + 1

BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

* pgbench generates the load. I increased ${RATE} little by little until the
"maximum number of foreign transactions reached" error happened.

```
pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
```

* parameters
max_prepared_transactions = 100
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 4

# test source code patterns

1.
2pc patches (v36) based on 6d0eb385 (foreign_twophase_commit = required).
2. 2pc patches (v37) based on 2595e039 (foreign_twophase_commit = required).
3. 2pc patches (v37) based on 2595e039 (foreign_twophase_commit = disabled).
4. 2595e039 without the 2pc patches (v37).

# results

1. tps = 241.8000
latency average = 10.413 ms

2. tps = 359.017519 (1.5x compared to 1; 36% of 3)
latency average = 15.427 ms

3. tps = 987.372220 (1.03x compared to 4)
latency average = 8.102 ms

4. tps = 955.984574
latency average = 8.368 ms

The disk is the bottleneck in my environment because disk utilization is
almost 100% in every pattern. If a separate disk could be prepared for each
instance, I think we could expect more performance improvement.

>> In my understanding, there are three improvement ideas. The first is to make
>> the resolver processes run in parallel. The second is to send "COMMIT/ABORT
>> PREPARED" to remote servers in bulk. The third is to stop syncing the WAL in
>> remove_fdwxact() after resolution is done, which I addressed in the mail sent
>> on June 3rd, 13:56. Since the third idea has not yet been discussed, there
>> may be some misunderstanding on my part.
>
> Yes, those optimizations are promising. On the other hand, they could
> introduce complexity to the code and APIs. I'd like to keep the first
> version simple. I think we need to discuss them at this stage but can
> leave the implementation of both parallel execution and batch
> execution as future improvements.

OK, I agree.

> For the third idea, I think the implementation was wrong; it removes
> the state file then flushes the WAL record. I think these should be
> performed in the reverse order. Otherwise, the FdwXactState entry could be
> left on the standby if the server crashes between them. I might be
> missing something though.

Oh, I see. I think you're right, though what you wanted to say is that it
flushes the WAL records then removes the state file.
If "COMMIT/ABORT PREPARED" statements are executed in bulk, it seems enough to
sync the WAL only once, then remove all related state files.

BTW, I tested building the binary with -O2, and I got the following warning.
It needs to be fixed.

```
fdwxact.c: In function 'PrepareAllFdwXacts':
fdwxact.c:897:13: warning: 'flush_lsn' may be used uninitialized in this
function [-Wmaybe-uninitialized]
  897 |         canceled = SyncRepWaitForLSN(flush_lsn, false);
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
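As a quick sanity check of the workload above (an illustrative snippet, not
part of the patch), the partition-selection arithmetic in fxact_update.pgbench
always picks two distinct partitions, so every transaction really does touch
two different foreign servers:

```python
# Mirrors the pgbench variables: p1 = random(1, partnum),
# p2 = ((p1 + 1) % partnum) + 1, with partnum = 3.
def second_partition(p1, partnum=3):
    return ((p1 + 1) % partnum) + 1

for p1 in range(1, 4):
    p2 = second_partition(p1)
    # p2 is always in 1..3 and never equal to p1
    assert 1 <= p2 <= 3 and p2 != p1
    print(p1, p2)  # prints: 1 3, then 2 1, then 3 2
```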
RE: Transactions involving multiple postgres foreign servers, take 2
Hi,

I'm interested in this patch and I also ran the same test with Ikeda-san's
fxact_update.pgbench. In my environment (a poor-spec VM), the results are the
following.

* foreign_twophase_commit = disabled
363 tps

* foreign_twophase_commit = required (it is necessary to set -R ${RATE} as Ikeda-san said)
13 tps

I analyzed the bottleneck using pstack and strace. I noticed that the open()
during the "COMMIT PREPARED" command is very slow.

In my environment the latency of "COMMIT PREPARED" is 16 ms. (On the other
hand, the latency of "COMMIT" and "PREPARE TRANSACTION" is 1 ms.) In the
"COMMIT PREPARED" command, open() for the WAL segment file takes 14 ms.
Therefore, open() is the bottleneck of "COMMIT PREPARED". Furthermore, I
noticed that the backend process almost always opens the same WAL segment
file.

In the current patch, the backend process on the foreign server which is
associated with the connection from the resolver process always runs the
"COMMIT PREPARED" command. Therefore, the WAL segment file of the current
"COMMIT PREPARED" command is probably the same as that of the previous
"COMMIT PREPARED" command.

In order to improve the performance of the resolver process, I think it would
be useful to skip closing the WAL segment file during "COMMIT PREPARED" and
reuse the file descriptor. Is it possible?

Regards,
Ryohei Takahashi
RE: Transactions involving multiple postgres foreign servers, take 2
On Wed, June 30, 2021 10:06 (GMT+9), Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments from
> Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign
> server, committing prepared foreign transactions on foreign servers in parallel.

Hi Sawada-san,

Thank you for the latest set of patches. I've noticed from cfbot that the
regression test failed, and I also could not compile it.

============== running regression test queries ==============
test test_fdwxact ... FAILED 21 ms
============== shutting down postmaster ==============

======================
1 of 1 tests failed.
======================

> To get a better performance based on the current architecture, we can have
> multiple resolver processes per foreign server but it seems not easy to tune it
> in practice. Perhaps is it better if we simply have a pool of resolver processes
> and we assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes as
> many as the concurrent backends using 2PC.

Yes, finding the right values to tune for max_prepared_foreign_transactions
and max_prepared_transactions seems difficult. If we set the number of
resolver processes to the number of concurrent backends using 2PC, how do we
determine the value of max_foreign_transaction_resolvers? It might be good to
collect some statistics to judge the value; then we can compare the
performance against the v37 version.

Also, this is a bit of a side topic. I know we've been discussing how to
improve/fix the resolver process bottlenecks, and Takahashi-san provided the
details in the thread above where v37 has problems. (I am joining the testing
too.) I am not sure if this has been brought up before because of the years of
thread, but I think we need to prevent the resolver process from entering an
infinite wait loop while resolving a prepared foreign transaction.
Currently, when a crashed foreign server is recovered during resolution
retries, the information is recovered from WAL and files, and the resolver
process resumes the foreign transaction resolution. However, what if we cannot
(or intentionally do not want to) recover the crashed server for a long time?

An idea is to make the resolver process stop automatically after some maximum
number of retries. We could call the parameter
foreign_transaction_resolution_max_retry_count. There may be a better name,
but I followed the pattern from your patch.

The server downtime can be estimated using the proposed parameter
foreign_transaction_resolution_retry_interval (default 10s) from the patch
set. In addition, according to the docs, "a foreign server using the
postgres_fdw foreign data wrapper can have the same options that libpq accepts
in connection strings", so the connect_timeout set during CREATE SERVER can
also affect it.

Example:
CREATE SERVER's connect_timeout setting = 5s
foreign_transaction_resolution_retry_interval = 10s
foreign_transaction_resolution_max_retry_count = 3

Estimated total time before resolver stops:
= (5s) * (3 + 1) + (10s) * (3) = 50s

00s: 1st connect start
05s: 1st connect timeout (retry interval)
15s: 2nd connect start (1st retry)
20s: 2nd connect timeout (retry interval)
30s: 3rd connect start (2nd retry)
35s: 3rd connect timeout (retry interval)
45s: 4th connect start (3rd retry)
50s: 4th connect timeout (resolver process stops)

Then the resolver process will not wait indefinitely and will stop after some
time, depending on the settings of the above parameters. This could be an
automatic counterpart of pg_stop_foreign_xact_resolver. Assuming the resolver
has stopped and the crashed server is later restored, the user can then
execute pg_resolve_foreign_xact().

Do you think the idea is feasible and we can add it as part of the patch sets?

Regards,
Kirk Jamison
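The estimate above follows a simple formula: there are (retries + 1)
connection attempts, each waiting connect_timeout, and each retry is preceded
by the retry interval. A small sketch (illustrative only; the parameter names
come from the proposal, the function itself is hypothetical):

```python
def resolver_stop_estimate(connect_timeout, retry_interval, max_retry_count):
    """Seconds until the resolver gives up: (max_retry_count + 1)
    connection attempts of connect_timeout each, separated by
    retry_interval between attempts."""
    attempts = max_retry_count + 1
    return connect_timeout * attempts + retry_interval * max_retry_count

# connect_timeout = 5s, retry interval = 10s, 3 retries -> 50s, as above
print(resolver_stop_estimate(5, 10, 3))  # prints 50
```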
On 2021/06/30 10:05, Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments
> from Fujii-san and Ikeda-san I got so far.

Thanks for updating the patches!

I'm now reading the 0001 and 0002 patches and wondering if we can commit them
first, because they just provide an independent basic mechanism for foreign
transaction management.

One question regarding them: why did we add the new API only for "top" foreign
transactions? Even with those patches, the old API (CallSubXactCallbacks) is
still being used for foreign subtransactions, and xact_depth is still being
managed in the postgres_fdw layer (not PostgreSQL core). Is this intentional?
Sorry if this was already discussed before.

As far as I read the code, keeping the old API for foreign subtransactions
doesn't cause any actual bug. But it's just strange and half-baked to manage
top and sub transactions in different layers and to use the old and new APIs
for them. OTOH, I'm afraid that adding a new (non-essential) API for foreign
subtransactions might increase the code complexity unnecessarily. Thoughts?

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Sorry for the late reply.

On Mon, Jul 5, 2021 at 3:29 PM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote:
> >>
> >> Hi Jamison-san, sawada-san,
> >>
> >> Thanks for testing!
> >>
> >> FWIW, I tested using pgbench with the "--rate=" option to check that the server
> >> can execute transactions with stable throughput. As sawada-san said,
> >> the latest patch resolves the second phase of 2PC asynchronously. So,
> >> it's difficult to control the stable throughput without the "--rate=" option.
> >>
> >> I also wondered what I should do when the error happened, because increasing
> >> "max_prepared_foreign_transactions" doesn't work. Since overloading may
> >> trigger the error, is it better to add the case to the HINT message?
> >>
> >> BTW, if sawada-san has already developed running the resolver processes in parallel,
> >> why don't you measure the performance improvement? Although Robert-san,
> >> Tsunakawa-san and others are discussing which architecture is best, one
> >> discussion point is that there is a performance risk in adopting the
> >> asynchronous approach. If we have promising solutions, I think we can
> >> move the discussion forward.
> >
> > Yeah, if we can asynchronously resolve the distributed transactions
> > without worrying about the max_prepared_foreign_transactions error, it
> > would be good. But we will need synchronous resolution at some point.
> > I think we at least need to discuss it at this point.
> >
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far. We launch a resolver
> > process per foreign server, committing prepared foreign transactions
> > on foreign servers in parallel. To get a better performance based on
> > the current architecture, we can have multiple resolver processes per
> > foreign server but it seems not easy to tune it in practice.
> > Perhaps
> > is it better if we simply have a pool of resolver processes and we
> > assign a resolver process to the resolution of one distributed
> > transaction one by one? That way, we need to launch resolver processes
> > as many as the concurrent backends using 2PC.
>
> Thanks for updating the patches.
>
> I have tested on my local laptop and the summary is the following.

Thank you for testing!

> (1) The latest patch (v37) can improve throughput by 1.5 times compared to v36.
>
> Although I expected it to improve by 2.0 times because the workload is that one
> transaction accesses two remote servers... I think the reason is that the disk
> is the bottleneck and I couldn't prepare disks for each postgresql server. If I
> could, I think the performance could be improved by 2.0 times.
>
> (2) The latest patch (v37) throughput with foreign_twophase_commit = required
> is about 36% of the case with foreign_twophase_commit = disabled.
>
> Although the throughput is improved, the absolute performance is not good. It
> may be the fate of 2PC. I think the reason is that the number of WAL writes
> increases greatly and the disk writes on my laptop are the bottleneck. I want
> to know the results of testing in richer environments if someone can do so.
>
> (3) The latest patch (v37) has no overhead if foreign_twophase_commit =
> disabled. On the contrary, the performance improved by 3%. It may be within
> the margin of error.
>
> The test details are the following.
>
> # condition
>
> * 1 coordinator and 3 foreign servers
> * 4 instances share one ssd disk.
> * one transaction queries two different foreign servers.
>
> ``` fxact_update.pgbench
> \set id random(1, 1000000)
>
> \set partnum 3
> \set p1 random(1, :partnum)
> \set p2 ((:p1 + 1) % :partnum) + 1
>
> BEGIN;
> UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
> COMMIT;
> ```
>
> * pgbench generates the load.
> I increased ${RATE} little by little until the "maximum
> number of foreign transactions reached" error happened.
>
> ```
> pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
> ```
>
> * parameters
> max_prepared_transactions = 100
> max_prepared_foreign_transactions = 200
> max_foreign_transaction_resolvers = 4
>
> # test source code patterns
>
> 1. 2pc patches (v36) based on 6d0eb385 (foreign_twophase_commit = required).
> 2. 2pc patches (v37) based on 2595e039 (foreign_twophase_commit = required).
> 3. 2pc patches (v37) based on 2595e039 (foreign_twophase_commit = disabled).
> 4. 2595e039 without the 2pc patches (v37).
>
> # results
>
> 1. tps = 241.8000
> latency average = 10.413 ms
>
> 2. tps = 359.017519 (1.5x compared to 1; 36% of 3)
> latency average = 15.427 ms
>
> 3. tps = 987.372220 (1.03x compared to 4)
> latency average = 8.102 ms
>
> 4. tps = 955.984574
> latency average = 8.368 ms
>
> The disk is the bottleneck in my environment because disk utilization is
> almost 100% in every pattern. If disks for each instance could be prepared,
> I think we could expect more performance improvement.

It still seems not good performance. I'll also test using your script.

> >> In my understanding, there are three improvement ideas. The first is to make
> >> the resolver processes run in parallel. The second is to send "COMMIT/ABORT
> >> PREPARED" to remote servers in bulk. The third is to stop syncing the WAL in
> >> remove_fdwxact() after resolution is done, which I addressed in the mail sent
> >> on June 3rd, 13:56. Since the third idea has not yet been discussed, there
> >> may be some misunderstanding on my part.
> >
> > Yes, those optimizations are promising. On the other hand, they could
> > introduce complexity to the code and APIs. I'd like to keep the first
> > version simple. I think we need to discuss them at this stage but can
> > leave the implementation of both parallel execution and batch
> > execution as future improvements.
>
> OK, I agree.
> > For the third idea, I think the implementation was wrong; it removes
> > the state file then flushes the WAL record. I think these should be
> > performed in the reverse order. Otherwise, the FdwXactState entry could be
> > left on the standby if the server crashes between them. I might be
> > missing something though.
>
> Oh, I see. I think you're right, though what you wanted to say is that it
> flushes the WAL records then removes the state file. If "COMMIT/ABORT
> PREPARED" statements are executed in bulk, it seems enough to sync the WAL
> only once, then remove all related state files.
>
> BTW, I tested building the binary with -O2, and I got the following warning.
> It needs to be fixed.
>
> ```
> fdwxact.c: In function 'PrepareAllFdwXacts':
> fdwxact.c:897:13: warning: 'flush_lsn' may be used uninitialized in this
> function [-Wmaybe-uninitialized]
>   897 |         canceled = SyncRepWaitForLSN(flush_lsn, false);
>       |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ```

Thank you for the report. I'll fix it in the next version patch.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Sorry for the late reply.

On Tue, Jul 6, 2021 at 3:15 PM r.takahashi_2@fujitsu.com
<r.takahashi_2@fujitsu.com> wrote:
>
> Hi,
>
> I'm interested in this patch and I also ran the same test with Ikeda-san's fxact_update.pgbench.

Thank you for testing!

> In my environment (a poor-spec VM), the results are the following.
>
> * foreign_twophase_commit = disabled
> 363 tps
>
> * foreign_twophase_commit = required (it is necessary to set -R ${RATE} as Ikeda-san said)
> 13 tps
>
> I analyzed the bottleneck using pstack and strace.
> I noticed that the open() during the "COMMIT PREPARED" command is very slow.
>
> In my environment the latency of "COMMIT PREPARED" is 16 ms.
> (On the other hand, the latency of "COMMIT" and "PREPARE TRANSACTION" is 1 ms.)
> In the "COMMIT PREPARED" command, open() for the WAL segment file takes 14 ms.
> Therefore, open() is the bottleneck of "COMMIT PREPARED".
> Furthermore, I noticed that the backend process almost always opens the same
> WAL segment file.
>
> In the current patch, the backend process on the foreign server which is
> associated with the connection from the resolver process always runs the
> "COMMIT PREPARED" command. Therefore, the WAL segment file of the current
> "COMMIT PREPARED" command is probably the same as that of the previous
> "COMMIT PREPARED" command.
>
> In order to improve the performance of the resolver process, I think it would
> be useful to skip closing the WAL segment file during "COMMIT PREPARED" and
> reuse the file descriptor. Is it possible?

Not sure, but it might be possible to keep holding an xlogreader for reading
PREPARE WAL records even after the transaction commits. But I wonder how much
open() for the WAL segment file accounts for the total execution time of 2PC.
2PC requires two network round trips for each participant. For example, if it
took 500 ms in total, we would not get much benefit from the point of view of
2PC performance even if we improved it from 14 ms to 1 ms.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> On 2021/06/30 10:05, Masahiko Sawada wrote:
> > I've attached the new version patch that incorporates the comments
> > from Fujii-san and Ikeda-san I got so far.
>
> Thanks for updating the patches!
>
> I'm now reading 0001 and 0002 patches and wondering if we can commit them
> at first because they just provide independent basic mechanism for
> foreign transaction management.
>
> One question regarding them is; Why did we add new API only for "top" foreign
> transaction? Even with those patches, old API (CallSubXactCallbacks) is still
> being used for foreign subtransaction and xact_depth is still being managed
> in postgres_fdw layer (not PostgreSQL core). Is this intentional?

Yes. It's not needed for 2PC support, and I was also concerned about adding
complexity to the core by introducing a new API for subtransactions that is
not necessary for 2PC.

> As far as I read the code, keep using old API for foreign subtransaction doesn't
> cause any actual bug. But it's just strange and half-baked to manage top and
> sub transaction in the different layer and to use old and new API for them.

That's a valid concern. I'm really not sure what we should do here, but I
guess that even if we want to support subtransactions, we would have another
API dedicated to subtransaction commit and rollback.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
Hi Sawada-san,

Thank you for your reply.

> Not sure but it might be possible to keep holding an xlogreader for
> reading PREPARE WAL records even after the transaction commit. But I
> wonder how much open() for wal segment file accounts for the total
> execution time of 2PC. 2PC requires 2 network round trips for each
> participant. For example, if it took 500ms in total, we would not get
> benefits much from the point of view of 2PC performance even if we
> improved it from 14ms to 1ms.

I made the patch based on your advice and re-ran the test on the new machine.
(The attached patch is just for test purposes.)

* foreign_twophase_commit = disabled
2686 tps

* foreign_twophase_commit = required (it is necessary to set -R ${RATE} as Ikeda-san said)
311 tps

* foreign_twophase_commit = required with attached patch (it is not necessary to set -R ${RATE})
2057 tps

This indicates that if we can reduce the number of open() calls on the WAL
segment file during "COMMIT PREPARED", the performance can be improved.

This patch can skip closing the WAL segment file, but I don't know when we
should close it. One idea is to close it when the WAL segment file is
recycled, but it seems difficult for the backend process to do so.

BTW, in a previous discussion, "send COMMIT PREPARED to remote servers in
bulk" was proposed. I imagined a new SQL interface like "COMMIT PREPARED
'prep_1', 'prep_2', ... 'prep_n'". If we can keep the WAL segment file open
during a bulk COMMIT PREPARED, we can not only reduce the number of
communications, but also reduce the number of open() calls on the WAL segment
file.

Regards,
Ryohei Takahashi
> Hi Sawada-san,
>
> Thank you for your reply.
>
> > Not sure but it might be possible to keep holding an xlogreader for
> > reading PREPARE WAL records even after the transaction commit. But I
> > wonder how much open() for wal segment file accounts for the total
> > execution time of 2PC. 2PC requires 2 network round trips for each
> > participant. For example, if it took 500ms in total, we would not get
> > benefits much from the point of view of 2PC performance even if we
> > improved it from 14ms to 1ms.
>
> I made the patch based on your advice and re-run the test on the new machine.
> (The attached patch is just for test purpose.)

Wouldn't it be better to explicitly initialize the pointer with NULL? I think
it's common in Postgres.

> * foreign_twophase_commit = disabled
> 2686tps
>
> * foreign_twophase_commit = required (It is necessary to set -R ${RATE} as Ikeda-san said)
> 311tps
>
> * foreign_twophase_commit = required with attached patch (It is not necessary to set -R ${RATE})
> 2057tps
RE: Transactions involving multiple postgres foreign servers, take 2
Hi,

> Wouldn't it be better to explicitly initialize the pointer with NULL?

Thank you for your advice. You are correct. Anyway, I fixed it and re-ran the
performance test; it of course does not affect tps.

Regards,
Ryohei Takahashi
On Tue, Jul 13, 2021 at 1:14 PM r.takahashi_2@fujitsu.com
<r.takahashi_2@fujitsu.com> wrote:
>
> Hi Sawada-san,
>
> Thank you for your reply.
>
> > Not sure but it might be possible to keep holding an xlogreader for
> > reading PREPARE WAL records even after the transaction commit. But I
> > wonder how much open() for wal segment file accounts for the total
> > execution time of 2PC. 2PC requires 2 network round trips for each
> > participant. For example, if it took 500ms in total, we would not get
> > benefits much from the point of view of 2PC performance even if we
> > improved it from 14ms to 1ms.
>
> I made the patch based on your advice and re-ran the test on the new machine.
> (The attached patch is just for test purposes.)

Thank you for testing!

> * foreign_twophase_commit = disabled
> 2686 tps
>
> * foreign_twophase_commit = required (it is necessary to set -R ${RATE} as Ikeda-san said)
> 311 tps
>
> * foreign_twophase_commit = required with attached patch (it is not necessary to set -R ${RATE})
> 2057 tps

Nice improvement! BTW did you test locally? That is, are the foreign servers
located on the same machine?

> This indicates that if we can reduce the number of open() calls on the WAL
> segment file during "COMMIT PREPARED", the performance can be improved.
>
> This patch can skip closing the WAL segment file, but I don't know when we
> should close it. One idea is to close it when the WAL segment file is
> recycled, but it seems difficult for the backend process to do so.

I guess it would be better to start a new thread for this improvement. This
idea helps not only the 2PC case but also improves COMMIT/ROLLBACK PREPARED
performance itself. Rather than thinking of it as tied to this patch, I think
it's good if we can discuss it separately and it gets committed alone.

> BTW, in a previous discussion, "send COMMIT PREPARED to remote servers in
> bulk" was proposed. I imagined a new SQL interface like "COMMIT PREPARED
> 'prep_1', 'prep_2', ... 'prep_n'".
> If we can keep the WAL segment file open during a bulk COMMIT PREPARED, we
> can not only reduce the number of communications, but also reduce the number
> of open() calls on the WAL segment file.

What if we successfully committed 'prep_1' but an error happened while
committing another one for some reason (e.g., corrupted 2PC state file, OOM,
etc.)? We might return an error to the client but have already committed
'prep_1'.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
RE: Transactions involving multiple postgres foreign servers, take 2
Hi Sawada-san,

Thank you for your reply.

> BTW did you test locally? That is, are the foreign servers
> located on the same machine?

Yes, I tested locally, since I cannot prepare a good network environment now.

> I guess it would be better to start a new thread for this improvement.

Thank you for your advice. I started a new thread [1].

> What if we successfully committed 'prep_1' but an error happened
> while committing another one for some reason (e.g., corrupted 2PC
> state file, OOM, etc.)? We might return an error to the client but have
> already committed 'prep_1'.

Sorry, I don't have a good idea now. I imagine the command could return the
list of the transaction ids that ended with an error.

[1] https://www.postgresql.org/message-id/OS0PR01MB56828019B25CD5190AB6093282129%40OS0PR01MB5682.jpnprd01.prod.outlook.com

Regards,
Ryohei Takahashi
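Takahashi-san's suggestion, returning the list of transactions that failed to
commit, could behave roughly like this (a hypothetical sketch of the semantics
only, not the proposed server implementation; the function names are
invented):

```python
def bulk_commit_prepared(gids, commit_one):
    """Attempt COMMIT PREPARED for each gid; keep going on error and
    report which ones failed, since earlier ones may already be durable
    and cannot be rolled back."""
    failed = []
    for gid in gids:
        try:
            commit_one(gid)
        except Exception:
            failed.append(gid)
    return failed

# Example: suppose 'prep_2' has a corrupted state file and raises.
def commit_one(gid):
    if gid == "prep_2":
        raise RuntimeError("corrupted 2PC state file")

print(bulk_commit_prepared(["prep_1", "prep_2", "prep_3"], commit_one))
# prints ['prep_2']; 'prep_1' and 'prep_3' are already committed
```

This sidesteps the all-or-nothing error reporting Sawada-san worried about,
at the cost of pushing partial-failure handling onto the caller.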
On 2021/07/09 22:44, Masahiko Sawada wrote:
> On Fri, Jul 9, 2021 at 3:26 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>> As far as I read the code, keep using old API for foreign subtransaction doesn't
>> cause any actual bug. But it's just strange and half-baked to manage top and
>> sub transaction in the different layer and to use old and new API for them.
>
> That's a valid concern. I'm really not sure what we should do here but
> I guess that even if we want to support subtransactions we would have another
> API dedicated to subtransaction commit and rollback.

Ok, so if possible I will write a POC patch for a new API for foreign
subtransactions and consider whether it's simple enough that we can commit it
into core or not.

+#define FDWXACT_FLAG_PARALLEL_WORKER 0x02 /* is parallel worker? */

This implies that parallel workers may execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED on the foreign server for atomic commit? If so, what
happens if the PREPARE TRANSACTION that one of the parallel workers issues
fails? In that case, not only that parallel worker but also the other parallel
workers and the leader should roll back the transaction entirely. That is,
they should issue ROLLBACK PREPARED to the foreign servers. Was this issue
already handled and addressed in the patches?

This seems not to be an actual issue if only postgres_fdw is used, because
postgres_fdw doesn't have the IsForeignScanParallelSafe API. Right? But what
about other FDWs?

Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
RE: Transactions involving multiple postgres foreign servers, take 2
Hi Sawada-san,

I noticed that this thread and its set of patches have been marked "Returned
with Feedback" by yourself. I find the feature (atomic commit for foreign
transactions) very useful, and it would pave the road for having distributed
transaction management in Postgres. Although we have not arrived at a
consensus on which approach is best, there were significant reviews and major
patch changes in the past two years. By any chance, do you have any plans to
continue this from where you left off?

Regards,
Kirk Jamison
Hi,

On Tue, Oct 5, 2021 at 9:56 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> Hi Sawada-san,
>
> I noticed that this thread and its set of patches have been marked with
> "Returned with Feedback" by yourself. I find the feature (atomic commit for
> foreign transactions) very useful and it will pave the road for having a
> distributed transaction management in Postgres. Although we have not arrived
> at consensus at which approach is best, there were significant reviews and
> major patch changes in the past 2 years. By any chance, do you have any plans
> to continue this from where you left off?

As I could not reply to the review comments from Fujii-san for almost three
months, I don't have enough time to move this project forward, at least for
now. That's why I marked this patch as RWF. I'd like to continue working on
this project in my spare time, but I know this is not a project that can be
completed using only my spare time. If someone wants to work on this project,
I'd appreciate it and am happy to help.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On 2021/10/05 10:38, Masahiko Sawada wrote:
> Hi,
>
> On Tue, Oct 5, 2021 at 9:56 AM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
>>
>> Hi Sawada-san,
>>
>> I noticed that this thread and its set of patches have been marked with
>> "Returned with Feedback" by yourself. I find the feature (atomic commit for
>> foreign transactions) very useful and it will pave the road for having a
>> distributed transaction management in Postgres. Although we have not arrived
>> at consensus at which approach is best, there were significant reviews and
>> major patch changes in the past 2 years. By any chance, do you have any
>> plans to continue this from where you left off?
>
> As I could not reply to the review comments from Fujii-san for almost
> three months, I don't have enough time to move this project forward at
> least for now. That's why I marked this patch as RWF. I'd like to
> continue working on this project in my spare time but I know this is
> not a project that can be completed by using only my spare time. If
> someone wants to work on this project, I'd appreciate it and am happy
> to help.

Probably it's time to rethink the approach. The patch introduces a foreign
transaction manager into PostgreSQL core, but as far as I review the patch,
its changes look like overkill and too complicated. This seems to be one of
the reasons why we have not yet been able to commit the feature even after
several years.

Another concern about the approach of the patch is that it needs to change a
backend so that it additionally waits for replication during the commit phase
before executing PREPARE TRANSACTION on the foreign servers, which would
decrease the performance during the commit phase even further.

So I wonder if it's worth revisiting the original approach, i.e., adding the
atomic commit into postgres_fdw. One disadvantage of this is that it supports
atomic commit only between foreign PostgreSQL servers, not other data
resources like MySQL.
But I'm not sure if we really want to do atomic commit between various FDWs. Maybe supporting only postgres_fdw is enough for most users. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Hi Fujii-san and Sawada-san,

Thank you very much for your replies.

> >> I noticed that this thread and its set of patches have been marked with "Returned with Feedback" by yourself.
> >> I find the feature (atomic commit for foreign transactions) very
> >> useful and it will pave the road for having a distributed transaction management in Postgres.
> >> Although we have not arrived at consensus at which approach is best,
> >> there were significant reviews and major patch changes in the past 2 years.
> >> By any chance, do you have any plans to continue this from where you left off?
> >
> > As I could not reply to the review comments from Fujii-san for almost
> > three months, I don't have enough time to move this project forward at
> > least for now. That's why I marked this patch as RWF. I'd like to
> > continue working on this project in my spare time but I know this is
> > not a project that can be completed by using only my spare time. If
> > someone wants to work on this project, I'd appreciate it and am happy
> > to help.
>
> Probably it's time to rethink the approach. The patch introduces foreign
> transaction manager into PostgreSQL core, but as far as I review the patch, its
> changes look overkill and too complicated.
> This seems one of reasons why we could not have yet committed the feature even
> after several years.
>
> Another concern about the approach of the patch is that it needs to change a
> backend so that it additionally waits for replication during commit phase before
> executing PREPARE TRANSACTION to foreign servers. Which would decrease the
> performance during commit phase furthermore.
>
> So I wonder if it's worth revisiting the original approach, i.e., add the atomic
> commit into postgres_fdw. One disadvantage of this is that it supports atomic
> commit only between foreign PostgreSQL servers, not other various data
> resources like MySQL.
> But I'm not sure if we really want to do atomic commit between various FDWs.
> Maybe supporting only postgres_fdw is enough for most users. Thought?

The intention of Sawada-san's patch is ambitious, although it would be very helpful because it accommodates possible future support of atomic commit for various types of FDWs. However, it has been difficult to reach agreement on it, as other reviewers have also pointed out the commit performance. Another point is how it should work when we also implement atomic visibility (which is another topic for distributed transactions, but worth considering).

That said, if we're going to initially support it in postgres_fdw, which is simpler than the latest patches, we need to ensure that abnormalities and errors are properly handled, and prove that commit performance can be improved, e.g., by committing not serially but in parallel. And, although not necessary for the first step, it may put the other reviewers at ease if we can also sketch how atomic visibility could be implemented on top of postgres_fdw.

Thoughts?

Regards,
Kirk Jamison
Hi,

On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> That said, if we're going to initially support it on postgres_fdw, which is simpler
> than the latest patches, we need to ensure that abnormalities and errors
> are properly handled and prove that commit performance can be improved,
> e.g. if we can commit not in serial but also possible in parallel.

If it's OK with you, I'd like to work on the performance issue. What I have in mind is to commit all remote transactions in parallel instead of sequentially in the postgres_fdw transaction callback, as mentioned above; I think that would improve performance even for the one-phase commit that we already have. Maybe I'm missing something, though.

Best regards,
Etsuro Fujita
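[Editorial note] The parallel-commit idea above can be sketched as follows. This is only an illustration of the concurrency pattern, not postgres_fdw code: the real change would live in the C transaction callback, and the `FakeConn`-style handle with a blocking `commit()` method is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def commit_remote_transactions(connections):
    """Issue COMMIT on every remote connection concurrently rather than
    one after another, so total commit latency approaches that of the
    slowest single server instead of the sum over all servers.

    `connections` is a list of hypothetical handles exposing a blocking
    commit() method.  Failures are collected instead of raised
    immediately so that every server is attempted.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max(len(connections), 1)) as pool:
        futures = {pool.submit(conn.commit): conn for conn in connections}
        for future, conn in futures.items():
            try:
                future.result()  # wait for this server's COMMIT to finish
            except Exception as exc:
                failures.append((conn, exc))
    return failures
```

Note that the same pattern would help even without two-phase commit: the existing one-phase commits in the callback are sequential today, which is exactly the point Fujita-san makes above.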
On 2021/10/07 19:47, Etsuro Fujita wrote:
> Hi,
>
> On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
>> That said, if we're going to initially support it on postgres_fdw, which is simpler
>> than the latest patches, we need to ensure that abnormalities and errors
>> are properly handled

Yes. One idea for this is to include the information required to resolve outstanding prepared transactions in the transaction identifier that the PREPARE TRANSACTION command uses. For example, we can use the XID of the local transaction and the cluster ID of the local server (e.g., a cluster_name that users specify uniquely) as that information. If the cluster_name of the local server is "server1" and its XID is now 9999, postgres_fdw issues "PREPARE TRANSACTION 'server1_9999'" and "COMMIT PREPARED 'server1_9999'" to the foreign servers, to end those foreign transactions in a two-phase way.

If some trouble happens, the prepared transaction with "server1_9999" may remain unexpectedly on one foreign server. In this case we can determine whether to commit or roll back that outstanding transaction by checking whether the past transaction with XID 9999 was committed or rolled back on the server "server1". If it was committed, the prepared transaction should also be committed, so we should execute "COMMIT PREPARED 'server1_9999'". If it was rolled back, the prepared transaction should also be rolled back. If it's still in progress, we should do nothing for that transaction.

pg_xact_status() can be used to check whether the transaction with the specified XID was committed or rolled back. But pg_xact_status() can return an invalid result if the CLOG data for the specified XID has been truncated by VACUUM FREEZE. To handle this case, we might need a special table tracking the transaction status.

A DBA can use the above procedure to manually resolve the outstanding prepared transactions on foreign servers. We can probably also implement a function that performs the procedure. If so, it might be a good idea to have a background worker or cron execute the function periodically.

>> and prove that commit performance can be improved,
>> e.g. if we can commit not in serial but also possible in parallel.
>
> If it's ok with you, I'd like to work on the performance issue. What
> I have in mind is commit all remote transactions in parallel instead
> of sequentially in the postgres_fdw transaction callback, as mentioned
> above, but I think that would improve the performance even for
> one-phase commit that we already have.

+100

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
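[Editorial note] The naming scheme and the resolution rule Fujii-san describes above can be sketched as follows. The GID format ("server1_9999") and the three possible statuses mirror the description and pg_xact_status()'s return values ('committed', 'aborted', 'in progress'); the function names and the `local_xact_status` callback are invented for this illustration and are not part of any patch.

```python
def make_gid(cluster_name, xid):
    """Build the identifier for PREPARE TRANSACTION / COMMIT PREPARED,
    encoding the originating cluster name and local XID so a leftover
    prepared transaction can later be traced back to its origin."""
    return f"{cluster_name}_{xid}"

def resolve_outstanding(gid, local_xact_status):
    """Decide what to do with a leftover prepared transaction.

    `local_xact_status(xid)` stands in for running pg_xact_status() on
    the originating server; it returns 'committed', 'aborted', or
    'in progress'.  Returns the SQL to run on the foreign server, or
    None if the local transaction is still in progress and the prepared
    transaction must be left alone.
    """
    # The cluster name tells us *which* server to ask; the XID tells us
    # *which* local transaction's outcome is authoritative.
    cluster_name, xid_str = gid.rsplit("_", 1)
    status = local_xact_status(int(xid_str))
    if status == "committed":
        return f"COMMIT PREPARED '{gid}'"
    if status == "aborted":
        return f"ROLLBACK PREPARED '{gid}'"
    return None
```

As noted above, this only works while the CLOG for the encoded XID survives; once VACUUM FREEZE truncates it, the lookup would have to fall back to a dedicated status-tracking table.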
Fujii-san,

On Thu, Oct 7, 2021 at 11:37 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2021/10/07 19:47, Etsuro Fujita wrote:
> > On Thu, Oct 7, 2021 at 1:29 PM k.jamison@fujitsu.com
> > <k.jamison@fujitsu.com> wrote:
> >> and prove that commit performance can be improved,
> >> e.g. if we can commit not in serial but also possible in parallel.
> >
> > If it's ok with you, I'd like to work on the performance issue. What
> > I have in mind is commit all remote transactions in parallel instead
> > of sequentially in the postgres_fdw transaction callback, as mentioned
> > above, but I think that would improve the performance even for
> > one-phase commit that we already have.
>
> +100

I've started working on this. Once I have a (POC) patch, I'll post it in a new thread, as I think it can be discussed separately. Thanks!

Best regards,
Etsuro Fujita