RE: Transactions involving multiple postgres foreign servers, take 2 - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject RE: Transactions involving multiple postgres foreign servers, take 2
Date
Msg-id TYCPR01MB6400482660931FED1ADF040CEF199@TYCPR01MB6400.jpnprd01.prod.outlook.com
In response to Re: Transactions involving multiple postgres foreign servers, take 2  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Wed, June 30, 2021 10:06 (GMT+9), Masahiko Sawada wrote:
> I've attached the new version patch that incorporates the comments from
> Fujii-san and Ikeda-san I got so far. We launch a resolver process per foreign
> server, committing prepared foreign transactions on foreign servers in parallel.

Hi Sawada-san,
Thank you for the latest set of patches.
I've noticed from cfbot that the regression test failed, and I also could not compile it.

============== running regression test queries        ==============
test test_fdwxact                 ... FAILED       21 ms
============== shutting down postmaster               ==============
======================
 1 of 1 tests failed. 
======================

> To get a better performance based on the current architecture, we can have
> multiple resolver processes per foreign server but it seems not easy to tune it
> in practice. Perhaps is it better if we simply have a pool of resolver processes
> and we assign a resolver process to the resolution of one distributed
> transaction one by one? That way, we need to launch resolver processes as
> many as the concurrent backends using 2PC.

Yes, finding the right values for max_foreign_prepared_transactions and
max_prepared_transactions seems difficult. If we set the number of resolver
processes to the number of concurrent backends using 2PC, how do we determine
the value of max_foreign_transaction_resolvers? It might be good to collect
some statistics to judge the value, and then we can compare the performance
with the V37 version.

-
Also, this is a bit of a side topic. I know we've been discussing how to
improve/fix the resolver process bottlenecks, and Takahashi-san provided
details upthread on where V37 has problems. (I am joining the testing too.)

I am not sure if this has been brought up before, since the thread spans
several years. But I think we need to consider how to prevent the resolver
process from waiting in an infinite loop while resolving a prepared foreign
transaction. Currently, when a crashed foreign server is recovered during
resolution retries, the information is recovered from WAL and files,
and the resolver process resumes the foreign transaction resolution.
However, what if we cannot (or intentionally do not want to) recover the
crashed server for a long time?

One idea is to make the resolver process stop automatically after some
maximum number of retries.
We could call the parameter foreign_transaction_resolution_max_retry_count.
There may be a better name, but I followed the pattern from your patch.
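
To make the idea concrete, here is a rough sketch of the bounded retry loop
I have in mind. It is only an illustration in Python, not code from the patch
set: resolve_with_retry_limit() and try_resolve() are hypothetical names, and
the real resolver would of course be written in C inside the patch.

```python
import time


def resolve_with_retry_limit(try_resolve, max_retry_count, retry_interval_s,
                             sleep=time.sleep):
    """Call try_resolve() until it succeeds or the retries are exhausted.

    try_resolve is a hypothetical callable that returns True once the
    prepared foreign transaction has been resolved. Returns True on
    success, False if the resolver gave up.
    """
    # One initial attempt plus max_retry_count retries.
    for attempt in range(max_retry_count + 1):
        if try_resolve():
            return True
        # Wait before the next retry, but not after the final attempt.
        if attempt < max_retry_count:
            sleep(retry_interval_s)
    return False


# Simulate a foreign server that never comes back, without really sleeping.
attempts = []
sleeps = []


def always_fails():
    attempts.append(1)
    return False


resolved = resolve_with_retry_limit(always_fails, max_retry_count=3,
                                    retry_interval_s=10,
                                    sleep=sleeps.append)
print(resolved, len(attempts), sleeps)
```

With a retry count of 3, the sketch makes four attempts in total and waits
three times in between, after which the resolver would stop.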

The server downtime can be estimated by also considering the parameter
foreign_transaction_resolution_retry_interval (default 10s) from the
patch set.
In addition, according to the docs, "a foreign server using the postgres_fdw
foreign data wrapper can have the same options that libpq accepts in
connection strings", so the connect_timeout set during CREATE SERVER can
also affect it.

Example:
    CREATE SERVER's connect_timeout setting = 5s
    foreign_transaction_resolution_retry_interval = 10s
    foreign_transaction_resolution_max_retry_count = 3

    Estimated total time before resolver stops:
    = connect_timeout * (max_retry_count + 1) + retry_interval * max_retry_count
    = (5s) * (3 + 1) + (10s) * (3) = 50s

00s: 1st connect start
05s: 1st connect timeout
(retry interval)
15s: 2nd connect start (1st retry)
20s: 2nd connect timeout
(retry interval)
30s: 3rd connect start (2nd retry)
35s: 3rd connect timeout
(retry interval)
45s: 4th connect start (3rd retry)
50s: 4th connect timeout
(resolver process stops)
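
The same arithmetic can be written as a small script. This is just a sketch
for checking the numbers; estimate_resolver_stop_time() is a made-up helper,
and it assumes every connection attempt runs into the full connect_timeout
and that there is no wait after the last attempt.

```python
def estimate_resolver_stop_time(connect_timeout_s, retry_interval_s,
                                max_retry_count):
    """Return (total_seconds, timeline) for a server that never responds.

    timeline is a list of (seconds_since_start, event) tuples.
    """
    total_attempts = max_retry_count + 1  # first attempt plus the retries
    timeline = []
    now = 0
    for attempt in range(1, total_attempts + 1):
        timeline.append((now, f"connect attempt {attempt} starts"))
        now += connect_timeout_s
        timeline.append((now, f"connect attempt {attempt} times out"))
        # The retry interval only applies between attempts.
        if attempt < total_attempts:
            now += retry_interval_s
    return now, timeline


total, timeline = estimate_resolver_stop_time(connect_timeout_s=5,
                                              retry_interval_s=10,
                                              max_retry_count=3)
for seconds, event in timeline:
    print(f"{seconds:02d}s: {event}")
print(f"resolver stops after {total}s")
```

For the settings in the example this gives 50 seconds, matching the timeline
above.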

With this, the resolver process will not wait indefinitely and will stop
after some time, depending on the settings of the above parameters.
This could be an automatic equivalent of pg_stop_foreign_xact_resolver.
If the resolver has stopped and we later decide to restore the crashed
server, the user can then execute pg_resolve_foreign_xact().
Do you think the idea is feasible, and can we add it as part of the patch set?

Regards,
Kirk Jamison
