Re: Transactions involving multiple postgres foreign servers, take 2 - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Transactions involving multiple postgres foreign servers, take 2
Date
Msg-id CA+fd4k5-2fKKnOgmcm42XWE6mtkrQ2DNtub6T-5oTmiJmmBmNg@mail.gmail.com
Whole thread Raw
In response to Re: Transactions involving multiple postgres foreign servers, take 2  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Fri, 12 Jun 2020 at 19:24, Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Thank you for your reviews on 0003 patch. I've incorporated your
comments. I'll submit the latest version patch later as the design or
scope might change as a result of the discussion.

>
> Few more comments on v22-0003-Documentation-update
> --------------------------------------------------------------------------------------
> 1.
> +          When <literal>disabled</literal> there can be risk of database
> +          consistency among all servers that involved in the distributed
> +          transaction when some foreign server crashes during committing the
> +          distributed transaction.
>
> Will it read better if rephrase above to something like: "When
> <literal>disabled</literal> there can be a risk of database
> consistency if one or more foreign servers crashes while committing
> the distributed transaction."?

Fixed.

>
> 2.
> +      <varlistentry
> id="guc-foreign-transaction-resolution-rety-interval"
> xreflabel="foreign_transaction_resolution_retry_interval">
> +       <term><varname>foreign_transaction_resolution_retry_interval</varname>
> (<type>integer</type>)
> +        <indexterm>
> +         <primary><varname>foreign_transaction_resolution_interval</varname>
> configuration parameter</primary>
> +        </indexterm>
> +       </term>
> +       <listitem>
> +        <para>
> +         Specify how long the foreign transaction resolver should
> wait when the last resolution
> +         fails before retrying to resolve foreign transaction. This
> parameter can only be set in the
> +         <filename>postgresql.conf</filename> file or on the server
> command line.
> +        </para>
> +        <para>
> +         The default value is 10 seconds.
> +        </para>
> +       </listitem>
> +      </varlistentry>
>
> Typo.  <varlistentry
> id="guc-foreign-transaction-resolution-rety-interval", spelling of
> retry is wrong.  Do we really need such a guc parameter?  I think we
> can come up with some simple algorithm to retry after a few seconds
> and then increase that interval of retry if we fail again or something
> like that.  I don't know how users can come up with some non-default
> value for this variable.

For example, in a low-reliable network environment, setting lower
value would help to minimize the backend wait time in case of
connection lost. But I also agree with your point. In terms of
implementation, having backends wait for the fixed time is more simple
but we can do such incremental interval by remembering the retry count
for each foreign transaction.

An open question regarding retrying foreign transaction resolution is
how we process the case where an involved foreign server is down for a
very long. If an online transaction is waiting to be resolved, there
is no way to exit from the wait loop other than either the user sends
a cancel request or the crashed server is restored. But if the foreign
server has to be down for a long time, I think it’s not practical to
send a cancel request because the client would need something like a
timeout mechanism. So I think it might be better to provide a way to
cancel the waiting without the user sending a cancel. For example,
having a timeout or having the limit of the retry count. If an
in-doubt transaction is waiting to be resolved, we keep trying to
resolve the foreign transaction at an interval. But I wonder if the
user might want to disable the automatic in-doubt foreign transaction
in some cases, for example, where the user knows the crashed server
will not be restored for a long time. I’m thinking that we can provide
a way to disable automatic foreign transaction resolution or disable
it for the particular foreign transaction.

>
> 3
> +      <varlistentry id="guc-foreign-transaction-resolver-timeout"
> xreflabel="foreign_transaction_resolver_timeout">
> +       <term><varname>foreign_transaction_resolver_timeout</varname>
> (<type>integer</type>)
> +        <indexterm>
> +         <primary><varname>foreign_transaction_resolver_timeout</varname>
> configuration parameter</primary>
> +        </indexterm>
> +       </term>
> +       <listitem>
> +        <para>
> +         Terminate foreign transaction resolver processes that don't
> have any foreign
> +         transactions to resolve longer than the specified number of
> milliseconds.
> +         A value of zero disables the timeout mechanism, meaning it
> connects to one
> +         database until stopping manually.
>
> Can we mention the function name using which one can stop the resolver process?

Fixed.

>
> 4.
> +   Using the <productname>PostgreSQL</productname>'s atomic commit ensures that
> +   all changes on foreign servers end in either commit or rollback using the
> +   transaction callback routines
>
> Can we slightly rephase this "Using the PostgreSQL's atomic commit
> ensures that all the changes on foreign servers are either committed
> or rolled back using the transaction callback routines"?

Fixed.

>
> 5.
> +       Prepare all transactions on foreign servers.
> +       <productname>PostgreSQL</productname> distributed transaction manager
> +       prepares all transaction on the foreign servers if two-phase commit is
> +       required. Two-phase commit is required when the transaction modifies
> +       data on two or more servers including the local server itself and
> +       <xref linkend="guc-foreign-twophase-commit"/> is
> +       <literal>required</literal>.
>
> /PostgreSQL/PostgreSQL's.

Fixed.

>
>  If all preparations on foreign servers got
> +       successful go to the next step.
>
> How about "If the prepare on all foreign servers is successful then go
> to the next step"?

Fixed.

>
>  Any failure happens in this step,
> +       the server changes to rollback, then rollback all transactions on both
> +       local and foreign servers.
>
> Can we rephrase this line to something like: "If there is any failure
> in the prepare phase, the server will rollback all the transactions on
> both local and foreign servers."?

Fixed.

>
> What if the issued Rollback also failed, say due to network breakdown
> between local and one of foreign servers?  Shouldn't such a
> transaction be 'in-doubt' state?

Rollback API to rollback transaction in one-phase can be called
recursively. So FDWs have to tolerate recursive calling.

In the current patch, all transaction operations are performed
synchronously. That is, foreign transaction never becomes in-doubt
state without explicit cancel by the user or the local node crash.
That way, subsequent transactions can assume that precedent
distributed transactions are already resolved unless the user
canceled.

Let me explain the details:

If the transaction turns rollback due to failure before the local
commit, we attempt to do both ROLLBACK and ROLLBACK PREPARED against
foreign transactions whose status is PREPARING. That is, we end the
foreign transactions by doing ROLLBACK. And since we're not sure
preparation has been completed on the foreign server the backend asks
the resolver process for doing ROLLBACK PREPARED on the foreign
servers. Therefore FDWs have to tolerate OBJECT_NOT_FOUND error in
abort case. Since the backend process returns an acknowledgment to the
client only after rolling back all foreign transactions, these foreign
transactional don't remain as in-doubt state.

If rolling back failed after the local commit (i.g., the client does
ROLLBACK and the resolver failed to do ROLLBACK PREPARED), a resolver
process will relaunch and retry to do ROLLBACK PREPARED. The backend
process waits until ROLLBACK PREPARED is successfully done or the user
cancels. So the foreign transactions don't become in-doubt
transactions.

Synchronousness is also an open question. If we want to support atomic
commit in an asynchronous manner it might be better to implement it
first in terms of complexity. The backend returns an acknowledgment to
the client immediately after asking the resolver process. It’s known
as the early acknowledgment technique. The downside is that the user
who wants to see the result of precedent transaction needs to make
sure the precedent transaction is committed on all foreign servers. We
will also need to think about how to control it by GUC parameter when
we have synchronous distributed transaction commit. Perhaps it’s
better to control it independent of synchronous replication.

>
> 6.
> +      <para>
> +       Commit locally. The server commits transaction locally.  Any
> failure happens
> +       in this step the server changes to rollback, then rollback all
> transactions
> +       on both local and foreign servers.
> +      </para>
> +     </listitem>
> +     <listitem>
> +      <para>
> +       Resolve all prepared transaction on foreign servers. Pprepared
> transactions
> +       are committed or rolled back according to the result of the
> local transaction.
> +       This step is normally performed by a foreign transaction
> resolver process.
> +      </para>
>
> When (in which step) do we commit on foreign servers?  Do Resolver
> processes commit on foreign servers, if so, how can we commit locally
> without committing on foreign servers, what if the commit on one of
> the servers fails? It is not very clear to me from the steps mentioned
> here?

In case 2pc is required, we commit transactions on foreign servers at
the final step by the resolver process. If the committing a prepared
transaction on one of the servers fails, a resolver process relaunches
after an interval and retry to commit.

In case 2pc is not required, we commit transactions on foreign servers
at pre-commit phase by the backend.

> Typo, /Pprepared/Prepared

Fixed.

>
> 7.
> However, foreign transactions
> +    become <firstterm>in-doubt</firstterm> in three cases: where the foreign
> +    server crashed or lost the connectibility to it during preparing foreign
> +    transaction, where the local node crashed during either preparing or
> +    resolving foreign transaction and where user canceled the query.
>
> Here the three cases are not very clear.  You might want to use (a)
> ..., (b) .. ,(c)..

Fixed. I change it to itemizedlist.

> Also, I think the state will be in-doubt even when
> we lost connection to server during commit or rollback.

Let me correct the cases of the foreign transactions remain as
in-doubt state. There are two cases:

* The local node crashed
* The user canceled the transaction commit or rollback.

Even when we lost connection to the server during commit or rollback
prepared transaction, a backend doesn’t return an acknowledgment to
the client until either transaction is successfully resolved, the user
cancels the transaction, or the local node crashes.

>
> 8.
> +    One foreign transaction resolver is responsible for transaction resolutions
> +    on which one database connecting.
>
> Can we rephrase it to: "One foreign transaction resolver is
> responsible for transaction resolutions on the database to which it is
> connected."?

Fixed.

>
> 9.
> +    Note that other <productname>PostgreSQL</productname> feature
> such as parallel
> +    queries, logical replication, etc., also take worker slots from
> +    <varname>max_worker_processes</varname>.
>
> /feature/features

Fixed.

>
> 10.
> +   <para>
> +    Atomic commit requires several configuration options to be set.
> +    On the local node, <xref
> linkend="guc-max-prepared-foreign-transactions"/> and
> +    <xref linkend="guc-max-foreign-transaction-resolvers"/> must be
> non-zero value.
> +    Additionally the <varname>max_worker_processes</varname> may need
> to be adjusted to
> +    accommodate for foreign transaction resolver workers, at least
> +    (<varname>max_foreign_transaction_resolvers</varname> +
> <literal>1</literal>).
> +    Note that other <productname>PostgreSQL</productname> feature
> such as parallel
> +    queries, logical replication, etc., also take worker slots from
> +    <varname>max_worker_processes</varname>.
> +   </para>
>
> Don't we need to mention foreign_twophase_commit GUC here?

Fixed.

>
> 11.
> +   <sect2 id="fdw-callbacks-transaction-managements">
> +    <title>FDW Routines For Transaction Managements</title>
>
> Managements/Management?

Fixed.

>
> 12.
> +     Transaction management callbacks are used for doing commit, rollback and
> +     prepare the foreign transaction.
>
> Lets write the above sentence as: "Transaction management callbacks
> are used to commit, rollback and prepare the foreign transaction."

Fixed.

>
> 13.
> +    <para>
> +     Transaction management callbacks are used for doing commit, rollback and
> +     prepare the foreign transaction. If an FDW wishes that its foreign
> +     transaction is managed by <productname>PostgreSQL</productname>'s global
> +     transaction manager it must provide both
> +     <function>CommitForeignTransaction</function> and
> +     <function>RollbackForeignTransaction</function>. In addition, if an FDW
> +     wishes to support <firstterm>atomic commit</firstterm> (as described in
> +     <xref linkend="fdw-transaction-managements"/>), it must provide
> +     <function>PrepareForeignTransaction</function> as well and can provide
> +     <function>GetPrepareId</function> callback optionally.
> +    </para>
>
> What exact functionality a FDW can accomplish if it just supports
> CommitForeignTransaction and RollbackForeignTransaction?  It seems it
> doesn't care for 2PC, if so, is there any special functionality we can
> achieve with this which we can't do without these APIs?

There is no special functionality even if an FDW implements
CommitForeignTrasnaction and RollbackForeignTransaction. Currently,
since there is no transaction API in FDW APIs, FDW developer has to
use XactCallback to control transactions but there is no
documentation. The idea of allowing an FDW to support only
CommitForeignTrasnaction and RollbackForeignTransaction is that FDW
developers can implement transaction management easily. But in the
first patch, we also can disallow it to make the implementation
simple.

>
> 14.
> +PrepareForeignTransaction(FdwXactRslvState *frstate);
> +</programlisting>
> +    Prepare the transaction on the foreign server. This function is
> called at the
> +    pre-commit phase of the local transactions if foreign twophase commit is
> +    required. This function is used only for distribute transaction management
> +    (see <xref linkend="distributed-transaction"/>).
> +    </para>
>
> /distribute/distributed

Fixed.

>
> 15.
> +   <sect2 id="fdw-transaction-commit-rollback">
> +    <title>Commit And Rollback Single Foreign Transaction</title>
> +    <para>
> +     The FDW callback function <literal>CommitForeignTransaction</literal>
> +     and <literal>RollbackForeignTransaction</literal> can be used to commit
> +     and rollback the foreign transaction. During transaction commit, the core
> +     transaction manager calls
> <literal>CommitForeignTransaction</literal> function
> +     in the pre-commit phase and calls
> +     <literal>RollbackForeignTransaction</literal> function in the
> post-rollback
> +     phase.
> +    </para>
>
> There is no reasoning mentioned as to why CommitForeignTransaction has
> to be called in pre-commit phase and RollbackForeignTransaction in
> post-rollback phase?  Basically why one in pre phase and other in post
> phase?

Good point. This behavior just follows what postgres_fdw does. I'm not
sure the exact reason why postgres_fdw commit the transaction in
pre-commit phase but I guess the committing a foreign transaction is
likely to abort comparing to the local commit, it might be better to
do first.

>
> 16.
> +       <entry>
> +        <literal><function>pg_remove_foreign_xact(<parameter>transaction</parameter>
> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>,
> <parameter>userid</parameter> <type>oid</type>)</function></literal>
> +       </entry>
> +       <entry><type>void</type></entry>
> +       <entry>
> +        This function works the same as
> <function>pg_resolve_foreign_xact</function>
> +        except that this removes the foreign transcation entry
> without resolution.
> +       </entry>
>
> Can we write why and when such a function can be used?  Typo,
> /trasnaction/transaction

Fixed.

>
> 17.
> +     <row>
> +      <entry><literal>FdwXactResolutionLock</literal></entry>
> +      <entry>Waiting to read or update information of foreign trasnaction
> +       resolution.</entry>
> +     </row>
>
> /trasnaction/transaction

Fixed.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Infinities in type numeric
Next
From: Tom Lane
Date:
Subject: Re: snowball ASCII stemmer configuration