
From: Masahiko Sawada
Subject: Re: Global snapshots
Msg-id: CA+fd4k6HE8xLGEvqWzABEg8kkju5MxU+if7bf-md0_2pjzXp9Q@mail.gmail.com
In response to: Re: Global snapshots (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > > <a.lepikhov@postgrespro.ru> wrote:
> > >
> > > >    Also, can you let us know if this
> > > > > supports 2PC in some way and if so how is it different from what the
> > > > > other thread on the same topic [1] is trying to achieve?
> > > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > > 2PC machinery. For now, I wouldn't judge which approach is better.
> > > >
> > >
> >
> > Sorry for being late.
> >
>
> No problem, your summarization and comparison of both approaches are
> quite helpful.
>
> >
> > I studied this patch and did a simple comparison between this patch
> > (0002 patch) and my 2PC patch.
> >
> > In terms of atomic commit, the features that are not implemented in
> > this patch but in the 2PC patch are:
> >
> > * Crash safety.
> > * PREPARE TRANSACTION command support.
> > * Query cancellation while waiting for the commit.
> > * Automatic in-doubt transaction resolution.
> >
> > On the other hand, the feature that is implemented in this patch but
> > not in the 2PC patch is:
> >
> > * Executing PREPARE TRANSACTION (and other commands) in parallel
> >
> > When the 2PC patch was proposed, IIRC it was like this patch (0002
> > patch). I mean, it changed only postgres_fdw to support 2PC. But after
> > discussion, we changed the approach to have the core manage foreign
> > transactions for the sake of crash safety. From my perspective, this
> > patch has the minimum implementation of 2PC needed to make the global
> > snapshot feature work and is missing some features that are important
> > for crash-safe atomic commit. So I personally think that if we want
> > crash-safe atomic commit, we should consider how to integrate this
> > global snapshot feature with the 2PC patch rather than improving this
> > patch.
> >
>
> Okay, but isn't there some advantage to this approach (managing 2PC at
> the postgres_fdw level) as well, which is that any node will be capable
> of handling global transactions rather than doing them via a central
> coordinator?  I mean, any node can do writes or reads rather than
> probably routing them (at least writes) via a coordinator node.

The postgres server where the client started the transaction works as
the coordinator node. I think this is true for both this patch and
that 2PC patch. From the perspective of atomic commit, any node will
be capable of handling global transactions in both approaches.

>  Now, I
> agree that even if this advantage is there in the current approach, we
> can't lose the crash-safety aspect of the other approach.  Will you be
> able to summarize what the problem was w.r.t. crash-safety and how your
> patch has dealt with it?

Since this patch proceeds with 2PC without any logging, the foreign
transactions prepared on foreign servers are left behind without any
clues if the coordinator crashes during commit. Therefore, after
restart, the user needs to find and resolve the in-doubt foreign
transactions manually.

In that 2PC patch, the information about foreign transactions is
WAL-logged before PREPARE TRANSACTION is sent. So even if the
coordinator crashes after preparing some foreign transactions, the
prepared foreign transactions are recovered during crash recovery, and
then the transaction resolver resolves them automatically, or the user
can resolve them. The user doesn't need to check the other participant
nodes to resolve in-doubt foreign transactions. Also, since the foreign
transaction information is replicated to physical standbys, the new
master can take over resolving in-doubt transactions.
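
To illustrate the ordering, here is a rough sketch. This is not the
actual code in that patch; the types and helper functions below are
made up, only the ordering is the point:

/*
 * The foreign transaction entry reaches WAL before PREPARE TRANSACTION
 * is sent, so crash recovery can rebuild the list of possibly-prepared
 * foreign transactions.
 */
typedef struct FdwXactParticipant FdwXactParticipant;  /* opaque here */

extern void LogFdwXactEntry(FdwXactParticipant *p, const char *gid);
extern void SendPrepareTransaction(FdwXactParticipant *p, const char *gid);

static void
precommit_prepare_participants(FdwXactParticipant **participants, int n,
                               const char *gid)
{
    for (int i = 0; i < n; i++)
    {
        /* 1. Durably record this foreign transaction (WAL-logged). */
        LogFdwXactEntry(participants[i], gid);

        /* 2. Only then ask the foreign server to prepare. */
        SendPrepareTransaction(participants[i], gid);
    }

    /*
     * If we crash any time after step 1, recovery replays the logged
     * entries and the transaction resolver (or the user) later issues
     * COMMIT PREPARED or ROLLBACK PREPARED for each of them, without
     * having to inspect the participant servers by hand.
     */
}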

>
> > Looking at the commit procedure with this patch:
> >
> > When starting a new transaction on a foreign server, postgres_fdw
> > executes pg_global_snapshot_import() to import the global snapshot.
> > After some work, in the pre-commit phase we do:
> >
> > 1. generate global transaction id, say 'gid'
> > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > 3. prepare the global snapshot locally, if the local node is also
> > involved in the transaction
> > 4. execute pg_global_snapshot_prepare('gid') for all participants
> >
> > During steps 2 to 4, we calculate the maximum CSN from the CSNs
> > returned from each pg_global_snapshot_prepare() execution.
> >
> > 5. assign the global snapshot locally, if the local node is also
> > involved in the transaction
> > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> >
> > Then, we commit locally (i.e., mark the current transaction as
> > committed in clog).
> >
> > After that, in the post-commit phase, we execute COMMIT PREPARED 'gid'
> > on all participants.
> >
>
> As per my current understanding, the overall idea is as follows.  For
> global transactions, pg_global_snapshot_prepare('gid') will set the
> transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> at the node where that function is executed, it also returns the
> NodeCSN to the coordinator.  Then the coordinator (the current
> postgres_fdw node on which the write transaction is being executed)
> computes MaxCSN based on the return value (NodeCSN) of prepare
> (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> to each node.  Finally, when COMMIT PREPARED is issued for each node,
> that MaxCSN will be written to each node including the current node.
> So, with this idea, each node will have the same view of CSN value
> corresponding to any particular transaction.
>
> For snapshot management, the node which receives the query generates a
> CSN (CurrentCSN) and follows the simple rule that a tuple having an
> xid with a CSN less than CurrentCSN will be visible.  Now, it is
> possible that when we are examining a tuple, the CSN corresponding to
> the xid that wrote the tuple has the value INDOUBT, which indicates
> that the transaction is not yet committed on all nodes.  In that case
> we wait till we get the valid CSN value corresponding to the xid and
> then use it to check if the tuple is visible.
>
> Now, one thing to note here is that for global transactions we
> primarily rely on the CSN value corresponding to a transaction for its
> visibility, even though we still maintain the CLOG for local
> transaction status.
>
> Leaving aside the incomplete parts and or flaws of the current patch,
> does the above match the top-level idea of this patch?

I'm still studying this patch but your understanding seems right to me.
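
Just to make sure we are talking about the same sequence, here is a
rough sketch of the coordinator-side pre-commit as I read the 0002
patch. I'm using libpq directly only for illustration; the real code
lives in postgres_fdw, error handling, quoting of the gid, and the
local-node steps (3 and 5) are omitted, and I'm assuming the CSN comes
back from pg_global_snapshot_prepare() as a plain integer:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
global_precommit(PGconn **participants, int n, const char *gid)
{
    char        cmd[256];
    unsigned long long max_csn = 0;

    for (int i = 0; i < n; i++)
    {
        PGresult   *res;

        /* Step 2: prepare the remote transaction. */
        snprintf(cmd, sizeof(cmd), "PREPARE TRANSACTION '%s'", gid);
        PQclear(PQexec(participants[i], cmd));

        /* Step 4: get this participant's CSN and track the maximum. */
        snprintf(cmd, sizeof(cmd),
                 "SELECT pg_global_snapshot_prepare('%s')", gid);
        res = PQexec(participants[i], cmd);
        {
            unsigned long long csn =
                strtoull(PQgetvalue(res, 0, 0), NULL, 10);

            if (csn > max_csn)
                max_csn = csn;
        }
        PQclear(res);
    }

    /* Step 6: tell every participant the agreed (maximum) CSN. */
    for (int i = 0; i < n; i++)
    {
        snprintf(cmd, sizeof(cmd),
                 "SELECT pg_global_snapshot_assign('%s', %llu)",
                 gid, max_csn);
        PQclear(PQexec(participants[i], cmd));
    }

    /*
     * After this the local transaction is committed (marked committed
     * in clog), and in the post-commit phase COMMIT PREPARED 'gid' is
     * sent to all participants.
     */
}

And on the reading side, as you described, a tuple is visible when the
CSN of the transaction that wrote it is less than the snapshot's
CurrentCSN; if that CSN is still INDOUBT, the reader has to wait until
the final CSN is assigned.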

> I am not sure
> if my understanding of this patch at this stage is completely correct,
> or whether we want to follow the approach of this patch, but I think at
> least let's first be sure whether such a top-level idea can achieve
> what we want to do here.
>
> > Considering how to integrate this global snapshot feature with the 2PC
> > patch, what the 2PC patch at least needs to change is to allow the FDW
> > to store FDW-private data that is passed to subsequent FDW transaction
> > API calls. In the current 2PC patch, we call the Prepare API for each
> > participant server one by one, and the core passes only metadata such
> > as ForeignServer, UserMapping, and the global transaction identifier.
> > So it's not easy to calculate the maximum CSN across multiple
> > transaction API calls. I think we can change the 2PC patch to add a
> > void pointer to FdwXactRslvState, the struct passed from the core, in
> > order to store FDW-private data, which would be the maximum CSN in
> > this case. That way, at the first Prepare API call postgres_fdw
> > allocates the space and stores the CSN there. At subsequent Prepare
> > API calls it can update the maximum CSN, and it is then able to do
> > steps 3 to 6 when preparing the transaction on the last participant.
> > Another idea would be to change the 2PC patch so that the core passes
> > a bunch of participants grouped by FDW.
> >
>
> IIUC, with this the coordinator needs to communicate with the nodes
> twice at the prepare stage: once to prepare the transaction in each
> node and get the CSN from each node, and then to communicate the MaxCSN
> to each node?

Yes, I think so too.
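
To be a bit more concrete about the FdwXactRslvState idea above, what I
have in mind is something like the following sketch. All the field,
type, and function names here are made up (FdwXactRslvState itself is
from the 2PC patch), and I'm assuming the callback can somehow tell
that it is handling the last participant:

#include <stdint.h>
#include <stdlib.h>

/*
 * The real FdwXactRslvState carries ForeignServer, UserMapping, and the
 * global transaction identifier; the only point here is the new
 * fdw_private field that survives across the per-participant API calls.
 */
typedef struct FdwXactRslvState
{
    const char *servername;     /* stand-in for metadata the core passes */
    const char *gid;            /* global transaction identifier */
    void       *fdw_private;    /* NEW: FDW-private data kept across calls */
} FdwXactRslvState;

typedef struct GlobalCsnState
{
    uint64_t    max_csn;        /* largest CSN returned so far */
} GlobalCsnState;

/* hypothetical helpers standing in for the real postgres_fdw code */
extern uint64_t prepare_remote_xact(FdwXactRslvState *state);
extern void     assign_global_csn_everywhere(uint64_t max_csn);

/* postgres_fdw's Prepare callback, called once per participant (sketch). */
static void
postgresPrepareForeignTransaction(FdwXactRslvState *state, int is_last)
{
    GlobalCsnState *priv;
    uint64_t        csn;

    if (state->fdw_private == NULL)
        state->fdw_private = calloc(1, sizeof(GlobalCsnState));
    priv = (GlobalCsnState *) state->fdw_private;

    /* PREPARE TRANSACTION + pg_global_snapshot_prepare() on this server. */
    csn = prepare_remote_xact(state);
    if (csn > priv->max_csn)
        priv->max_csn = csn;

    /*
     * Once the last participant is prepared the maximum CSN is known, so
     * the remaining steps (assigning the global snapshot everywhere) can
     * run here.
     */
    if (is_last)
        assign_global_csn_everywhere(priv->max_csn);
}

The other idea I mentioned, having the core pass a bunch of participants
grouped by FDW, would avoid the need for such an is_last notion.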

> Also, we probably need an InDoubt CSN status at the prepare phase to
> make snapshots and global visibility work.

I think it depends on how the global CSN feature works.

For instance, in that 2PC patch, if the coordinator crashes while
preparing a foreign transaction, the global transaction manager
recovers it and regards it as "prepared" regardless of whether the
foreign transaction has actually been prepared, and it sends ROLLBACK
PREPARED after recovery completes. With the global CSN patch, as you
mentioned, at the prepare phase the coordinator needs to communicate
with the participants twice in addition to sending PREPARE TRANSACTION:
pg_global_snapshot_prepare() and pg_global_snapshot_assign().

If the global CSN patch needs different cleanup work depending on the
CSN status, we will need an InDoubt CSN status so that the global
transaction manager can distinguish between a foreign transaction that
has executed pg_global_snapshot_prepare() and one that has executed
pg_global_snapshot_assign().

On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
PREPARED in that case, I don't think we need an InDoubt CSN status;
there would be no difference between those foreign transactions from
the global transaction manager's perspective.
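
To put it another way, the question is whether the resolver has to
distinguish something like the following two states (hypothetical
names) when cleaning up after a coordinator crash:

/* Hypothetical CSN-related states of a prepared foreign transaction. */
typedef enum FdwXactCsnStatus
{
    FDWXACT_CSN_INDOUBT,        /* pg_global_snapshot_prepare() has run */
    FDWXACT_CSN_ASSIGNED        /* pg_global_snapshot_assign() has also run */
} FdwXactCsnStatus;

/*
 * If cleanup after a crash is the same for both (just send ROLLBACK
 * PREPARED), the resolver never needs to look at this status and we
 * don't have to track it; only if the two need different handling does
 * an InDoubt status have to be persisted.
 */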

As far as I read the patch, on failure postgres_fdw simply sends
ROLLBACK PREPARED to the participants, and there seems to be no
additional work beyond that, but I might be missing something.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


