Thread: Global snapshots

From: Stas Kelvich

Global snapshots
================

Proposed here is a set of patches that allows achieving proper snapshot
isolation semantics in the case of cross-node transactions. It provides
infrastructure to synchronize snapshots between different Postgres instances
and a way to atomically commit such transactions with respect to other
concurrent global and local transactions. Such global transactions can be
coordinated outside of Postgres by using the provided SQL functions, or through
postgres_fdw, which makes use of these functions on remote nodes transparently.


Background
----------

Several years ago an extensible transaction manager (XTM) patch [1,2,3] was
proposed, which allowed extensions to hook into and override
transaction-related functions like xid assignment, snapshot acquisition,
transaction tree commit/abort, etc. Two transaction management extensions were
also created based on that API: pg_dtm and pg_tsdtm [4].

The first one, pg_dtm, was inspired by Postgres-XL and represents a direct
generalization of Postgres MVCC to the cross-node scenario: there is a
standalone service (DTM) that maintains an xid counter and an array of running
backends across all nodes. Every node keeps an open connection to the DTM and
asks for xids and Postgres-style snapshots via network calls. While this
approach is reasonable for low transaction rates (which is common for
OLAP-like loads), it quickly becomes a bottleneck in the OLTP case.

The second one, pg_tsdtm, based on the Clock-SI paper [5], implements a
distributed snapshot synchronization protocol without the need for a central
service like the DTM. It makes use of CSN-based visibility and requires two
rounds of communication during transaction commit. Since a commit across
nodes, each of which can abort the transaction, is usually done via the 2PC
protocol, those two rounds of communication are needed anyway.

The current patch is a reimplementation of pg_tsdtm moved directly into core,
without using any API. The decision to drop the XTM API was based on two
considerations. First, the XTM API itself needed a huge pile of work to become
an actual API instead of a set of hooks over the current implementation of
MVCC, with extensions responsible for handling random static variables like
RecentGlobalXmin and friends, which other internals make use of. Second, in
all of our tests pg_tsdtm was better and faster, but that wasn't obvious at
the beginning of the work on XTM/DTM. So we decided it is better to focus on a
good implementation of the Clock-SI approach.


Algorithm
---------

Clock-SI is described in [5]; here I provide a small overview, which should be
enough to catch the idea. Assume that each node runs Commit Sequence Number
(CSN) based visibility: the database tracks one counter for each transaction
start (xid) and another for each transaction commit (csn). In such a setting,
a snapshot is just a single number -- a copy of the current CSN at the moment
the snapshot was taken. Visibility rules boil down to checking whether the
current tuple's CSN is less than our snapshot's csn. It is also worth
mentioning that for the last 5 years there has been an active proposal to
switch Postgres to CSN-based visibility [6].

Next, let's assume that the CSN is the current physical time on the node and
call it GlobalCSN. If the physical time on different nodes were perfectly
synchronized, then a snapshot exported on one node could be used on other
nodes. But unfortunately physical clocks are never perfectly in sync and can
drift, so that fact has to be taken into account. Also, there is no easy
notion of a lock or atomic operation in a distributed environment, so commit
atomicity across nodes with respect to concurrent snapshot acquisition has to
be handled somehow. Clock-SI addresses this in the following way:

1) To achieve commit atomicity across nodes, an intermediate step is
   introduced: first the running transaction is marked as InDoubt on all
   nodes, and only after that does each node commit it and stamp it with the
   given GlobalCSN. All readers who run into tuples of an InDoubt transaction
   must wait until it ends and then recheck visibility. The same technique
   was used in [6] to achieve atomicity of subtransaction tree commit.

2) When the coordinator marks a transaction as InDoubt on other nodes, it
   collects a ProposedGlobalCSN from each participant, which is the local
   time on that node. Next, it selects the maximal value of all
   ProposedGlobalCSNs and commits the transaction on all nodes with that
   maximal GlobalCSN, even if that value is greater than the current time on
   a given node due to clock drift. So the GlobalCSN for the given
   transaction will be the same on all nodes.

3) When a local transaction imports a foreign global snapshot with some
   GlobalCSN and the current time on this node is smaller than the incoming
   GlobalCSN, the transaction should wait until that GlobalCSN time comes on
   the local clock.

Rules 2) and 3) provide protection against clock drift. Paper [5] proves that
this is enough to guarantee proper Snapshot Isolation.
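To make the three rules concrete, here is a toy Python simulation of the
commit path (class and variable names are mine, not the patch's; clocks are
plain integers so that drift is explicit):

```python
# Toy model of the Clock-SI commit rules above. Not the patch's code;
# all names here are illustrative.

class Node:
    def __init__(self, clock):
        self.clock = clock          # local "physical time" (GlobalCSN source)
        self.state = {}             # gid -> 'InDoubt' or committed CSN

    def prepare(self, gid):
        """Rule 1: mark tx InDoubt; return ProposedGlobalCSN (local time)."""
        self.state[gid] = 'InDoubt'
        return self.clock

    def assign(self, gid, csn):
        """Rule 2: stamp tx with the coordinator-chosen GlobalCSN,
        even if it is ahead of our local clock due to drift."""
        self.state[gid] = csn

    def import_snapshot(self, csn):
        """Rule 3: wait until the foreign GlobalCSN time is reached locally."""
        if self.clock < csn:
            self.clock = csn        # simulate waiting for the clock to catch up
        return csn

# Two nodes with drifted clocks
a, b = Node(clock=100), Node(clock=95)
proposed = [n.prepare('tx1') for n in (a, b)]
commit_csn = max(proposed)          # rule 2: take the maximum proposal
for n in (a, b):
    n.assign('tx1', commit_csn)
assert a.state['tx1'] == b.state['tx1'] == 100   # same CSN everywhere
b.import_snapshot(commit_csn)       # rule 3: b waits until its clock reaches 100
assert b.clock >= commit_csn
```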


Implementation
--------------

The main design decision of this patch is to affect the performance of local
transactions as little as possible while providing a way to run global
transactions. The GUC variable track_global_snapshots enables/disables this
feature.

Patch 0001-GlobalCSNLog introduces a new SLRU instance that maps xids to
GlobalCSNs. The GlobalCSNLog code is pretty straightforward and more or less
copied from the SUBTRANS log, which is also not persistent. I also kept an eye
on the corresponding part of Heikki's original patch for CSNs in [6] and on
commit_ts.c.


Patch 0002-Global-snapshots provides the infrastructure for snapshot sync
functions and global commit functions. It consists of several parts, which
are enabled when the GUC track_global_snapshots is on:

* Each Postgres snapshot acquisition is accompanied by taking the current
  GlobalCSN under the same shared ProcArray lock.

* Each transaction commit also writes the current GlobalCSN into
  GlobalCSNLog. To avoid writes to the SLRU under the exclusive ProcArray
  lock (which would be a major hit on commit performance), a trick with an
  intermediate InDoubt state is used: before calling
  ProcArrayEndTransaction() the backend writes the InDoubt state to the
  SLRU; then inside ProcArrayEndTransaction(), under the ProcArray lock, the
  GlobalCSN is assigned; and after the lock is released the assigned
  GlobalCSN value is written to the GlobalCSNLog SLRU. This approach ensures
  that XIP-based and GlobalCSN-based snapshots are going to see the same
  subset of tuple versions without putting too much extra contention on the
  ProcArray lock.

* XidInMVCCSnapshot() can use both XIP-based and GlobalCSN-based snapshots.
  If the current transaction is a local one, an XIP-based check is performed
  first; if the tuple is visible, no further checks are done; if the xid is
  in-progress, we need to fetch its GlobalCSN from the SLRU and recheck it
  with GlobalCSN-based visibility rules, as it may be part of a global
  InDoubt transaction. So, for local transactions XidInMVCCSnapshot() will
  fetch an SLRU entry only for in-progress transactions. This can be
  optimized further: it is possible to store a flag in the snapshot
  indicating whether there were any active global transactions when it was
  taken; if there were none, the SLRU access in XidInMVCCSnapshot() can be
  skipped entirely. Otherwise, if the current backend has a snapshot that
  was imported from a foreign node, only GlobalCSN-based visibility rules
  are used (as we don't have any information about what the XIP array looked
  like when that GlobalCSN was taken).

* Import/export routines are provided for global snapshots. Export just
  returns the current snapshot's global_csn. Import, on the other hand, is
  more complicated. Since an imported GlobalCSN usually points into the
  past, we should hold back OldestXmin and ensure that tuple versions for
  the given GlobalCSN don't pass through the OldestXmin boundary. To achieve
  that, a mechanism like the one in SnapshotTooOld is created: on each
  snapshot creation the current oldestXmin is written to a sparse ring
  buffer which holds oldestXmin entries for the last
  global_snapshot_defer_time seconds. GetOldestXmin() delays its result to
  the oldest entry in this ring buffer. If we are asked to import a snapshot
  that is older than current_time - global_snapshot_defer_time, we just
  error out with a "global snapshot too old" message. Otherwise, we have
  enough info to set the proper xmin in our proc entry to defuse
  GetOldestXmin().

* The following routines for commit are provided:
  * pg_global_snaphot_prepare(gid) sets the InDoubt state for the given
    global tx and returns the proposed GlobalCSN.
  * pg_global_snaphot_assign(gid, global_csn) assigns the given global_csn
    to the given global tx. A subsequent COMMIT PREPARED will use it.

Import/export and commit routines are implemented as SQL functions. IMO it
would be better to turn them into IMPORT GLOBAL SNAPSHOT / EXPORT GLOBAL
SNAPSHOT / PREPARE GLOBAL / COMMIT GLOBAL PREPARED statements. But at this
stage, I don't want to clutter this patch with new grammar.
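The hybrid visibility check described above can be sketched roughly like this
(a simplified Python model; names and data structures are illustrative, not
the actual XidInMVCCSnapshot() code):

```python
# Toy model of the hybrid XIP + GlobalCSN visibility check. 'xip' stands in
# for the snapshot's in-progress xid array, and csn_log mimics the
# GlobalCSNLog SLRU (xid -> committed CSN or InDoubt marker).

INDOUBT = object()   # sentinel for the InDoubt state

def xid_visible(xid, snapshot_csn, xip, csn_log, imported=False):
    """Return True if tuples of `xid` are visible to our snapshot."""
    if not imported and xid not in xip:
        # Local snapshot and XIP says "committed before snapshot":
        # visible, no SLRU lookup needed.
        return True
    # In-progress per XIP (or an imported snapshot): consult the CSN log.
    csn = csn_log.get(xid)
    if csn is INDOUBT:
        # The real code waits for the InDoubt commit to finish and rechecks;
        # here we just flag the situation.
        raise RuntimeError("wait for InDoubt tx, then recheck")
    return csn is not None and csn < snapshot_csn

csn_log = {10: 90, 11: INDOUBT}
# xid 10 is in-progress per XIP but committed with CSN 90; our snapshot CSN
# is 100, so it is visible after the SLRU recheck.
assert xid_visible(10, 100, xip={10}, csn_log=csn_log)
# xid 12 never committed -> invisible under an imported (CSN-only) snapshot.
assert not xid_visible(12, 100, xip=set(), csn_log=csn_log, imported=True)
```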


Patch 0003-postgres_fdw-support-for-global-snapshots uses the previous
infrastructure to achieve isolation for the generated transactions. Right now
it is a minimalistic implementation of 2PC like the one in [7], but it
doesn't write anything about remote prepares to WAL on the coordinator and
doesn't handle any recovery. Its main purpose is to test things in the global
snapshot patch, as that is easier to do with TAP tests over several
postgres_fdw-connected instances. Tests are included.


Usage
-----

A distributed transaction can be coordinated by an external coordinator. In
that case a normal scenario would be the following:

1) Start a transaction on the first node, do some queries if needed, call
   pg_global_snaphot_export().

2) On the other node open a transaction and call
   pg_global_snaphot_import(global_csn). global_csn is the value from the
   previous export.

... do some useful work ...

3) Issue PREPARE TRANSACTION on all participants.
4) Call pg_global_snaphot_prepare(gid) on all participants, store the
   returned global_csn's, and select the maximum of them.
5) Call pg_global_snaphot_assign(gid, max_global_csn) on all participants.
6) Issue COMMIT PREPARED on all participants.

As was said earlier, steps 4) and 5) can be melded into 3) and 6)
respectively, but let's go for grammar changes only after (and if) there is
agreement on the overall concept.
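The whole sequence can be sketched as a coordinator loop. This is a mock in
Python that only records the statements a real coordinator would issue over
its connections (MockConn and its behaviour are my invention, not part of the
patch):

```python
# Mock of the 6-step coordinated commit above. With a real driver each
# conn.run(...) would execute the statement on that participant node.

class MockConn:
    def __init__(self, name, local_clock):
        self.name, self.local_clock, self.log = name, local_clock, []

    def run(self, sql):
        self.log.append(sql)
        if sql.startswith("SELECT pg_global_snaphot_prepare"):
            return self.local_clock        # node's ProposedGlobalCSN
        return None

def global_commit(conns, gid):
    for c in conns:                                    # step 3: 2PC prepare
        c.run(f"PREPARE TRANSACTION '{gid}'")
    proposed = [c.run(f"SELECT pg_global_snaphot_prepare('{gid}')")
                for c in conns]                        # step 4: collect CSNs
    max_csn = max(proposed)                            # maximum of proposals
    for c in conns:                                    # step 5: assign it
        c.run(f"SELECT pg_global_snaphot_assign('{gid}', {max_csn})")
    for c in conns:                                    # step 6: finish 2PC
        c.run(f"COMMIT PREPARED '{gid}'")
    return max_csn

conns = [MockConn("node1", 100), MockConn("node2", 103)]
assert global_commit(conns, "gid1") == 103             # max of proposed CSNs
assert conns[0].log[-1] == "COMMIT PREPARED 'gid1'"
```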

postgres_fdw in patch 003-GlobalSnapshotsForPostgresFdw does the same thing,
but transparently to the user.


Tests
-----

Each XidInMVCCSnapshot() call in this patch is coated with an assert that
checks that the same things are visible under the XIP-based check and the
GlobalCSN-based one.

Patch 003-GlobalSnapshotsForPostgresFdw includes numerous variants of a
banking test. It spins up several Postgres instances and several concurrent
pgbench runs which simulate cross-node bank account transfers, while the test
constantly checks that the total balance is correct. There are also tests
that an imported snapshot sees the same checksum of the data as it did during
import.

I think this feature also deserves a separate test module that wobbles the
clock time, but that isn't done yet.


Attributions
------------
The original XTM, pg_dtm, and pg_tsdtm were written by Konstantin Knizhnik,
Constantin Pan, and me. This version is mostly mine, with some input from
Konstantin Knizhnik and Arseny Sher.


[1] https://www.postgresql.org/message-id/flat/20160223164335.GA11285%40momjian.us
(originally thread was about fdw-based sharding, but later got "hijacked" by xtm)
[2] https://www.postgresql.org/message-id/flat/F2766B97-555D-424F-B29F-E0CA0F6D1D74%40postgrespro.ru
[3] https://www.postgresql.org/message-id/flat/56E3CAAE.6060407%40postgrespro.ru
[4] https://wiki.postgresql.org/wiki/DTM
[5] https://dl.acm.org/citation.cfm?id=2553434
[6] https://www.postgresql.org/message-id/flat/CA%2BCSw_tEpJ%3Dmd1zgxPkjH6CWDnTDft4gBi%3D%2BP9SnoC%2BWy3pKdA%40mail.gmail.com
[7] https://www.postgresql.org/message-id/flat/CAFjFpRfPoDAzf3x-fs86nDwJ4FAwn2cZ%2BxdmbdDPSChU-kt7%2BQ%40mail.gmail.com







Re: Global snapshots

From: Robert Haas

On Tue, May 1, 2018 at 12:27 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Here proposed a set of patches that allow achieving proper snapshot isolation
> semantics in the case of cross-node transactions. Provided infrastructure to
> synchronize snapshots between different Postgres instances and a way to
> atomically commit such transaction with respect to other concurrent global and
> local transactions. Such global transactions can be coordinated outside of
> Postgres by using provided SQL functions or through postgres_fdw, which make use
> of this functions on remote nodes transparently.

I'm concerned about the provisioning aspect of this problem.  Suppose
I have two existing database systems with, perhaps, wildly different
XID counters.  On a certain date, I want to start using this system.
Or conversely, I have two systems that are bonded together using this
system from the beginning, and then, as of a certain date, I want to
break them apart into two standalone systems.  In your proposed
design, are things like this possible?  Can you describe the setup
involved?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global snapshots

From: Stas Kelvich

> On 1 May 2018, at 22:43, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> I'm concerned about the provisioning aspect of this problem.  Suppose
> I have two existing database systems with, perhaps, wildly different
> XID counters.  On a certain date, I want to start using this system.

Yes, that's totally possible. On both systems you need to:

* set track_global_snapshots='on' -- this will start writing each
  transaction's commit sequence number to the SLRU.
* set global_snapshot_defer_time to 30 seconds, for example -- this will
  delay oldestXmin advancement for the specified amount of time, preserving
  old tuples.
* restart the database.
* optionally enable NTPd if it wasn't enabled.

It is also possible to avoid the restart, but that requires some careful
work: after enabling track_global_snapshots it becomes safe to start global
transactions only once all concurrently running transactions have finished.
A more or less equivalent thing happens during logical slot creation.
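For illustration, the settings above would look roughly like this in
postgresql.conf (values are examples only, not recommendations):

```ini
track_global_snapshots = on        # write each commit's GlobalCSN to the SLRU
global_snapshot_defer_time = 30    # seconds to delay oldestXmin advancement
```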

> Or conversely, I have two systems that are bonded together using this
> system from the beginning, and then, as of a certain date, I want to
> break them apart into two standalone systems.  In your proposed
> design, are things like this possible?  Can you describe the setup
> involved?

Well, they are not actually "bonded" in any persistent way. If there are no
distributed transactions, there will be no logical or physical connection
between those nodes.

And returning to your original concern about "wildly different XID counters":
I want to emphasise that the only thing that travels between nodes is
GlobalCSNs, during the start and commit of a distributed transaction. And
that GlobalCSN is actually the timestamp of the commit, the real one, from
clock_gettime(). Clock time is supposedly more or less the same on different
nodes under normal conditions. But correctness here does not depend on the
degree of clock synchronisation; only the performance of global transactions
does.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From: Peter Eisentraut

On 5/1/18 12:27, Stas Kelvich wrote:
> Clock-SI is described in [5] and here I provide a small overview, which
> supposedly should be enough to catch the idea. Assume that each node runs Commit
> Sequence Number (CSN) based visibility: database tracks one counter for each
> transaction start (xid) and another counter for each transaction commit (csn).
> In such setting, a snapshot is just a single number -- a copy of current CSN at
> the moment when the snapshot was taken. Visibility rules are boiled down to
> checking whether current tuple's CSN is less than our snapshot's csn. Also it
> worth of mentioning that for the last 5 years there is an active proposal to
> switch Postgres to CSN-based visibility [6].

But that proposal has so far not succeeded.  How are you overcoming the
reasons for that?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Global snapshots

From: Stas Kelvich

> On 2 May 2018, at 05:58, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 5/1/18 12:27, Stas Kelvich wrote:
>> Clock-SI is described in [5] and here I provide a small overview, which
>> supposedly should be enough to catch the idea. Assume that each node runs Commit
>> Sequence Number (CSN) based visibility: database tracks one counter for each
>> transaction start (xid) and another counter for each transaction commit (csn).
>> In such setting, a snapshot is just a single number -- a copy of current CSN at
>> the moment when the snapshot was taken. Visibility rules are boiled down to
>> checking whether current tuple's CSN is less than our snapshot's csn. Also it
>> worth of mentioning that for the last 5 years there is an active proposal to
>> switch Postgres to CSN-based visibility [6].
>
> But that proposal has so far not succeeded.  How are you overcoming the
> reasons for that?

Well, the CSN proposal aims to switch all of Postgres's visibility machinery
to CSNs. That proposal is far more ambitious; here the original Postgres
visibility, with snapshots being arrays of XIDs, is preserved. In this patch
CSNs are written to the SLRU during commit (the way commit_ts does it) but
are read only in two cases:

1) When a local transaction comes across an XID that is in progress according
to the XIP-based snapshot, its CSN needs to be checked, as it may already be
InDoubt. XIDs that are viewed as committed don't need that check (in [6] they
also need to be checked through the SLRU).

2) If we are in a backend that imported a global snapshot, then only
CSN-based visibility can be used. But that happens only for global
transactions.

So I hope that local transaction performance will be affected only by the
in-progress check, and there are ways to circumvent this check.

Also, all this behaviour is optional and can be switched off by not enabling
track_global_snapshots.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From: Masahiko Sawada

On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> 1) To achieve commit atomicity of different nodes intermediate step is
>    introduced: at first running transaction is marked as InDoubt on all nodes,
>    and only after that each node commit it and stamps with a given GlobalCSN.
>    All readers who ran into tuples of an InDoubt transaction should wait until
>    it ends and recheck visibility.

I'm concerned that a long-running transaction could keep other transactions
waiting and then the system gets stuck. Can this happen? And is there any
workaround?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Global snapshots

From: Stas Kelvich

> On 3 May 2018, at 18:28, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> 1) To achieve commit atomicity of different nodes intermediate step is
>>   introduced: at first running transaction is marked as InDoubt on all nodes,
>>   and only after that each node commit it and stamps with a given GlobalCSN.
>>   All readers who ran into tuples of an InDoubt transaction should wait until
>>   it ends and recheck visibility.
>
> I'm concerned that long-running transaction could keep other
> transactions waiting and then the system gets stuck. Can this happen?
> and is there any workaround?

The InDoubt state is set just before commit, so it is a short-lived state.
During transaction execution a global tx looks like an ordinary running
transaction. A failure during 2PC where the coordinator is not able to
commit/abort the prepared transaction can result in a situation where
InDoubt tuples are locked for reading, but in such a situation the
coordinator is to blame. The same problems arise if a prepared transaction
holds locks, for example.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From: Robert Haas

On Tue, May 1, 2018 at 5:02 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Yes, that totally possible. On both systems you need:

Cool.

> * set track_global_snapshots='on' -- this will start writing each
>   transaction's commit sequence number to the SLRU.
> * set global_snapshot_defer_time to 30 seconds, for example -- this
>   will delay oldestXmin advancement for specified amount of time,
>   preserving old tuples.

So, is the idea that we'll definitely find out about any remote
transactions within 30 seconds, and then after we know about remote
transactions, we'll hold back OldestXmin some other way?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global snapshots

From: Stas Kelvich

> On 4 May 2018, at 22:09, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> So, is the idea that we'll definitely find out about any remote
> transactions within 30 seconds, and then after we know about remote
> transactions, we'll hold back OldestXmin some other way?

Yes, kind of. There is a procArray->global_snapshot_xmin variable which acts
as a barrier to xmin calculations in GetOldestXmin and GetSnapshotData, when
set.

Also, each second GetSnapshotData writes globalxmin (as it was before
procArray->global_snapshot_xmin was taken into account) into a circular
buffer with a size equal to the global_snapshot_defer_time value. That is
more or less the same thing as the Snapshot Too Old feature, but with a
bucket size of 1 second instead of 1 minute.
procArray->global_snapshot_xmin is always set to the oldest value in the
circular buffer.

This way the xmin calculation always gives a value that was current
global_snapshot_defer_time seconds ago, and we have a mapping from time (or
GlobalCSN) to globalxmin for each second in this range. So when some backend
imports a global snapshot with some GlobalCSN, that GlobalCSN is mapped to an
xmin and this xmin is set as Proc->xmin.
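Roughly, the per-second buffer could be modelled like this (a toy Python
sketch under my own naming, not the patch's code; time is in whole seconds):

```python
# Toy version of the per-second oldestXmin ring buffer described above.
# Each slot holds the globalxmin observed at that second; the barrier is
# the oldest retained value, which GetOldestXmin() may not pass.

DEFER_TIME = 5   # stand-in for global_snapshot_defer_time, in seconds

class XminRing:
    def __init__(self, size=DEFER_TIME):
        self.slots = [None] * size     # one bucket per second

    def record(self, now_sec, globalxmin):
        self.slots[now_sec % len(self.slots)] = (now_sec, globalxmin)

    def barrier(self):
        """Oldest retained globalxmin (the xmin calculation barrier)."""
        kept = [x for x in self.slots if x is not None]
        return min(kept)[1] if kept else None

    def xmin_for_csn(self, csn_sec, now_sec):
        """Map an imported GlobalCSN (as seconds) back to an xmin."""
        if now_sec - csn_sec >= len(self.slots):
            raise RuntimeError("global snapshot too old")
        entry = self.slots[csn_sec % len(self.slots)]
        return entry[1]

ring = XminRing()
for t, xmin in enumerate([100, 102, 105, 107, 110]):
    ring.record(t, xmin)
assert ring.barrier() == 100           # xmin held back to the oldest bucket
assert ring.xmin_for_csn(2, 4) == 105  # snapshot from t=2 maps to xmin 105
```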

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From: Robert Haas

On Sun, May 6, 2018 at 6:22 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Also each second GetSnapshotData writes globalxmin (as it was before
> procArray->global_snapshot_xmin was taken into account) into a circle
> buffer with a size equal to global_snapshot_defer_time value. That more
> or less the same thing as with Snapshot Too Old feature, but with a
> bucket size of 1 second instead of 1 minute.
> procArray->global_snapshot_xmin is always set to oldest
> value in circle buffer.
>
> This way the xmin calculation always gives a value that was current
> global_snapshot_defer_time seconds ago and we have mapping from time (or
> GlobalCSN) to globalxmin for each second in this range. So when
> some backends imports global snapshot with some GlobalCSN, that
> GlobalCSN is mapped to a xmin and this xmin is set as a Proc->xmin.

But what happens if a transaction starts on node A at time T0 but
first touches node B at a much later time T1, such that T1 - T0 >
global_snapshot_defer_time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global snapshots

From: Stas Kelvich

> On 7 May 2018, at 20:04, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> But what happens if a transaction starts on node A at time T0 but
> first touches node B at a much later time T1, such that T1 - T0 >
> global_snapshot_defer_time?
> 

Such a transaction will get a "global snapshot too old" error.

In principle such behaviour can be avoided by calculating the oldest global
csn among all cluster nodes, so that the oldest xmin on a particular node is
held back only when there is some open old transaction on another node. It's
easy to do from the global snapshot point of view, but it's not obvious how
to integrate that into postgres_fdw. Probably that will require a
bi-directional connection between postgres_fdw nodes (distributed deadlock
detection would also be easy with such a connection).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From: Robert Haas

On Tue, May 8, 2018 at 4:51 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> On 7 May 2018, at 20:04, Robert Haas <robertmhaas@gmail.com> wrote:
>> But what happens if a transaction starts on node A at time T0 but
>> first touches node B at a much later time T1, such that T1 - T0 >
>> global_snapshot_defer_time?
>
> Such transaction will get "global snapshot too old" error.

Ouch.  That's not so bad at READ COMMITTED, but at higher isolation
levels failure becomes extremely likely.  Any multi-statement
transaction that lasts longer than global_snapshot_defer_time is
pretty much doomed.

> In principle such behaviour can be avoided by calculating oldest
> global csn among all cluster nodes and oldest xmin on particular
> node will be held only when there is some open old transaction on
> other node. It's easy to do from global snapshot point of view,
> but it's not obvious how to integrate that into postgres_fdw. Probably
> that will require a bi-directional connection between postgres_fdw nodes
> (also distributed deadlock detection will be easy with such connection).

I don't think holding back xmin is a very good strategy.  Maybe it
won't be so bad if and when we get zheap, since only the undo log will
bloat rather than the table.  But as it stands, holding back xmin
means everything bloats and you have to CLUSTER or VACUUM FULL the
table in order to fix it.

If the behavior were really analogous to our existing "snapshot too
old" feature, it wouldn't be so bad.  Old snapshots continue to work
without error so long as they only read unmodified data, and only
error out if they hit modified pages.  SERIALIZABLE works according to
a similar principle: it worries about data that is written by one
transaction and read by another, but if there's a portion of the data
that is only read and not written, or at least not written by any
transactions that were active around the same time, then it's fine.
While the details aren't really clear to me, I'm inclined to think
that any solution we adopt for global snapshots ought to leverage this
same principle in some way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global snapshots

From: Masahiko Sawada

On Fri, May 4, 2018 at 2:11 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
>
>> On 3 May 2018, at 18:28, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Wed, May 2, 2018 at 1:27 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>>> 1) To achieve commit atomicity of different nodes intermediate step is
>>>   introduced: at first running transaction is marked as InDoubt on all nodes,
>>>   and only after that each node commit it and stamps with a given GlobalCSN.
>>>   All readers who ran into tuples of an InDoubt transaction should wait until
>>>   it ends and recheck visibility.
>>
>> I'm concerned that long-running transaction could keep other
>> transactions waiting and then the system gets stuck. Can this happen?
>> and is there any workaround?
>
> InDoubt state is set just before commit, so it is short-lived state.
> During transaction execution global tx looks like an ordinary running
> transaction. Failure during 2PC with coordinator not being able to
> commit/abort this prepared transaction can result in situation where
> InDoubt tuples will be locked for reading, but in such situation
> coordinator should be blamed. Same problems will arise if prepared
> transaction holds locks, for example.

Thank you for the explanation! I understood the algorithm. I have two
questions.

If I understand correctly, simple writes with ordinary 2PC don't block
readers of those writes. For example, an insertion on a node doesn't block
readers who read the inserted tuple. But with this global transaction, the
read will be blocked while the global transaction is in the InDoubt state.
Is that right? The InDoubt state would be short-lived for a local
transaction, but I'm not sure about a global transaction. Because during the
InDoubt state the coordinator has to prepare on all participant nodes and
assign the global csn to them (and end the global transaction), the global
transaction could be in the InDoubt state for a relatively long time. Also,
it could be even longer if the commit/rollback prepared is never performed
due to a failure of any of those nodes.

With this patch, can we start a remote transaction at READ COMMITTED with an
imported global snapshot if the local transaction started at READ COMMITTED?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Global snapshots

From: Stas Kelvich

> On 9 May 2018, at 17:51, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Ouch.  That's not so bad at READ COMMITTED, but at higher isolation
> levels failure becomes extremely likely.  Any multi-statement
> transaction that lasts longer than global_snapshot_defer_time is
> pretty much doomed.

Ouch indeed. The current xmin holding scheme has two major drawbacks: it
introduces a timeout between export/import of a snapshot, and it holds xmin
in a pessimistic way, so old versions will be preserved even when there were
no global transactions. On the positive side is simplicity: that is the only
way I can think of that doesn't require a distributed calculation of the
global xmin, which, in turn, would probably require a permanent connection
to the remote postgres_fdw node. It is not hard to add a background worker
to postgres_fdw that holds a permanent connection, but I am afraid that is a
very discussion-prone topic, and that's why I tried to avoid it.

> I don't think holding back xmin is a very good strategy.  Maybe it
> won't be so bad if and when we get zheap, since only the undo log will
> bloat rather than the table.  But as it stands, holding back xmin
> means everything bloats and you have to CLUSTER or VACUUM FULL the
> table in order to fix it.

Well, an open local transaction in Postgres holds back globalXmin for the
whole instance (with the exception of STO). Likewise, an active global
transaction should hold back the globalXmin of participating nodes to be
able to read the right versions (again, with the exception of STO).
  However, the xmin holding scheme itself can be different. For example, we
can periodically check (let's say every 1-2 seconds) the oldest GlobalCSN on
each node and delay globalXmin advancement only if some long transaction
really exists. So the period of bloat will be limited by those 1-2 seconds,
and it will not impose a timeout between export/import.
  Also, I want to note that global_snapshot_defer_time values of about tens
of seconds don't change much in terms of bloat compared to logical
replication. An active logical slot holds back globalXmin by setting
replication_slot_xmin, which is advanced on every RunningXacts record, which
in turn is logged every 15 seconds (hardcoded in LOG_SNAPSHOT_INTERVAL_MS).

> If the behavior were really analogous to our existing "snapshot too
> old" feature, it wouldn't be so bad.  Old snapshots continue to work
> without error so long as they only read unmodified data, and only
> error out if they hit modified pages.

That is actually a good idea that I missed, thanks. Since all the logic for
checking modified pages is already present, it is possible to reuse it and
raise the "global STO" error not right when an old snapshot is imported, but
only when the global transaction reads a modified page. I will implement
that and update the patch set.

Summarising, I think that introducing permanent connections to the postgres_fdw
node would put too much burden on this patch set, and that it will be possible
to address that later (in the long run such a connection will be needed anyway,
at least for deadlock detection). However, if you think that the current
behavior + STO analog isn't good enough, then I'm ready to pursue that track.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From
Stas Kelvich
Date:

> On 11 May 2018, at 04:05, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> If I understand correctly, simple writes with ordinary 2PC doesn't
> block read who reads that writes. For example, an insertion on a node
> doesn't block readers who reads the inserted tuple. But in this global
> transaction, the read will be blocked during the global transaction is
> InDoubt state. Is that right? InDoubt state will be short-live state
> if it's local transaction but I'm not sure in global transaction.
> Because during InDoubt state the coordinator has to prepare on all
> participant nodes and to assign the global csn to them (and end global
> transaction) the global transaction could be in InDoubt state for a
> relatively long time.

What I meant by "short-lived" is that InDoubt is set after the transaction
is prepared, so its duration doesn't depend on the size of the transaction,
only on network/commit latency. So you can have a transaction that did a bulk
load for several hours, and the InDoubt state will still last only for a
network round-trip and possibly an fsync that can happen during logging of the
commit record.

> Also, it could be more longer if the
> commit/rollback prepared never be performed due to a failure of any
> nodes of them.

In this case it definitely can. An individual node cannot know whether that
InDoubt transaction was committed or aborted elsewhere, so it has to wait
until somebody finishes the transaction. The particular time to recover
depends on how failures are handled.

Speaking more generally, in the presence of failures we can unlock tuples
only when consensus on the transaction commit is reached. And the FLP theorem
states that this can take an indefinitely long time in a fully asynchronous
network. However, from a more practical point of view, the probability of such
behaviour in a real network becomes negligible after several iterations of the
voting process (some evaluations can be found in
https://ieeexplore.ieee.org/document/1352999/). So several round-trips can be
a decent approximation of how long it should take to recover from the InDoubt
state in case of failure.

> With this patch, can we start a remote transaction at READ COMMITTED
> with imported a global snapshot if the local transaction started at
> READ COMMITTED?

In theory it is possible; one just needs to send a new snapshot before each
statement. With some amount of careful work it is possible to achieve
READ COMMITTED with postgres_fdw using global snapshots.


--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From
Robert Haas
Date:
On Mon, May 14, 2018 at 7:20 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Summarising, I think, that introducing some permanent connections to
> postgres_fdw node will put too much burden on this patch set and that it will
> be possible to address that later (in a long run such connection will be anyway
> needed at least for a deadlock detection). However, if you think that current
> behavior + STO analog isn't good enough, then I'm ready to pursue that track.

I don't think I'd be willing to commit to a particular approach at
this point.  I think the STO analog is an interesting idea and worth
more investigation, and I think the idea of a permanent connection
with chatter that can be used to resolve deadlocks, coordinate shared
state, etc. is also an interesting idea.  But there are probably lots
of ideas that somebody could come up with in this area that would
sound interesting but ultimately not work out.  Also, an awful lot
depends on quality of implementation.  If you come up with an
implementation of a permanent connection for coordination "chatter",
and the patch gets rejected, it's almost certainly not a sign that we
don't want that thing in general.  It means we don't want yours.  :-)

Actually, I think if we're going to pursue that approach, we ought to
back off a bit from thinking about global snapshots and think about
what kind of general mechanism we want.  For example, maybe you can
imagine it like a message bus, where there are a bunch of named
channels on which the server publishes messages and you can listen to
the ones you care about.  There could, for example, be a channel that
publishes the new system-wide globalxmin every time it changes, and
another channel that publishes the wait graph every time the deadlock
detector runs, and so on.  In fact, perhaps we should consider
implementing it using the existing LISTEN/NOTIFY framework: have a
bunch of channels that are predefined by PostgreSQL itself, and set
things up so that the server automatically begins publishing to those
channels as soon as anybody starts listening to them.  I have to
imagine that if we had a good mechanism for this, we'd get all sorts
of proposals for things to publish.  As long as they don't impose
overhead when nobody's listening, we should be able to be fairly
accommodating of such requests.

Or maybe that model is too limiting, either because we don't want to
broadcast to everyone but rather send specific messages to specific
connections, or else because we need a request-and-response mechanism
rather than what is in some sense a one-way communication channel.
Regardless, we should start by coming up with the right model for the
protocol first, bearing in mind how it's going to be used and other
things for which somebody might want to use it (deadlock detection,
failover, leader election), and then implement whatever we need for
global snapshots on top of it.  I don't think that writing the code
here is going to be hugely difficult, but coming up with a good design
is going to require some thought and discussion.

And, for that matter, I think the same thing is true for global
snapshots.  The coding is a lot harder for that than it is for some
new subprotocol, I'd imagine, but it's still easier than coming up
with a good design.  As far as I can see, and everybody can decide for
themselves how far they think that is, the proposal you're making now
sounds like a significant improvement over the XTM proposal.  In
particular, the provisioning and deprovisioning issues sound like they
have been thought through a lot more.  I'm happy to call that
progress.  At the same time, progress on a journey is not synonymous
with arrival at the destination, and I guess it seems to me that you
have some further research to do along the lines you've described:

1. Can we hold back xmin only when necessary and to the extent
necessary instead of all the time?
2. Can we use something like an STO analog, maybe as an optional
feature, rather than actually holding back xmin?

And I'd add:

3. Is there another approach altogether that doesn't rely on holding
back xmin at all?

For example, if you constructed the happens-after graph between
transactions in shared memory, including actions on all nodes, and
looked for cycles, you could abort transactions that would complete a
cycle.  (We say A happens-after B if A reads or writes data previously
written by B.)  If no cycle exists then all is well.  I'm pretty sure
it's been well-established that a naive implementation of this
algorithm is terribly unperformant, but for example SSI works on this
principle.  It reduces the bookkeeping involved by being willing to
abort transactions that aren't really creating a cycle if they look
like they *might* create a cycle.  Now that's an implementation *on
top of* snapshots for the purpose of getting true serializability
rather than a way of getting global snapshots per se, so it's not
suitable for what you're trying do here, but I think it shows that
algorithms based on cycle detection can be made practical in some
cases, and so maybe this is another such case.  On the other hand,
this whole line of thinking could also be a dead end...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Global snapshots

From
Stas Kelvich
Date:

> On 15 May 2018, at 15:53, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Actually, I think if we're going to pursue that approach, we ought to
> back off a bit from thinking about global snapshots and think about
> what kind of general mechanism we want.  For example, maybe you can
> imagine it like a message bus, where there are a bunch of named
> channels on which the server publishes messages and you can listen to
> the ones you care about.  There could, for example, be a channel that
> publishes the new system-wide globalxmin every time it changes, and
> another channel that publishes the wait graph every time the deadlock
> detector runs, and so on.  In fact, perhaps we should consider
> implementing it using the existing LISTEN/NOTIFY framework: have a
> bunch of channels that are predefined by PostgreSQL itself, and set
> things up so that the server automatically begins publishing to those
> channels as soon as anybody starts listening to them.  I have to
> imagine that if we had a good mechanism for this, we'd get all sorts
> of proposals for things to publish.  As long as they don't impose
> overhead when nobody's listening, we should be able to be fairly
> accommodating of such requests.
>
> Or maybe that model is too limiting, either because we don't want to
> broadcast to everyone but rather send specific messages to specific
> connections, or else because we need a request-and-response mechanism
> rather than what is in some sense a one-way communication channel.
> Regardless, we should start by coming up with the right model for the
> protocol first, bearing in mind how it's going to be used and other
> things for which somebody might want to use it (deadlock detection,
> failover, leader election), and then implement whatever we need for
> global snapshots on top of it.  I don't think that writing the code
> here is going to be hugely difficult, but coming up with a good design
> is going to require some thought and discussion.

Well, it would be cool to have some general mechanism to unreliably send
messages between Postgres instances. I was thinking about the same thing,
mostly in the context of our multimaster, where we have an arbiter bgworker
which collects 2PC responses and heartbeats from other nodes on a different
TCP port. It used to have some logic inside but evolved into just sending
messages from a shared-memory queue and waking backends upon message arrival.
But the necessity to manage a second port is painful and error-prone, at least
from a configuration point of view. So it would be nice to have a more general
mechanism to exchange messages via the postgres port. Ideally with an interface
like shm_mq's: send messages in one queue, subscribe to responses in a
different one. Among the other things that were mentioned (xmin, deadlocks,
elections/heartbeats), I'm especially interested in some multiplexing for
postgres_fdw, to save on context switches of individual backends while
sending statements.

Talking about the model, I think it would be cool to have some primitives like
the ones provided by ZeroMQ (message push/subscribe/pop) and then implement
some more complex ones, like scatter/gather, on top of them.

However, that's probably a topic for a very important, but different, thread.
For the needs of global snapshots something less ambitious will be suitable.

> And, for that matter, I think the same thing is true for global
> snapshots.  The coding is a lot harder for that than it is for some
> new subprotocol, I'd imagine, but it's still easier than coming up
> with a good design.

Sure. This whole global snapshot thing went through several internal redesigns
before becoming satisfactory from our standpoint. However, nothing prevents us
from further iterations. In this regard, it would be interesting to also hear
comments from the Postgres-XL team -- from my experience with the XL code,
these patches in core could help XL drop a lot of visibility-related ifdefs
and seriously offload the GTM. But maybe I'm missing something.

> I guess it seems to me that you
> have some further research to do along the lines you've described:
>
> 1. Can we hold back xmin only when necessary and to the extent
> necessary instead of all the time?
> 2. Can we use something like an STO analog, maybe as an optional
> feature, rather than actually holding back xmin?

Yes, to both questions. I'll implement that and share results.

> And I'd add:
>
> 3. Is there another approach altogether that doesn't rely on holding
> back xmin at all?

And for that question I believe the answer is no. If we want to keep
MVCC-like behaviour where read transactions aren't randomly aborted, we need
to keep old versions, regardless of whether the transaction is local or
global. And to keep old versions we need to hold back xmin to defuse HOT
pruning, microvacuum, vacuum, visibility maps, etc. At some point we can
switch to STO-like behaviour, but that should probably be used as protection
from unusually long transactions rather than as the standard behavior.

> For example, if you constructed the happens-after graph between
> transactions in shared memory, including actions on all nodes, and
> looked for cycles, you could abort transactions that would complete a
> cycle.  (We say A happens-after B if A reads or writes data previously
> written by B.)  If no cycle exists then all is well.

Well, again, it seems to me that any kind of transaction scheduler that
guarantees that read-only transactions will not abort (even a special kind of
RO like read-only deferred) needs to keep old versions.

Speaking about alternative approaches, a good evaluation of algorithms can be
found in [HARD17]. The Postgres model is close to the MVCC described in the
article, and if we enable STO with a small timeout then it becomes close to
the article's TIMESTAMP algorithm. The results show that both MVCC and
TIMESTAMP are less performant than the CALVIN approach =) But that one is
quite different from what has been done in Postgres (and probably in all other
databases except Calvin/Fauna itself) in the last 20-30 years.

Also, looking through a bunch of articles, I found that one of the first
articles about MVCC [REED78] (I thought the first was [BERN83], but it
actually references a bunch of previous articles, and [REED78] is one of them)
was in fact about distributed transactions and uses more or less the same
approach, with "pseudo-time" in its terminology, to order transactions and
assign snapshots.

[HARD17] https://dl.acm.org/citation.cfm?id=3055548
[REED78] https://dl.acm.org/citation.cfm?id=889815
[BERN83] https://dl.acm.org/citation.cfm?id=319998

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From
Arseny Sher
Date:
Hello,

I have looked through the patches and found them pretty accurate. I've
fixed a lot of small issues here and there; the updated patchset is
attached. But first, some high-level notes:

 * I agree that it would be cool to implement functionality like current
   "snapshot too old": that is, abort transaction with old global
   snapshot only if it really attempted to touch modified data.

 * I also agree with Stas that any attempts to trade oldestxmin in
   gossip between the nodes would drastically complicate this patch and
   make it discussion-prone; it would be nice first to get some feedback
   on general approach, especially from people trying to distribute
   Postgres.

 * One drawback of these patches is that only REPEATABLE READ is
   supported. For READ COMMITTED, we must export every new snapshot
   generated on the coordinator to all nodes, which is fairly easy to
   do. SERIALIZABLE will definitely require chattering between nodes,
   but that's a much less demanded isolation level (e.g. we still don't
   support it on replicas).

 * Another somewhat serious issue is that there is a risk of recency
   guarantee violation. If a client starts a transaction at a node with
   lagging clocks, its snapshot might not include some recently
   committed transactions; if the client works with different nodes, she
   might not even see her own changes. CockroachDB describes at [1] how
   they and Google Spanner overcome this problem. In short, both set a
   hard limit on the maximum allowed clock skew.  Spanner uses atomic
   clocks, so this skew is small, and they just wait it out at the end of
   each transaction before acknowledging the client. In CockroachDB, if
   a tuple is not visible but we are unsure whether it is truly invisible
   or it's just the skew (the difference between the snapshot and the
   tuple's csn is less than the skew), the transaction is restarted with
   an advanced snapshot. This process is not infinite because the upper
   bound (initial snapshot + max skew) stays the same; this is correct as
   we just want to ensure that our xact sees all transactions committed
   before it started. We can implement the same thing.

Now, the description of my mostly cosmetical changes:

 * Don't ERROR in BroadcastStmt to allow us to handle failure manually;
 * Check global_snapshot_defer_time in ImportGlobalSnapshot instead of
   falling on assert;
 * (Arguably) improved comments around locking at circular buffer
   maintenance; also, don't lock procarray during global_snapshot_xmin
   bump.
 * s/snaphot/snapshot, other typos.
 * Don't track_global_snapshots by default -- while handy for testing, it
   doesn't look generally good.
 * Set track_global_snapshots = true in tests everywhere.
 * GUC renamed from postgres_fdw.use_tsdtm to
   postgres_fdw.use_global_snapshots for consistency.
 * 003_bank_shared.pl test is removed. In its current shape (loading one
   node) it is useless, and if we bombard both nodes, a deadlock surely
   appears. In general, global snapshots are not needed for such a
   multimaster-like setup -- either there are no conflicts and we are
   fine, or there is a conflict, in which case we get a deadlock.
 * Fix initdb failure with non-zero global_snapshot_defer_time.
 * Enforce REPEATABLE READ since currently we export snap only once in
   xact.
 * Remove assertion that circular buffer entries are monotonic, as
   GetOldestXmin *can* go backwards.


[1] https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From 21687e75366df03b92e48c6125bb2e90d01bb70a Mon Sep 17 00:00:00 2001
From: Stas Kelvich <stanconn@gmail.com>
Date: Wed, 25 Apr 2018 16:05:46 +0300
Subject: [PATCH 1/3] GlobalCSNLog SLRU

---
 src/backend/access/transam/Makefile         |   3 +-
 src/backend/access/transam/global_csn_log.c | 439 ++++++++++++++++++++++++++++
 src/backend/access/transam/twophase.c       |   1 +
 src/backend/access/transam/varsup.c         |   2 +
 src/backend/access/transam/xlog.c           |  12 +
 src/backend/storage/ipc/ipci.c              |   3 +
 src/backend/storage/ipc/procarray.c         |   3 +
 src/backend/storage/lmgr/lwlocknames.txt    |   1 +
 src/backend/utils/misc/guc.c                |   9 +
 src/backend/utils/probes.d                  |   2 +
 src/bin/initdb/initdb.c                     |   3 +-
 src/include/access/global_csn_log.h         |  30 ++
 src/include/storage/lwlock.h                |   1 +
 src/include/utils/snapshot.h                |   3 +
 14 files changed, 510 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/transam/global_csn_log.c
 create mode 100644 src/include/access/global_csn_log.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..03aa360ea3 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,7 +12,8 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
+OBJS = clog.o commit_ts.o global_csn_log.o generic_xlog.o \
+    multixact.o parallel.o rmgr.o slru.o \
     subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
     xact.o xlog.o xlogarchive.o xlogfuncs.o \
     xloginsert.o xlogreader.o xlogutils.o
diff --git a/src/backend/access/transam/global_csn_log.c b/src/backend/access/transam/global_csn_log.c
new file mode 100644
index 0000000000..d9d66528e4
--- /dev/null
+++ b/src/backend/access/transam/global_csn_log.c
@@ -0,0 +1,439 @@
+/*-----------------------------------------------------------------------------
+ *
+ * global_csn_log.c
+ *        Track global commit sequence numbers of finished transactions
+ *
+ * Implementation of cross-node transaction isolation relies on commit sequence
+ * number (CSN) based visibility rules.  This module provides an SLRU to store
+ * the CSN for each transaction.  This mapping needs to be kept only for xids
+ * greater than oldestXid, but that can require arbitrarily large amounts of
+ * memory in case of long-lived transactions.  Because of similar lifetime and
+ * persistence requirements this module is quite similar to subtrans.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_csn_log.c
+ *
+ *-----------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/slru.h"
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "miscadmin.h"
+#include "pg_trace.h"
+#include "utils/snapmgr.h"
+
+bool track_global_snapshots;
+
+/*
+ * Defines for GlobalCSNLog page sizes.  A page is the same BLCKSZ as is used
+ * everywhere else in Postgres.
+ *
+ * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
+ * GlobalCSNLog page numbering also wraps around at
+ * 0xFFFFFFFF/GCSNLOG_XACTS_PER_PAGE, and GlobalCSNLog segment numbering at
+ * 0xFFFFFFFF/GCSNLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
+ * explicit notice of that fact in this module, except when comparing segment
+ * and page numbers in TruncateGlobalCSNLog (see GlobalCSNLogPagePrecedes).
+ */
+
+/* We store the commit GlobalCSN for each xid */
+#define GCSNLOG_XACTS_PER_PAGE (BLCKSZ / sizeof(GlobalCSN))
+
+#define TransactionIdToPage(xid)    ((xid) / (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) GCSNLOG_XACTS_PER_PAGE)
+
+/*
+ * Link to shared-memory data structures for GlobalCSNLog control
+ */
+static SlruCtlData GlobalCSNLogCtlData;
+#define GlobalCsnlogCtl (&GlobalCSNLogCtlData)
+
+static int    ZeroGlobalCSNLogPage(int pageno);
+static bool GlobalCSNLogPagePrecedes(int page1, int page2);
+static void GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+                                      TransactionId *subxids,
+                                      GlobalCSN csn, int pageno);
+static void GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn,
+                                      int slotno);
+
+/*
+ * GlobalCSNLogSetCSN
+ *
+ * Record GlobalCSN of transaction and its subtransaction tree.
+ *
+ * xid is a single xid to set status for. This will typically be the top level
+ * transactionid for a top level commit or abort. It can also be a
+ * subtransaction when we record transaction aborts.
+ *
+ * subxids is an array of xids of length nsubxids, representing subtransactions
+ * in the tree of xid. In various cases nsubxids may be zero.
+ *
+ * csn is the commit sequence number of the transaction. It should be
+ * AbortedGlobalCSN for abort cases.
+ */
+void
+GlobalCSNLogSetCSN(TransactionId xid, int nsubxids,
+                     TransactionId *subxids, GlobalCSN csn)
+{
+    int            pageno;
+    int            i = 0;
+    int            offset = 0;
+
+    /* Callers of GlobalCSNLogSetCSN() must check GUC params */
+    Assert(track_global_snapshots);
+
+    Assert(TransactionIdIsValid(xid));
+
+    pageno = TransactionIdToPage(xid);        /* get page of parent */
+    for (;;)
+    {
+        int            num_on_page = 0;
+
+        while (i < nsubxids && TransactionIdToPage(subxids[i]) == pageno)
+        {
+            num_on_page++;
+            i++;
+        }
+
+        GlobalCSNLogSetPageStatus(xid,
+                            num_on_page, subxids + offset,
+                            csn, pageno);
+        if (i >= nsubxids)
+            break;
+
+        offset = i;
+        pageno = TransactionIdToPage(subxids[offset]);
+        xid = InvalidTransactionId;
+    }
+}
+
+/*
+ * Record the final state of transaction entries in the csn log for
+ * all entries on a single page.  Atomic only on this page.
+ *
+ * Otherwise API is same as TransactionIdSetTreeStatus()
+ */
+static void
+GlobalCSNLogSetPageStatus(TransactionId xid, int nsubxids,
+                           TransactionId *subxids,
+                           GlobalCSN csn, int pageno)
+{
+    int            slotno;
+    int            i;
+
+    LWLockAcquire(GlobalCSNLogControlLock, LW_EXCLUSIVE);
+
+    slotno = SimpleLruReadPage(GlobalCsnlogCtl, pageno, true, xid);
+
+    /* Subtransactions first, if needed ... */
+    for (i = 0; i < nsubxids; i++)
+    {
+        Assert(GlobalCsnlogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
+        GlobalCSNLogSetCSNInSlot(subxids[i],    csn, slotno);
+    }
+
+    /* ... then the main transaction */
+    if (TransactionIdIsValid(xid))
+        GlobalCSNLogSetCSNInSlot(xid, csn, slotno);
+
+    GlobalCsnlogCtl->shared->page_dirty[slotno] = true;
+
+    LWLockRelease(GlobalCSNLogControlLock);
+}
+
+/*
+ * Sets the commit GlobalCSN of a single transaction.
+ */
+static void
+GlobalCSNLogSetCSNInSlot(TransactionId xid, GlobalCSN csn, int slotno)
+{
+    int            entryno = TransactionIdToPgIndex(xid);
+    GlobalCSN *ptr;
+
+    Assert(LWLockHeldByMe(GlobalCSNLogControlLock));
+
+    ptr = (GlobalCSN *) (GlobalCsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(GlobalCSN));
+
+    *ptr = csn;
+}
+
+/*
+ * Interrogate the state of a transaction in the log.
+ *
+ * NB: this is a low-level routine and is NOT the preferred entry point
+ * for most uses; TransactionIdGetGlobalCSN() in global_snapshot.c is the
+ * intended caller.
+ */
+GlobalCSN
+GlobalCSNLogGetCSN(TransactionId xid)
+{
+    int            pageno = TransactionIdToPage(xid);
+    int            entryno = TransactionIdToPgIndex(xid);
+    int            slotno;
+    GlobalCSN *ptr;
+    GlobalCSN    global_csn;
+
+    /* Callers of GlobalCSNLogGetCSN() must check GUC params */
+    Assert(track_global_snapshots);
+
+    /* Can't ask about stuff that might not be around anymore */
+    Assert(TransactionIdFollowsOrEquals(xid, TransactionXmin));
+
+    /* lock is acquired by SimpleLruReadPage_ReadOnly */
+
+    slotno = SimpleLruReadPage_ReadOnly(GlobalCsnlogCtl, pageno, xid);
+    ptr = (GlobalCSN *) (GlobalCsnlogCtl->shared->page_buffer[slotno] + entryno * sizeof(GlobalCSN));
+    global_csn = *ptr;
+
+    LWLockRelease(GlobalCSNLogControlLock);
+
+    return global_csn;
+}
+
+/*
+ * Number of shared GlobalCSNLog buffers.
+ */
+static Size
+GlobalCSNLogShmemBuffers(void)
+{
+    return Min(32, Max(4, NBuffers / 512));
+}
+
+/*
+ * Reserve shared memory for GlobalCsnlogCtl.
+ */
+Size
+GlobalCSNLogShmemSize(void)
+{
+    if (!track_global_snapshots)
+        return 0;
+
+    return SimpleLruShmemSize(GlobalCSNLogShmemBuffers(), 0);
+}
+
+/*
+ * Initialization of shared memory for GlobalCSNLog.
+ */
+void
+GlobalCSNLogShmemInit(void)
+{
+    if (!track_global_snapshots)
+        return;
+
+    GlobalCsnlogCtl->PagePrecedes = GlobalCSNLogPagePrecedes;
+    SimpleLruInit(GlobalCsnlogCtl, "GlobalCSNLog Ctl", GlobalCSNLogShmemBuffers(), 0,
+                  GlobalCSNLogControlLock, "pg_global_csn", LWTRANCHE_GLOBAL_CSN_LOG_BUFFERS);
+}
+
+/*
+ * This func must be called ONCE on system install.  It creates the initial
+ * GlobalCSNLog segment.  The pg_global_csn directory is assumed to have been
+ * created by initdb, and GlobalCSNLogShmemInit must have been called already.
+ */
+void
+BootStrapGlobalCSNLog(void)
+{
+    int            slotno;
+
+    if (!track_global_snapshots)
+        return;
+
+    LWLockAcquire(GlobalCSNLogControlLock, LW_EXCLUSIVE);
+
+    /* Create and zero the first page of the commit log */
+    slotno = ZeroGlobalCSNLogPage(0);
+
+    /* Make sure it's written out */
+    SimpleLruWritePage(GlobalCsnlogCtl, slotno);
+    Assert(!GlobalCsnlogCtl->shared->page_dirty[slotno]);
+
+    LWLockRelease(GlobalCSNLogControlLock);
+}
+
+/*
+ * Initialize (or reinitialize) a page of GlobalCSNLog to zeroes.
+ *
+ * The page is not actually written, just set up in shared memory.
+ * The slot number of the new page is returned.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+ZeroGlobalCSNLogPage(int pageno)
+{
+    Assert(LWLockHeldByMe(GlobalCSNLogControlLock));
+    return SimpleLruZeroPage(GlobalCsnlogCtl, pageno);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend startup,
+ * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+ *
+ * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
+ * if there are none.
+ */
+void
+StartupGlobalCSNLog(TransactionId oldestActiveXID)
+{
+    int            startPage;
+    int            endPage;
+
+    if (!track_global_snapshots)
+        return;
+
+    /*
+     * Since we don't expect pg_global_csn to be valid across crashes, we
+     * initialize the currently-active page(s) to zeroes during startup.
+     * Whenever we advance into a new page, ExtendGlobalCSNLog will likewise
+     * zero the new page without regard to whatever was previously on disk.
+     */
+    LWLockAcquire(GlobalCSNLogControlLock, LW_EXCLUSIVE);
+
+    startPage = TransactionIdToPage(oldestActiveXID);
+    endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
+
+    while (startPage != endPage)
+    {
+        (void) ZeroGlobalCSNLogPage(startPage);
+        startPage++;
+        /* must account for wraparound */
+        if (startPage > TransactionIdToPage(MaxTransactionId))
+            startPage = 0;
+    }
+    (void) ZeroGlobalCSNLogPage(startPage);
+
+    LWLockRelease(GlobalCSNLogControlLock);
+}
+
+/*
+ * This must be called ONCE during postmaster or standalone-backend shutdown
+ */
+void
+ShutdownGlobalCSNLog(void)
+{
+    if (!track_global_snapshots)
+        return;
+
+    /*
+     * Flush dirty GlobalCSNLog pages to disk.
+     *
+     * This is not actually necessary from a correctness point of view. We do
+     * it merely as a debugging aid.
+     */
+    TRACE_POSTGRESQL_GLOBALCSNLOG_CHECKPOINT_START(false);
+    SimpleLruFlush(GlobalCsnlogCtl, false);
+    TRACE_POSTGRESQL_GLOBALCSNLOG_CHECKPOINT_DONE(false);
+}
+
+/*
+ * Perform a checkpoint --- either during shutdown, or on-the-fly
+ */
+void
+CheckPointGlobalCSNLog(void)
+{
+    if (!track_global_snapshots)
+        return;
+
+    /*
+     * Flush dirty GlobalCSNLog pages to disk.
+     *
+     * This is not actually necessary from a correctness point of view. We do
+     * it merely to improve the odds that writing of dirty pages is done by
+     * the checkpoint process and not by backends.
+     */
+    TRACE_POSTGRESQL_GLOBALCSNLOG_CHECKPOINT_START(true);
+    SimpleLruFlush(GlobalCsnlogCtl, true);
+    TRACE_POSTGRESQL_GLOBALCSNLOG_CHECKPOINT_DONE(true);
+}
+
+/*
+ * Make sure that GlobalCSNLog has room for a newly-allocated XID.
+ *
+ * NB: this is called while holding XidGenLock.  We want it to be very fast
+ * most of the time; even when it's not so fast, no actual I/O need happen
+ * unless we're forced to write out a dirty clog or xlog page to make room
+ * in shared memory.
+ */
+void
+ExtendGlobalCSNLog(TransactionId newestXact)
+{
+    int            pageno;
+
+    if (!track_global_snapshots)
+        return;
+
+    /*
+     * No work except at first XID of a page.  But beware: just after
+     * wraparound, the first XID of page zero is FirstNormalTransactionId.
+     */
+    if (TransactionIdToPgIndex(newestXact) != 0 &&
+        !TransactionIdEquals(newestXact, FirstNormalTransactionId))
+        return;
+
+    pageno = TransactionIdToPage(newestXact);
+
+    LWLockAcquire(GlobalCSNLogControlLock, LW_EXCLUSIVE);
+
+    /* Zero the page and make an XLOG entry about it */
+    ZeroGlobalCSNLogPage(pageno);
+
+    LWLockRelease(GlobalCSNLogControlLock);
+}
+
+/*
+ * Remove all GlobalCSNLog segments before the one holding the passed
+ * transaction ID.
+ *
+ * This is normally called during checkpoint, with oldestXact being the
+ * oldest TransactionXmin of any running transaction.
+ */
+void
+TruncateGlobalCSNLog(TransactionId oldestXact)
+{
+    int            cutoffPage;
+
+    if (!track_global_snapshots)
+        return;
+
+    /*
+     * The cutoff point is the start of the segment containing oldestXact. We
+     * pass the *page* containing oldestXact to SimpleLruTruncate. We step
+     * back one transaction to avoid passing a cutoff page that hasn't been
+     * created yet in the rare case that oldestXact would be the first item on
+     * a page and oldestXact == next XID.  In that case, if we didn't subtract
+     * one, we'd trigger SimpleLruTruncate's wraparound detection.
+     */
+    TransactionIdRetreat(oldestXact);
+    cutoffPage = TransactionIdToPage(oldestXact);
+
+    SimpleLruTruncate(GlobalCsnlogCtl, cutoffPage);
+}
+
+/*
+ * Decide which of two GlobalCSNLog page numbers is "older" for truncation
+ * purposes.
+ *
+ * We need to use comparison of TransactionIds here in order to do the right
+ * thing with wraparound XID arithmetic.  However, if we are asked about
+ * page number zero, we don't want to hand InvalidTransactionId to
+ * TransactionIdPrecedes: it'll get weird about permanent xact IDs.  So,
+ * offset both xids by FirstNormalTransactionId to avoid that.
+ */
+static bool
+GlobalCSNLogPagePrecedes(int page1, int page2)
+{
+    TransactionId xid1;
+    TransactionId xid2;
+
+    xid1 = ((TransactionId) page1) * GCSNLOG_XACTS_PER_PAGE;
+    xid1 += FirstNormalTransactionId;
+    xid2 = ((TransactionId) page2) * GCSNLOG_XACTS_PER_PAGE;
+    xid2 += FirstNormalTransactionId;
+
+    return TransactionIdPrecedes(xid1, xid2);
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 306861bb79..3aee5e50c5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/global_csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 394843f7e9..4035b90d5e 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -15,6 +15,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/global_csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/xact.h"
@@ -169,6 +170,7 @@ GetNewTransactionId(bool isSubXact)
      * Extend pg_subtrans and pg_commit_ts too.
      */
     ExtendCLOG(xid);
+    ExtendGlobalCSNLog(xid);
     ExtendCommitTs(xid);
     ExtendSUBTRANS(xid);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..ca0d934c76 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/global_csn_log.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5258,6 +5259,7 @@ BootStrapXLOG(void)
 
     /* Bootstrap the commit log, too */
     BootStrapCLOG();
+    BootStrapGlobalCSNLog();
     BootStrapCommitTs();
     BootStrapSUBTRANS();
     BootStrapMultiXact();
@@ -7066,6 +7068,7 @@ StartupXLOG(void)
              * maintained during recovery and need not be started yet.
              */
             StartupCLOG();
+            StartupGlobalCSNLog(oldestActiveXID);
             StartupSUBTRANS(oldestActiveXID);
 
             /*
@@ -7864,6 +7867,7 @@ StartupXLOG(void)
     if (standbyState == STANDBY_DISABLED)
     {
         StartupCLOG();
+        StartupGlobalCSNLog(oldestActiveXID);
         StartupSUBTRANS(oldestActiveXID);
     }
 
@@ -8527,6 +8531,7 @@ ShutdownXLOG(int code, Datum arg)
         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
     }
     ShutdownCLOG();
+    ShutdownGlobalCSNLog();
     ShutdownCommitTs();
     ShutdownSUBTRANS();
     ShutdownMultiXact();
@@ -9104,7 +9109,10 @@ CreateCheckPoint(int flags)
      * StartupSUBTRANS hasn't been called yet.
      */
     if (!RecoveryInProgress())
+    {
         TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+        TruncateGlobalCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+    }
 
     /* Real work is done, but log and update stats before releasing lock. */
     LogCheckpointEnd(false);
@@ -9180,6 +9188,7 @@ static void
 CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 {
     CheckPointCLOG();
+    CheckPointGlobalCSNLog();
     CheckPointCommitTs();
     CheckPointSUBTRANS();
     CheckPointMultiXact();
@@ -9463,7 +9472,10 @@ CreateRestartPoint(int flags)
      * this because StartupSUBTRANS hasn't been called yet.
      */
     if (EnableHotStandby)
+    {
         TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+        TruncateGlobalCSNLog(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
+    }
 
     /* Real work is done, but log and update before releasing lock. */
     LogCheckpointEnd(true);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..2af468fc6a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/global_csn_log.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, ProcGlobalShmemSize());
         size = add_size(size, XLOGShmemSize());
         size = add_size(size, CLOGShmemSize());
+        size = add_size(size, GlobalCSNLogShmemSize());
         size = add_size(size, CommitTsShmemSize());
         size = add_size(size, SUBTRANSShmemSize());
         size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
      */
     XLOGShmemInit();
     CLOGShmemInit();
+    GlobalCSNLogShmemInit();
     CommitTsShmemInit();
     SUBTRANSShmemInit();
     MultiXactShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd20497d81..64ab249615 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -46,6 +46,7 @@
 #include <signal.h>
 
 #include "access/clog.h"
+#include "access/global_csn_log.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -830,6 +831,7 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
     while (TransactionIdPrecedes(latestObservedXid, running->nextXid))
     {
         ExtendSUBTRANS(latestObservedXid);
+        ExtendGlobalCSNLog(latestObservedXid);
         TransactionIdAdvance(latestObservedXid);
     }
     TransactionIdRetreat(latestObservedXid);    /* = running->nextXid - 1 */
@@ -3209,6 +3211,7 @@ RecordKnownAssignedTransactionIds(TransactionId xid)
         while (TransactionIdPrecedes(next_expected_xid, xid))
         {
             TransactionIdAdvance(next_expected_xid);
+            ExtendGlobalCSNLog(next_expected_xid);
             ExtendSUBTRANS(next_expected_xid);
         }
         Assert(next_expected_xid == xid);
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb..9615058f29 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+GlobalCSNLogControlLock                46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a88ea6cfc9..2701528c55 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1019,6 +1019,15 @@ static struct config_bool ConfigureNamesBool[] =
         NULL, NULL, NULL
     },
     {
+        {"track_global_snapshots", PGC_POSTMASTER, RESOURCES_MEM,
+            gettext_noop("Enable global snapshot tracking."),
+            gettext_noop("Used to achieve REPEATABLE READ isolation level for postgres_fdw transactions.")
+        },
+        &track_global_snapshots,
+        true, /* XXX: set true to simplify testing. XXX2: Seems that RESOURCES_MEM isn't the best category */
+        NULL, NULL, NULL
+    },
+    {
         {"ssl", PGC_SIGHUP, CONN_AUTH_SSL,
             gettext_noop("Enables SSL connections."),
             NULL
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index ad06e8e2ea..5ebe2ad888 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -77,6 +77,8 @@ provider postgresql {
     probe clog__checkpoint__done(bool);
     probe subtrans__checkpoint__start(bool);
     probe subtrans__checkpoint__done(bool);
+    probe globalcsnlog__checkpoint__start(bool);
+    probe globalcsnlog__checkpoint__done(bool);
     probe multixact__checkpoint__start(bool);
     probe multixact__checkpoint__done(bool);
     probe twophase__checkpoint__start();
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3f203c6ca6..40fceb81f8 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -220,7 +220,8 @@ static const char *const subdirs[] = {
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
-    "pg_logical/mappings"
+    "pg_logical/mappings",
+    "pg_global_csn"
 };
 
 
diff --git a/src/include/access/global_csn_log.h b/src/include/access/global_csn_log.h
new file mode 100644
index 0000000000..417c26c8a3
--- /dev/null
+++ b/src/include/access/global_csn_log.h
@@ -0,0 +1,30 @@
+/*
+ * global_csn_log.h
+ *
+ * Commit-Sequence-Number log.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/global_csn_log.h
+ */
+#ifndef GLOBAL_CSN_LOG_H
+#define GLOBAL_CSN_LOG_H
+
+#include "access/xlog.h"
+#include "utils/snapshot.h"
+
+extern void GlobalCSNLogSetCSN(TransactionId xid, int nsubxids,
+                               TransactionId *subxids, GlobalCSN csn);
+extern GlobalCSN GlobalCSNLogGetCSN(TransactionId xid);
+
+extern Size GlobalCSNLogShmemSize(void);
+extern void GlobalCSNLogShmemInit(void);
+extern void BootStrapGlobalCSNLog(void);
+extern void StartupGlobalCSNLog(TransactionId oldestActiveXID);
+extern void ShutdownGlobalCSNLog(void);
+extern void CheckPointGlobalCSNLog(void);
+extern void ExtendGlobalCSNLog(TransactionId newestXact);
+extern void TruncateGlobalCSNLog(TransactionId oldestXact);
+
+#endif   /* GLOBAL_CSN_LOG_H */
\ No newline at end of file
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f66..ab330b71c2 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -198,6 +198,7 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_CLOG_BUFFERS = NUM_INDIVIDUAL_LWLOCKS,
     LWTRANCHE_COMMITTS_BUFFERS,
     LWTRANCHE_SUBTRANS_BUFFERS,
+    LWTRANCHE_GLOBAL_CSN_LOG_BUFFERS,
     LWTRANCHE_MXACTOFFSET_BUFFERS,
     LWTRANCHE_MXACTMEMBER_BUFFERS,
     LWTRANCHE_ASYNC_BUFFERS,
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index a8a5a8f4c0..318d41e6f7 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -24,6 +24,9 @@ typedef struct SnapshotData *Snapshot;
 
 #define InvalidSnapshot        ((Snapshot) NULL)
 
+typedef uint64 GlobalCSN;
+extern bool track_global_snapshots;
+
 /*
  * We use SnapshotData structures to represent both "regular" (MVCC)
  * snapshots and "special" snapshots that have non-MVCC semantics.
-- 
2.11.0

From 4000408863fca43b1e79c55d638525c0cd04f92a Mon Sep 17 00:00:00 2001
From: Stas Kelvich <stanconn@gmail.com>
Date: Wed, 25 Apr 2018 16:22:30 +0300
Subject: [PATCH 2/3] Global snapshots

---
 src/backend/access/transam/Makefile           |   2 +-
 src/backend/access/transam/global_snapshot.c  | 754 ++++++++++++++++++++++++++
 src/backend/access/transam/twophase.c         | 156 ++++++
 src/backend/access/transam/xact.c             |  29 +
 src/backend/access/transam/xlog.c             |   2 +
 src/backend/storage/ipc/ipci.c                |   3 +
 src/backend/storage/ipc/procarray.c           |  93 +++-
 src/backend/storage/lmgr/lwlocknames.txt      |   1 +
 src/backend/storage/lmgr/proc.c               |   5 +
 src/backend/utils/misc/guc.c                  |  13 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/backend/utils/time/snapmgr.c              | 103 ++++
 src/backend/utils/time/tqual.c                |  65 ++-
 src/include/access/global_snapshot.h          |  72 +++
 src/include/access/twophase.h                 |   1 +
 src/include/catalog/pg_proc.dat               |  14 +
 src/include/datatype/timestamp.h              |   3 +
 src/include/fmgr.h                            |   1 +
 src/include/portability/instr_time.h          |  10 +
 src/include/storage/proc.h                    |  15 +
 src/include/storage/procarray.h               |   8 +
 src/include/utils/snapmgr.h                   |   3 +
 src/include/utils/snapshot.h                  |   8 +
 23 files changed, 1355 insertions(+), 8 deletions(-)
 create mode 100644 src/backend/access/transam/global_snapshot.c
 create mode 100644 src/include/access/global_snapshot.h

diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 03aa360ea3..8ef677cada 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o commit_ts.o global_csn_log.o generic_xlog.o \
+OBJS = clog.o commit_ts.o global_csn_log.o global_snapshot.o generic_xlog.o \
     multixact.o parallel.o rmgr.o slru.o \
     subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
     xact.o xlog.o xlogarchive.o xlogfuncs.o \
diff --git a/src/backend/access/transam/global_snapshot.c b/src/backend/access/transam/global_snapshot.c
new file mode 100644
index 0000000000..b9d6c56334
--- /dev/null
+++ b/src/backend/access/transam/global_snapshot.c
@@ -0,0 +1,754 @@
+/*-------------------------------------------------------------------------
+ *
+ * global_snapshot.c
+ *        Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/global_snapshot.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/global_csn_log.h"
+#include "access/global_snapshot.h"
+#include "access/transam.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "portability/instr_time.h"
+#include "storage/lmgr.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+#include "miscadmin.h"
+
+/* Raise a warning if imported global_csn exceeds ours by this value. */
+#define SNAP_DESYNC_COMPLAIN (1*NSECS_PER_SEC) /* 1 second */
+
+/*
+ * GlobalSnapshotState
+ *
+ * Do not trust local clocks to be strictly monotonic; save the last acquired
+ * value so that later we can compare the next timestamp with it. Accessed
+ * through GlobalSnapshotGenerate() and GlobalSnapshotSync().
+ */
+typedef struct
+{
+    GlobalCSN         last_global_csn;
+    volatile slock_t lock;
+} GlobalSnapshotState;
+
+static GlobalSnapshotState *gsState;
+
+
+/*
+ * GUC to delay the advance of oldestXid by this amount of time. Also
+ * determines the size of the GlobalSnapshotXidMap circular buffer.
+ */
+int global_snapshot_defer_time;
+
+/*
+ * Enables this module.
+ */
+extern bool track_global_snapshots;
+
+/*
+ * GlobalSnapshotXidMap
+ *
+ * To be able to install a global snapshot that points into the past we need
+ * to keep old tuple versions around and therefore delay the advance of
+ * oldestXid.  Here we keep track of the correspondence between a snapshot's
+ * global_csn and the oldestXid that was in effect at the time the snapshot
+ * was taken.  This is much like what "snapshot too old"'s
+ * OldSnapshotControlData does, but with one-second granularity.
+ *
+ * Different strategies could be employed to hold back oldestXid (e.g. we
+ * could track the oldest global_csn-based snapshot among cluster nodes and
+ * map it to oldestXid on each node), but the one implemented here tries to
+ * avoid cross-node communication, which is tricky in the case of
+ * postgres_fdw.
+ *
+ * On each snapshot acquisition GlobalSnapshotMapXmin() is called and stores
+ * the correspondence between the current global_csn and oldestXmin in a
+ * sparse way: the global_csn is rounded to seconds (here we use the fact
+ * that global_csn is just a timestamp) and oldestXmin is stored in a
+ * circular buffer where the rounded global_csn acts as an offset from the
+ * current buffer head.  The size of the circular buffer is controlled by
+ * the global_snapshot_defer_time GUC.
+ *
+ * When a global snapshot arrives from a different node we check that its
+ * global_csn is still present in our map; otherwise we error out with a
+ * "snapshot too old" message.  If the global_csn is successfully mapped to
+ * an oldestXid we move the backend's pgxact->xmin to proc->originalXmin and
+ * set pgxact->xmin to the mapped oldestXid.  That way GetOldestXmin() can
+ * take into account backends with imported global snapshots and old tuple
+ * versions will be preserved.
+ *
+ * Also, while calculating oldestXmin for our map in the presence of
+ * imported global snapshots, we should use proc->originalXmin instead of
+ * the pgxact->xmin that was set during import.  Otherwise we can create a
+ * feedback loop: the xmins of imported global snapshots were calculated
+ * using our map, new map entries would be calculated based on those xmins,
+ * and we risk being stuck forever with one non-increasing oldestXmin.  All
+ * other callers of GetOldestXmin() still use pgxact->xmin, so old tuple
+ * versions are preserved.
+ */
+typedef struct GlobalSnapshotXidMap
+{
+    int                 head;                /* offset of current freshest value */
+    int                 size;                /* total size of circular buffer */
+    GlobalCSN_atomic last_csn_seconds;    /* last rounded global_csn that changed
+                                         * xmin_by_second[] */
+    TransactionId   *xmin_by_second;    /* circular buffer of oldestXmin's */
+}
+GlobalSnapshotXidMap;
+
+static GlobalSnapshotXidMap *gsXidMap;
+
+
+/* Estimate shared memory space needed */
+Size
+GlobalSnapshotShmemSize(void)
+{
+    Size    size = 0;
+
+    if (track_global_snapshots || global_snapshot_defer_time > 0)
+    {
+        size += MAXALIGN(sizeof(GlobalSnapshotState));
+    }
+
+    if (global_snapshot_defer_time > 0)
+    {
+        size += sizeof(GlobalSnapshotXidMap);
+        size += global_snapshot_defer_time*sizeof(TransactionId);
+        size = MAXALIGN(size);
+    }
+
+    return size;
+}
+
+/* Init shared memory structures */
+void
+GlobalSnapshotShmemInit(void)
+{
+    bool found;
+
+    if (track_global_snapshots || global_snapshot_defer_time > 0)
+    {
+        gsState = ShmemInitStruct("gsState",
+                                sizeof(GlobalSnapshotState),
+                                &found);
+        if (!found)
+        {
+            gsState->last_global_csn = 0;
+            SpinLockInit(&gsState->lock);
+        }
+    }
+
+    if (global_snapshot_defer_time > 0)
+    {
+        gsXidMap = ShmemInitStruct("gsXidMap",
+                                   sizeof(GlobalSnapshotXidMap),
+                                   &found);
+        if (!found)
+        {
+            int i;
+
+            pg_atomic_init_u64(&gsXidMap->last_csn_seconds, 0);
+            gsXidMap->head = 0;
+            gsXidMap->size = global_snapshot_defer_time;
+            gsXidMap->xmin_by_second =
+                            ShmemAlloc(sizeof(TransactionId)*gsXidMap->size);
+
+            for (i = 0; i < gsXidMap->size; i++)
+                gsXidMap->xmin_by_second[i] = InvalidTransactionId;
+        }
+    }
+}
+
+/*
+ * GlobalSnapshotStartup
+ *
+ * Set gsXidMap entries to oldestActiveXID during startup.
+ */
+void
+GlobalSnapshotStartup(TransactionId oldestActiveXID)
+{
+    /*
+     * Run only if we have initialized shared memory and gsXidMap
+     * is enabled.
+     */
+    if (IsNormalProcessingMode() && global_snapshot_defer_time > 0)
+    {
+        int i;
+
+        Assert(TransactionIdIsValid(oldestActiveXID));
+        for (i = 0; i < gsXidMap->size; i++)
+            gsXidMap->xmin_by_second[i] = oldestActiveXID;
+        ProcArraySetGlobalSnapshotXmin(oldestActiveXID);
+    }
+}
+
+/*
+ * GlobalSnapshotMapXmin
+ *
+ * Maintain a circular buffer of oldestXmins for the last several seconds.
+ * This buffer allows shifting oldestXmin into the past when a backend is
+ * importing a global transaction.  Otherwise old tuple versions that are
+ * still needed by this transaction could be recycled by other processes
+ * (vacuum, HOT pruning, etc).
+ *
+ * Locking here is not trivial.  This is called upon each snapshot creation
+ * after ProcArrayLock is released, and such usage creates several race
+ * conditions.  It is possible that the backend that got a global_csn calls
+ * GlobalSnapshotMapXmin() only after other backends have managed to take
+ * their snapshots and complete their GlobalSnapshotMapXmin() calls, or even
+ * committed.  This is safe because:
+ *
+ *      * We already hold our xmin in MyPgXact, so our snapshot will not be
+ *        harmed even though ProcArrayLock is released.
+ *
+ *      * snapshot_global_csn is always pessimistically rounded up to the
+ *        next second.
+ *
+ *      * For performance reasons, the xmin value for a particular second is
+ *        filled only once.  Because of that, instead of writing just our
+ *        own xmin to the buffer (which would be enough for our snapshot),
+ *        we write oldestXmin there -- this mitigates the possibility of
+ *        damaging someone else's snapshot by writing too advanced a value,
+ *        in case a backend that generated its csn earlier was slow and
+ *        didn't manage to insert it before us.
+ *
+ *      * If GlobalSnapshotMapXmin() finds a gap of several seconds between
+ *        the current call and the latest completed call, it fills that gap
+ *        with the latest known value instead of the new one.  Otherwise it
+ *        would be possible (however unlikely) that the same gap also
+ *        happened between taking a snapshot and calling
+ *        GlobalSnapshotMapXmin() for some backend, and we would risk
+ *        filling the circular buffer with oldestXmins that are bigger than
+ *        they actually were.
+ */
+void
+GlobalSnapshotMapXmin(GlobalCSN snapshot_global_csn)
+{
+    int offset, gap, i;
+    GlobalCSN csn_seconds;
+    GlobalCSN last_csn_seconds;
+    volatile TransactionId oldest_deferred_xmin;
+    TransactionId current_oldest_xmin, previous_oldest_xmin;
+
+    /* Callers should check config values */
+    Assert(global_snapshot_defer_time > 0);
+    Assert(gsXidMap != NULL);
+
+    /*
+     * Round up global_csn to the next second -- pessimistically and safely.
+     */
+    csn_seconds = (snapshot_global_csn / NSECS_PER_SEC + 1);
+
+    /*
+     * Fast-path check.  Avoid taking the exclusive GlobalSnapshotXidMapLock
+     * if oldestXid was already written to xmin_by_second[] for this rounded
+     * global_csn.
+     */
+    if (pg_atomic_read_u64(&gsXidMap->last_csn_seconds) >= csn_seconds)
+        return;
+
+    /* Ok, we have new entry (or entries) */
+    LWLockAcquire(GlobalSnapshotXidMapLock, LW_EXCLUSIVE);
+
+    /* Re-check last_csn_seconds under lock */
+    last_csn_seconds = pg_atomic_read_u64(&gsXidMap->last_csn_seconds);
+    if (last_csn_seconds >= csn_seconds)
+    {
+        LWLockRelease(GlobalSnapshotXidMapLock);
+        return;
+    }
+    pg_atomic_write_u64(&gsXidMap->last_csn_seconds, csn_seconds);
+
+    /*
+     * Count oldest_xmin.
+     *
+     * It would have been possible to calculate oldest_xmin during the
+     * corresponding snapshot creation, but GetSnapshotData() intentionally
+     * reads only PgXact and not PgProc.  We need the originalXmin info (see
+     * the comment for gsXidMap), which is stored in PgProc because of the
+     * warnings in the comments around PgXact against extending it with new
+     * fields.  So just calculate oldest_xmin again; this happens quite
+     * rarely anyway.
+     */
+    current_oldest_xmin = GetOldestXmin(NULL, PROCARRAY_NON_IMPORTED_XMIN);
+
+    previous_oldest_xmin = gsXidMap->xmin_by_second[gsXidMap->head];
+
+    Assert(TransactionIdIsNormal(current_oldest_xmin));
+    Assert(TransactionIdIsNormal(previous_oldest_xmin));
+
+    gap = csn_seconds - last_csn_seconds;
+    offset = csn_seconds % gsXidMap->size;
+
+    /* Sanity check before we update head and gap */
+    Assert( gap >= 1 );
+    Assert( (gsXidMap->head + gap) % gsXidMap->size == offset );
+
+    gap = gap > gsXidMap->size ? gsXidMap->size : gap;
+    gsXidMap->head = offset;
+
+    /* Fill new entry with current_oldest_xmin */
+    gsXidMap->xmin_by_second[offset] = current_oldest_xmin;
+
+    /*
+     * If we have gap then fill it with previous_oldest_xmin for reasons
+     * outlined in comment above this function.
+     */
+    for (i = 1; i < gap; i++)
+    {
+        offset = (offset + gsXidMap->size - 1) % gsXidMap->size;
+        gsXidMap->xmin_by_second[offset] = previous_oldest_xmin;
+    }
+
+    oldest_deferred_xmin =
+        gsXidMap->xmin_by_second[ (gsXidMap->head + 1) % gsXidMap->size ];
+
+    LWLockRelease(GlobalSnapshotXidMapLock);
+
+    /*
+     * Advance procArray->global_snapshot_xmin after we released
+     * GlobalSnapshotXidMapLock. Since we gather oldestXmin rather than
+     * xmin, it never goes backwards regardless of how slowly we do this.
+     */
+    Assert(TransactionIdFollowsOrEquals(oldest_deferred_xmin,
+                                        ProcArrayGetGlobalSnapshotXmin()));
+    ProcArraySetGlobalSnapshotXmin(oldest_deferred_xmin);
+}
+
+
+/*
+ * GlobalSnapshotToXmin
+ *
+ * Get oldestXmin that took place when snapshot_global_csn was taken.
+ */
+TransactionId
+GlobalSnapshotToXmin(GlobalCSN snapshot_global_csn)
+{
+    TransactionId xmin;
+    GlobalCSN csn_seconds;
+    volatile GlobalCSN last_csn_seconds;
+
+    /* Callers should check config values */
+    Assert(global_snapshot_defer_time > 0);
+    Assert(gsXidMap != NULL);
+
+    /* Round down to get conservative estimates */
+    csn_seconds = (snapshot_global_csn / NSECS_PER_SEC);
+
+    LWLockAcquire(GlobalSnapshotXidMapLock, LW_SHARED);
+    last_csn_seconds = pg_atomic_read_u64(&gsXidMap->last_csn_seconds);
+    if (csn_seconds > last_csn_seconds)
+    {
+        /* we don't have entry for this global_csn yet, return latest known */
+        xmin = gsXidMap->xmin_by_second[gsXidMap->head];
+    }
+    else if (last_csn_seconds - csn_seconds < gsXidMap->size)
+    {
+        /* we are good, retrieve value from our map */
+        Assert(last_csn_seconds % gsXidMap->size == gsXidMap->head);
+        xmin = gsXidMap->xmin_by_second[csn_seconds % gsXidMap->size];
+    }
+    else
+    {
+        /* requested global_csn is too old, let caller know */
+        xmin = InvalidTransactionId;
+    }
+    LWLockRelease(GlobalSnapshotXidMapLock);
+
+    return xmin;
+}
+
+/*
+ * GlobalSnapshotGenerate
+ *
+ * Generate a GlobalCSN, which is actually the local time.  We also force
+ * this time to be always increasing.  Since it is now not uncommon to have
+ * millions of read transactions per second, we try to use nanoseconds if
+ * such time resolution is available.
+ */
+GlobalCSN
+GlobalSnapshotGenerate(bool locked)
+{
+    instr_time    current_time;
+    GlobalCSN    global_csn;
+
+    Assert(track_global_snapshots || global_snapshot_defer_time > 0);
+
+    /*
+     * TODO: create some macro that adds a small random shift to the current
+     * time.
+     */
+    INSTR_TIME_SET_CURRENT(current_time);
+    global_csn = (GlobalCSN) INSTR_TIME_GET_NANOSEC(current_time);
+
+    /* TODO: change to atomics? */
+    if (!locked)
+        SpinLockAcquire(&gsState->lock);
+
+    if (global_csn <= gsState->last_global_csn)
+        global_csn = ++gsState->last_global_csn;
+    else
+        gsState->last_global_csn = global_csn;
+
+    if (!locked)
+        SpinLockRelease(&gsState->lock);
+
+    return global_csn;
+}
+
+/*
+ * GlobalSnapshotSync
+ *
+ * Due to time desynchronization between nodes we can receive a global_csn
+ * that is greater than the current global_csn on this node. To preserve
+ * proper isolation this node needs to wait until such a global_csn is
+ * reached on the local clock.
+ *
+ * This should happen relatively rarely if the nodes run NTP/PTP/etc.
+ * Complain if the wait time exceeds SNAP_DESYNC_COMPLAIN.
+ */
+void
+GlobalSnapshotSync(GlobalCSN remote_gcsn)
+{
+    GlobalCSN    local_gcsn;
+    GlobalCSN    delta;
+
+    Assert(track_global_snapshots);
+
+    for(;;)
+    {
+        SpinLockAcquire(&gsState->lock);
+        if (gsState->last_global_csn > remote_gcsn)
+        {
+            /* Everything is fine */
+            SpinLockRelease(&gsState->lock);
+            return;
+        }
+        else if ((local_gcsn = GlobalSnapshotGenerate(true)) >= remote_gcsn)
+        {
+            /*
+             * Everything is fine too, but last_global_csn wasn't updated for
+             * some time.
+             */
+            SpinLockRelease(&gsState->lock);
+            return;
+        }
+        SpinLockRelease(&gsState->lock);
+
+        /* Okay we need to sleep now */
+        delta = remote_gcsn - local_gcsn;
+        if (delta > SNAP_DESYNC_COMPLAIN)
+            ereport(WARNING,
+                (errmsg("remote global snapshot exceeds ours by more than a second"),
+                 errhint("Consider running NTP daemons on the servers participating in global transactions.")));
+
+        /* TODO: report this sleeptime somewhere? */
+        pg_usleep((long) (delta/NSECS_PER_USEC));
+
+        /*
+         * Loop back to check that we actually slept for the specified
+         * amount of time.
+         */
+    }
+
+    Assert(false); /* Should not happen */
+    return;
+}
+
+/*
+ * TransactionIdGetGlobalCSN
+ *
+ * Get GlobalCSN for specified TransactionId taking care about special xids,
+ * xids beyond TransactionXmin and InDoubt states.
+ */
+GlobalCSN
+TransactionIdGetGlobalCSN(TransactionId xid)
+{
+    GlobalCSN global_csn;
+
+    Assert(track_global_snapshots);
+
+    /* Handle permanent TransactionId's for which we don't have mapping */
+    if (!TransactionIdIsNormal(xid))
+    {
+        if (xid == InvalidTransactionId)
+            return AbortedGlobalCSN;
+        if (xid == FrozenTransactionId || xid == BootstrapTransactionId)
+            return FrozenGlobalCSN;
+        Assert(false); /* Should not happen */
+    }
+
+    /*
+     * For xids that precede TransactionXmin, the GlobalCSNLog may already be
+     * trimmed, but we know that such a transaction is definitely not
+     * concurrently running according to any snapshot, including timetravel
+     * ones. Callers should check TransactionIdDidCommit afterwards.
+     */
+    if (TransactionIdPrecedes(xid, TransactionXmin))
+        return FrozenGlobalCSN;
+
+    /* Read GlobalCSN from SLRU */
+    global_csn = GlobalCSNLogGetCSN(xid);
+
+    /*
+     * If we encountered the InDoubt state then the transaction is being
+     * committed and we should wait until the GlobalCSN is assigned, so that
+     * the visibility check can decide whether the tuple is in the snapshot.
+     * See also the comments in GlobalSnapshotPrecommit().
+     */
+    if (GlobalCSNIsInDoubt(global_csn))
+    {
+        XactLockTableWait(xid, NULL, NULL, XLTW_None);
+        global_csn = GlobalCSNLogGetCSN(xid);
+        Assert(GlobalCSNIsNormal(global_csn) ||
+                GlobalCSNIsAborted(global_csn));
+    }
+
+    Assert(GlobalCSNIsNormal(global_csn) ||
+            GlobalCSNIsInProgress(global_csn) ||
+            GlobalCSNIsAborted(global_csn));
+
+    return global_csn;
+}
+
+/*
+ * XidInvisibleInGlobalSnapshot
+ *
+ * Version of XidInMVCCSnapshot for global transactions. For non-imported
+ * global snapshots this should give the same results as XidInLocalMVCCSnapshot
+ * (except that aborts will be shown as invisible without consulting clog); to
+ * ensure that behaviour, XidInMVCCSnapshot carries asserts checking that
+ * XidInvisibleInGlobalSnapshot and XidInLocalMVCCSnapshot agree for ordinary
+ * snapshots.
+ */
+bool
+XidInvisibleInGlobalSnapshot(TransactionId xid, Snapshot snapshot)
+{
+    GlobalCSN csn;
+
+    Assert(track_global_snapshots);
+
+    csn = TransactionIdGetGlobalCSN(xid);
+
+    if (GlobalCSNIsNormal(csn))
+    {
+        if (csn < snapshot->global_csn)
+            return false;
+        else
+            return true;
+    }
+    else if (GlobalCSNIsFrozen(csn))
+    {
+        /* It is bootstrap or frozen transaction */
+        return false;
+    }
+    else
+    {
+        /* It is aborted or in-progress */
+        Assert(GlobalCSNIsAborted(csn) || GlobalCSNIsInProgress(csn));
+        if (GlobalCSNIsAborted(csn))
+            Assert(TransactionIdDidAbort(xid));
+        return true;
+    }
+}
+
+
+/*****************************************************************************
+ * Functions to handle distributed commit on transaction coordinator:
+ * GlobalSnapshotPrepareCurrent() / GlobalSnapshotAssignCsnCurrent().
+ * Corresponding functions for remote nodes are defined in twophase.c:
+ * pg_global_snapshot_prepare/pg_global_snapshot_assign.
+ *****************************************************************************/
+
+
+/*
+ * GlobalSnapshotPrepareCurrent
+ *
+ * Set InDoubt state for the currently active transaction and return the
+ * commit's global snapshot.
+ */
+GlobalCSN
+GlobalSnapshotPrepareCurrent(void)
+{
+    TransactionId xid = GetCurrentTransactionIdIfAny();
+
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("could not prepare transaction for global commit"),
+                errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                        "track_global_snapshots")));
+
+    if (TransactionIdIsValid(xid))
+    {
+        TransactionId *subxids;
+        int nsubxids = xactGetCommittedChildren(&subxids);
+        GlobalCSNLogSetCSN(xid, nsubxids,
+                                    subxids, InDoubtGlobalCSN);
+    }
+
+    /* Nothing to write if we don't have an xid */
+
+    return GlobalSnapshotGenerate(false);
+}
+
+/*
+ * GlobalSnapshotAssignCsnCurrent
+ *
+ * Assign a GlobalCSN to the currently active transaction. The GlobalCSN is
+ * expected to be the maximum of the values returned by
+ * GlobalSnapshotPrepareCurrent and pg_global_snapshot_prepare.
+ */
+void
+GlobalSnapshotAssignCsnCurrent(GlobalCSN global_csn)
+{
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("could not prepare transaction for global commit"),
+                errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                        "track_global_snapshots")));
+
+    if (!GlobalCSNIsNormal(global_csn))
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("pg_global_snapshot_assign expects normal global_csn")));
+
+    /* Skip empty transactions */
+    if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+        return;
+
+    /* Set global_csn and prevent ProcArrayEndTransaction from assigning one */
+    pg_atomic_write_u64(&MyProc->assignedGlobalCsn, global_csn);
+}
+
+
+/*****************************************************************************
+ * Functions to handle global and local transactions commit.
+ *
+ * For local transactions GlobalSnapshotPrecommit sets InDoubt state before
+ * ProcArrayEndTransaction is called and transaction data potentially becomes
+ * visible to other backends. ProcArrayEndTransaction (or ProcArrayRemove in
+ * the twophase case) then acquires the global_csn under the ProcArray lock
+ * and stores it in proc->assignedGlobalCsn. It's important that the
+ * global_csn for commit is generated under the ProcArray lock, otherwise
+ * global and local snapshots won't be equivalent. A subsequent call to
+ * GlobalSnapshotCommit will write proc->assignedGlobalCsn to GlobalCSNLog.
+ *
+ * The same rules apply to global transactions, except that the global_csn is
+ * already assigned by GlobalSnapshotAssignCsnCurrent/pg_global_snapshot_assign
+ * and GlobalSnapshotPrecommit is basically a no-op.
+ *
+ * GlobalSnapshotAbort is slightly different from commit because an abort can
+ * skip the InDoubt phase and can be called for a transaction subtree.
+ *****************************************************************************/
+
+
+/*
+ * GlobalSnapshotAbort
+ *
+ * Abort the transaction in GlobalCsnLog. We can skip the InDoubt state for
+ * aborts since no concurrent transaction is allowed to see aborted data
+ * anyway.
+ */
+void
+GlobalSnapshotAbort(PGPROC *proc, TransactionId xid,
+                    int nsubxids, TransactionId *subxids)
+{
+    if (!track_global_snapshots)
+        return;
+
+    GlobalCSNLogSetCSN(xid, nsubxids, subxids, AbortedGlobalCSN);
+
+    /*
+     * Clean assignedGlobalCsn anyway, as it was possibly set in
+     * GlobalSnapshotAssignCsnCurrent.
+     */
+    pg_atomic_write_u64(&proc->assignedGlobalCsn, InProgressGlobalCSN);
+}
+
+/*
+ * GlobalSnapshotPrecommit
+ *
+ * Set InDoubt status for the local transaction that we are going to commit.
+ * This step is needed to achieve consistency between local snapshots and
+ * global csn-based snapshots. We don't hold the ProcArray lock while writing
+ * the csn for the transaction to the SLRU; instead we set InDoubt status
+ * before the transaction is deleted from the ProcArray, so readers that read
+ * the csn in the gap between ProcArray removal and GlobalCSN assignment can
+ * wait until the GlobalCSN is finally assigned. See also
+ * TransactionIdGetGlobalCSN().
+ *
+ * For global transactions this does nothing, as the InDoubt state was
+ * written earlier.
+ *
+ * This should be called only by the parallel group leader, before the
+ * backend is deleted from the ProcArray.
+ */
+void
+GlobalSnapshotPrecommit(PGPROC *proc, TransactionId xid,
+                    int nsubxids, TransactionId *subxids)
+{
+    GlobalCSN oldAssignedGlobalCsn = InProgressGlobalCSN;
+    bool in_progress;
+
+    if (!track_global_snapshots)
+        return;
+
+    /* Set InDoubt status if it is local transaction */
+    in_progress = pg_atomic_compare_exchange_u64(&proc->assignedGlobalCsn,
+                                                 &oldAssignedGlobalCsn,
+                                                 InDoubtGlobalCSN);
+    if (in_progress)
+    {
+        Assert(GlobalCSNIsInProgress(oldAssignedGlobalCsn));
+        GlobalCSNLogSetCSN(xid, nsubxids,
+                           subxids, InDoubtGlobalCSN);
+    }
+    else
+    {
+        /* Otherwise we should have valid GlobalCSN by this time */
+        Assert(GlobalCSNIsNormal(oldAssignedGlobalCsn));
+        /* Also global transaction should already be in InDoubt state */
+        Assert(GlobalCSNIsInDoubt(GlobalCSNLogGetCSN(xid)));
+    }
+}
+
+/*
+ * GlobalSnapshotCommit
+ *
+ * Write the GlobalCSN that was acquired earlier to GlobalCsnLog. Should be
+ * preceded by GlobalSnapshotPrecommit() so readers can wait until we have
+ * finally finished writing to the SLRU.
+ *
+ * Should be called after ProcArrayEndTransaction, but before releasing
+ * transaction locks, so that TransactionIdGetGlobalCSN can wait on this
+ * lock for GlobalCSN.
+ */
+void
+GlobalSnapshotCommit(PGPROC *proc, TransactionId xid,
+                    int nsubxids, TransactionId *subxids)
+{
+    volatile GlobalCSN assigned_global_csn;
+
+    if (!track_global_snapshots)
+        return;
+
+    if (!TransactionIdIsValid(xid))
+    {
+        assigned_global_csn = pg_atomic_read_u64(&proc->assignedGlobalCsn);
+        Assert(GlobalCSNIsInProgress(assigned_global_csn));
+        return;
+    }
+
+    /* Finally write resulting GlobalCSN in SLRU */
+    assigned_global_csn = pg_atomic_read_u64(&proc->assignedGlobalCsn);
+    Assert(GlobalCSNIsNormal(assigned_global_csn));
+    GlobalCSNLogSetCSN(xid, nsubxids,
+                           subxids, assigned_global_csn);
+
+    /* Reset for next transaction */
+    pg_atomic_write_u64(&proc->assignedGlobalCsn, InProgressGlobalCSN);
+}
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 3aee5e50c5..2fe8b93617 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/global_snapshot.h"
 #include "access/global_csn_log.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
@@ -1526,9 +1527,35 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
                                        hdr->nabortrels, abortrels,
                                        gid);
 
+    /*
+     * GlobalSnapshot callbacks that should be called right before the
+     * transaction becomes visible. Details are in the comments for these
+     * functions.
+     */
+    if (isCommit)
+        GlobalSnapshotPrecommit(proc, xid, hdr->nsubxacts, children);
+    else
+        GlobalSnapshotAbort(proc, xid, hdr->nsubxacts, children);
+
+
     ProcArrayRemove(proc, latestXid);
 
     /*
+     * Stamp our transaction with the GlobalCSN in GlobalCsnLog. This should
+     * be done after ProcArrayRemove, but before releasing transaction locks,
+     * since TransactionIdGetGlobalCSN relies on XactLockTableWait to await
+     * the global_csn.
+     */
+    if (isCommit)
+    {
+        GlobalSnapshotCommit(proc, xid, hdr->nsubxacts, children);
+    }
+    else
+    {
+        Assert(GlobalCSNIsInProgress(
+                   pg_atomic_read_u64(&proc->assignedGlobalCsn)));
+    }
+
+    /*
      * In case we fail while running the callbacks, mark the gxact invalid so
      * no one else will try to commit/rollback, and so it will be recycled if
      * we fail after this point.  It is still locked by our backend so it
@@ -2513,3 +2540,132 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 
     return;
 }
+
+/*
+ * GlobalSnapshotPrepareTwophase
+ *
+ * Set InDoubt state for the specified prepared transaction and return the
+ * commit's global snapshot.
+ *
+ * This function is a counterpart of GlobalSnapshotPrepareCurrent() for
+ * twophase transactions.
+ */
+static GlobalCSN
+GlobalSnapshotPrepareTwophase(const char *gid)
+{
+    GlobalTransaction gxact;
+    PGXACT       *pgxact;
+    char       *buf;
+    TransactionId xid;
+    xl_xact_parsed_prepare parsed;
+
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("could not prepare transaction for global commit"),
+                errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                        "track_global_snapshots")));
+
+    /*
+     * Validate the GID, and lock the GXACT to ensure that two backends do not
+     * try to access the same GID at once.
+     */
+    gxact = LockGXact(gid, GetUserId());
+    pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+    xid = pgxact->xid;
+
+    if (gxact->ondisk)
+        buf = ReadTwoPhaseFile(xid, true);
+    else
+        XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
+
+    ParsePrepareRecord(0, buf, &parsed);
+
+    GlobalCSNLogSetCSN(xid, parsed.nsubxacts,
+                    parsed.subxacts, InDoubtGlobalCSN);
+
+    /* Unlock our GXACT */
+    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+    gxact->locking_backend = InvalidBackendId;
+    LWLockRelease(TwoPhaseStateLock);
+
+    pfree(buf);
+
+    return GlobalSnapshotGenerate(false);
+}
+
+/*
+ * SQL interface to GlobalSnapshotPrepareTwophase()
+ *
+ * TODO: Rewrite this as PREPARE TRANSACTION 'gid' RETURNING SNAPSHOT
+ */
+Datum
+pg_global_snapshot_prepare(PG_FUNCTION_ARGS)
+{
+    const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+    GlobalCSN    global_csn;
+
+    global_csn = GlobalSnapshotPrepareTwophase(gid);
+
+    PG_RETURN_INT64(global_csn);
+}
+
+
+/*
+ * GlobalSnapshotAssignCsnTwoPhase
+ *
+ * Assign a GlobalCSN to the specified prepared transaction. The GlobalCSN is
+ * expected to be the maximum of the values returned by
+ * GlobalSnapshotPrepareCurrent and pg_global_snapshot_prepare.
+ *
+ * This function is a counterpart of GlobalSnapshotAssignCsnCurrent() for
+ * twophase transactions.
+ */
+static void
+GlobalSnapshotAssignCsnTwoPhase(const char *gid, GlobalCSN global_csn)
+{
+    GlobalTransaction gxact;
+    PGPROC       *proc;
+
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("could not prepare transaction for global commit"),
+                errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                        "track_global_snapshots")));
+
+    if (!GlobalCSNIsNormal(global_csn))
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("pg_global_snapshot_assign expects normal global_csn")));
+
+    /*
+     * Validate the GID, and lock the GXACT to ensure that two backends do not
+     * try to access the same GID at once.
+     */
+    gxact = LockGXact(gid, GetUserId());
+    proc = &ProcGlobal->allProcs[gxact->pgprocno];
+
+    /* Set global_csn and prevent ProcArrayRemove from assigning one. */
+    pg_atomic_write_u64(&proc->assignedGlobalCsn, global_csn);
+
+    /* Unlock our GXACT */
+    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+    gxact->locking_backend = InvalidBackendId;
+    LWLockRelease(TwoPhaseStateLock);
+}
+
+/*
+ * SQL interface to GlobalSnapshotAssignCsnTwoPhase()
+ *
+ * TODO: Rewrite this as COMMIT PREPARED 'gid' SNAPSHOT 'global_csn'
+ */
+Datum
+pg_global_snapshot_assign(PG_FUNCTION_ARGS)
+{
+    const char *gid = text_to_cstring(PG_GETARG_TEXT_PP(0));
+    GlobalCSN    global_csn = PG_GETARG_INT64(1);
+
+    GlobalSnapshotAssignCsnTwoPhase(gid, global_csn);
+    PG_RETURN_VOID();
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9aa63c8792..0086adadf1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/global_snapshot.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1341,6 +1342,14 @@ RecordTransactionCommit(void)
 
     /* Reset XactLastRecEnd until the next transaction writes something */
     XactLastRecEnd = 0;
+
+    /*
+     * Mark our transaction as InDoubt in GlobalCsnLog and get ready for
+     * commit.
+     */
+    if (markXidCommitted)
+        GlobalSnapshotPrecommit(MyProc, xid, nchildren, children);
+
 cleanup:
     /* Clean up local data */
     if (rels)
@@ -1602,6 +1611,11 @@ RecordTransactionAbort(bool isSubXact)
      */
     TransactionIdAbortTree(xid, nchildren, children);
 
+    /*
+     * Mark our transaction as Aborted in GlobalCsnLog.
+     */
+    GlobalSnapshotAbort(MyProc, xid, nchildren, children);
+
     END_CRIT_SECTION();
 
     /* Compute latestXid while we have the child XIDs handy */
@@ -2060,6 +2074,21 @@ CommitTransaction(void)
     ProcArrayEndTransaction(MyProc, latestXid);
 
     /*
+     * Stamp our transaction with GlobalCSN in GlobalCsnLog.
+     * Should be called after ProcArrayEndTransaction, but before releasing
+     * transaction locks.
+     */
+    if (!is_parallel_worker)
+    {
+        TransactionId  xid = GetTopTransactionIdIfAny();
+        TransactionId *subxids;
+        int               nsubxids;
+
+        nsubxids = xactGetCommittedChildren(&subxids);
+        GlobalSnapshotCommit(MyProc, xid, nsubxids, subxids);
+    }
+
+    /*
      * This is all post-commit cleanup.  Note that if an error is raised here,
      * it's too late to abort the transaction.  This should be just
      * noncritical resource releasing.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ca0d934c76..edb0d07aca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7070,6 +7070,7 @@ StartupXLOG(void)
             StartupCLOG();
             StartupGlobalCSNLog(oldestActiveXID);
             StartupSUBTRANS(oldestActiveXID);
+            GlobalSnapshotStartup(oldestActiveXID);
 
             /*
              * If we're beginning at a shutdown checkpoint, we know that
@@ -7869,6 +7870,7 @@ StartupXLOG(void)
         StartupCLOG();
         StartupGlobalCSNLog(oldestActiveXID);
         StartupSUBTRANS(oldestActiveXID);
+        GlobalSnapshotStartup(oldestActiveXID);
     }
 
     /*
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2af468fc6a..3a04f6a824 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -22,6 +22,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/global_snapshot.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -147,6 +148,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, WalSndShmemSize());
         size = add_size(size, WalRcvShmemSize());
         size = add_size(size, ApplyLauncherShmemSize());
+        size = add_size(size, GlobalSnapshotShmemSize());
         size = add_size(size, SnapMgrShmemSize());
         size = add_size(size, BTreeShmemSize());
         size = add_size(size, SyncScanShmemSize());
@@ -273,6 +275,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    GlobalSnapshotShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 64ab249615..fb7f74d4cd 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -47,6 +47,7 @@
 
 #include "access/clog.h"
 #include "access/global_csn_log.h"
+#include "access/global_snapshot.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -91,6 +92,8 @@ typedef struct ProcArrayStruct
     TransactionId replication_slot_xmin;
     /* oldest catalog xmin of any replication slot */
     TransactionId replication_slot_catalog_xmin;
+    /* xmin of oldest active global snapshot */
+    TransactionId global_snapshot_xmin;
 
     /* indexes into allPgXact[], has PROCARRAY_MAXPROCS entries */
     int            pgprocnos[FLEXIBLE_ARRAY_MEMBER];
@@ -246,6 +249,7 @@ CreateSharedProcArray(void)
         procArray->lastOverflowedXid = InvalidTransactionId;
         procArray->replication_slot_xmin = InvalidTransactionId;
         procArray->replication_slot_catalog_xmin = InvalidTransactionId;
+        procArray->global_snapshot_xmin = InvalidTransactionId;
     }
 
     allProcs = ProcGlobal->allProcs;
@@ -352,6 +356,17 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
         if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
                                   latestXid))
             ShmemVariableCache->latestCompletedXid = latestXid;
+
+        /*
+         * Assign the global csn while holding ProcArrayLock for a non-global
+         * COMMIT PREPARED. After the lock is released, a subsequent
+         * GlobalSnapshotCommit() will write this value to GlobalCsnLog.
+         *
+         * In case of a global commit, proc->assignedGlobalCsn was already
+         * set by a prior GlobalSnapshotAssignCsnTwoPhase().
+         */
+        if (GlobalCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedGlobalCsn)))
+            pg_atomic_write_u64(&proc->assignedGlobalCsn, GlobalSnapshotGenerate(false));
     }
     else
     {
@@ -432,6 +447,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
         proc->lxid = InvalidLocalTransactionId;
         pgxact->xmin = InvalidTransactionId;
+        proc->originalXmin = InvalidTransactionId;
+
         /* must be cleared with xid/xmin: */
         pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
         pgxact->delayChkpt = false; /* be sure this is cleared in abort */
@@ -454,6 +471,8 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     pgxact->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     pgxact->xmin = InvalidTransactionId;
+    proc->originalXmin = InvalidTransactionId;
+
     /* must be cleared with xid/xmin: */
     pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
     pgxact->delayChkpt = false; /* be sure this is cleared in abort */
@@ -467,6 +486,20 @@ ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
     if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
                               latestXid))
         ShmemVariableCache->latestCompletedXid = latestXid;
+
+    /*
+     * Assign the global csn while holding ProcArrayLock for a non-global
+     * COMMIT. After the lock is released, a subsequent GlobalSnapshotCommit()
+     * will write this value to GlobalCsnLog.
+     *
+     * In case of a global commit, MyProc->assignedGlobalCsn was already set
+     * by a prior GlobalSnapshotAssignCsnCurrent().
+     *
+     * TODO: in case of group commit we could generate one GlobalCSN for the
+     * whole group to save time on timestamp acquisition.
+     */
+    if (GlobalCSNIsInDoubt(pg_atomic_read_u64(&proc->assignedGlobalCsn)))
+        pg_atomic_write_u64(&proc->assignedGlobalCsn, GlobalSnapshotGenerate(false));
 }
 
 /*
@@ -616,6 +649,7 @@ ProcArrayClearTransaction(PGPROC *proc)
     pgxact->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     pgxact->xmin = InvalidTransactionId;
+    proc->originalXmin = InvalidTransactionId;
     proc->recoveryConflictPending = false;
 
     /* redundant, but just in case */
@@ -1320,6 +1354,7 @@ GetOldestXmin(Relation rel, int flags)
 
     volatile TransactionId replication_slot_xmin = InvalidTransactionId;
     volatile TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+    volatile TransactionId global_snapshot_xmin = InvalidTransactionId;
 
     /*
      * If we're not computing a relation specific limit, or if a shared
@@ -1356,8 +1391,9 @@ GetOldestXmin(Relation rel, int flags)
             proc->databaseId == MyDatabaseId ||
             proc->databaseId == 0)    /* always include WalSender */
         {
-            /* Fetch xid just once - see GetNewTransactionId */
+            /* Fetch both xids just once - see GetNewTransactionId */
             TransactionId xid = pgxact->xid;
+            TransactionId original_xmin = proc->originalXmin;
 
             /* First consider the transaction's own Xid, if any */
             if (TransactionIdIsNormal(xid) &&
@@ -1370,8 +1406,17 @@ GetOldestXmin(Relation rel, int flags)
              * We must check both Xid and Xmin because a transaction might
              * have an Xmin but not (yet) an Xid; conversely, if it has an
              * Xid, that could determine some not-yet-set Xmin.
+             *
+             * When calculating oldestXmin for GlobalSnapshotMapXmin(),
+             * pgxact->xmin should be replaced with proc->originalXmin.
+             * Details are in the comments to GlobalSnapshotMapXmin.
              */
-            xid = pgxact->xmin; /* Fetch just once */
+            if ((flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+                    TransactionIdIsValid(original_xmin))
+                xid = original_xmin;
+            else
+                xid = pgxact->xmin; /* Fetch just once */
+
             if (TransactionIdIsNormal(xid) &&
                 TransactionIdPrecedes(xid, result))
                 result = xid;
@@ -1381,6 +1426,7 @@ GetOldestXmin(Relation rel, int flags)
     /* fetch into volatile var while ProcArrayLock is held */
     replication_slot_xmin = procArray->replication_slot_xmin;
     replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+    global_snapshot_xmin = procArray->global_snapshot_xmin;
 
     if (RecoveryInProgress())
     {
@@ -1422,6 +1468,11 @@ GetOldestXmin(Relation rel, int flags)
             result = FirstNormalTransactionId;
     }
 
+    if (!(flags & PROCARRAY_NON_IMPORTED_XMIN) &&
+        TransactionIdIsValid(global_snapshot_xmin) &&
+        NormalTransactionIdPrecedes(global_snapshot_xmin, result))
+        result = global_snapshot_xmin;
+
     /*
      * Check whether there are replication slots requiring an older xmin.
      */
@@ -1517,6 +1568,8 @@ GetSnapshotData(Snapshot snapshot)
     bool        suboverflowed = false;
     volatile TransactionId replication_slot_xmin = InvalidTransactionId;
     volatile TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
+    volatile TransactionId global_snapshot_xmin = InvalidTransactionId;
+    volatile GlobalCSN       global_csn = FrozenGlobalCSN;
 
     Assert(snapshot != NULL);
 
@@ -1705,10 +1758,18 @@ GetSnapshotData(Snapshot snapshot)
     /* fetch into volatile var while ProcArrayLock is held */
     replication_slot_xmin = procArray->replication_slot_xmin;
     replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
+    global_snapshot_xmin = procArray->global_snapshot_xmin;
 
     if (!TransactionIdIsValid(MyPgXact->xmin))
         MyPgXact->xmin = TransactionXmin = xmin;
 
+    /*
+     * Take the GlobalCSN under ProcArrayLock so that local and global
+     * snapshots stay synchronized.
+     */
+    if (track_global_snapshots)
+        global_csn = GlobalSnapshotGenerate(false);
+
     LWLockRelease(ProcArrayLock);
 
     /*
@@ -1724,6 +1785,10 @@ GetSnapshotData(Snapshot snapshot)
     if (!TransactionIdIsNormal(RecentGlobalXmin))
         RecentGlobalXmin = FirstNormalTransactionId;
 
+    if (TransactionIdIsValid(global_snapshot_xmin) &&
+        TransactionIdPrecedes(global_snapshot_xmin, RecentGlobalXmin))
+        RecentGlobalXmin = global_snapshot_xmin;
+
     /* Check whether there's a replication slot requiring an older xmin. */
     if (TransactionIdIsValid(replication_slot_xmin) &&
         NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
@@ -1779,6 +1844,12 @@ GetSnapshotData(Snapshot snapshot)
         MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
     }
 
+    snapshot->imported_global_csn = false;
+    snapshot->global_csn = global_csn;
+    /* if (global_snapshot_defer_time > 0 && IsNormalProcessingMode()) */
+    if (global_snapshot_defer_time > 0 && IsUnderPostmaster)
+        GlobalSnapshotMapXmin(snapshot->global_csn);
+
     return snapshot;
 }
 
@@ -3007,6 +3078,24 @@ ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
     LWLockRelease(ProcArrayLock);
 }
 
+/*
+ * ProcArraySetGlobalSnapshotXmin
+ */
+void
+ProcArraySetGlobalSnapshotXmin(TransactionId xmin)
+{
+    /* We rely on atomic fetch/store of xid */
+    procArray->global_snapshot_xmin = xmin;
+}
+
+/*
+ * ProcArrayGetGlobalSnapshotXmin
+ */
+TransactionId
+ProcArrayGetGlobalSnapshotXmin(void)
+{
+    return procArray->global_snapshot_xmin;
+}
 
 #define XidCacheRemove(i) \
     do { \
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index 9615058f29..8ef731d560 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -51,3 +51,4 @@ BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
 GlobalCSNLogControlLock                46
+GlobalSnapshotXidMapLock            47
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e082b2..ed497185be 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -37,6 +37,7 @@
 
 #include "access/transam.h"
 #include "access/twophase.h"
+#include "access/global_snapshot.h"
 #include "access/xact.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -417,6 +418,9 @@ InitProcess(void)
     MyProc->clogGroupMemberLsn = InvalidXLogRecPtr;
     pg_atomic_init_u32(&MyProc->clogGroupNext, INVALID_PGPROCNO);
 
+    MyProc->originalXmin = InvalidTransactionId;
+    pg_atomic_init_u64(&MyProc->assignedGlobalCsn, InProgressGlobalCSN);
+
     /*
      * Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
      * on it.  That allows us to repoint the process latch, which so far
@@ -559,6 +563,7 @@ InitAuxiliaryProcess(void)
     MyProc->lwWaitMode = 0;
     MyProc->waitLock = NULL;
     MyProc->waitProcLock = NULL;
+    MyProc->originalXmin = InvalidTransactionId;
 #ifdef USE_ASSERT_CHECKING
     {
         int            i;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2701528c55..434b672ee2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -28,6 +28,7 @@
 
 #include "access/commit_ts.h"
 #include "access/gin.h"
+#include "access/global_snapshot.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -1024,7 +1025,7 @@ static struct config_bool ConfigureNamesBool[] =
             gettext_noop("Used to achieve REPEATEBLE READ isolation level for postgres_fdw transactions.")
         },
         &track_global_snapshots,
-        true, /* XXX: set true to simplify tesing. XXX2: Seems that RESOURCES_MEM isn't the best catagory */
        false, /* XXX: Seems that RESOURCES_MEM isn't the best category */
         NULL, NULL, NULL
     },
     {
@@ -2349,6 +2350,16 @@ static struct config_int ConfigureNamesInt[] =
         NULL, NULL, NULL
     },
 
+    {
+        {"global_snapshot_defer_time", PGC_POSTMASTER, REPLICATION_MASTER,
+            gettext_noop("Minimum age of records that are allowed to be vacuumed, in seconds."),
+            NULL
+        },
+        &global_snapshot_defer_time,
+        5, 0, INT_MAX,
+        NULL, NULL, NULL
+    },
+
     /*
      * See also CheckRequiredParameterValues() if this parameter changes
      */
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c0d3fb8491..ed1237f5c0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -254,6 +254,8 @@
                 # and comma-separated list of application_name
                 # from standby(s); '*' = all
 #vacuum_defer_cleanup_age = 0    # number of xacts by which cleanup is delayed
+#global_snapshot_defer_time = 0    # minimum age of records that are allowed
+                # to be vacuumed, in seconds
 
 # - Standby Servers -
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index edf59efc29..d4cf0710fc 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -48,6 +48,7 @@
 #include <sys/stat.h>
 #include <unistd.h>
 
+#include "access/global_snapshot.h"
 #include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlog.h"
@@ -245,6 +246,8 @@ typedef struct SerializedSnapshotData
     CommandId    curcid;
     TimestampTz whenTaken;
     XLogRecPtr    lsn;
+    GlobalCSN    global_csn;
+    bool        imported_global_csn;
 } SerializedSnapshotData;
 
 Size
@@ -992,7 +995,9 @@ SnapshotResetXmin(void)
                                         pairingheap_first(&RegisteredSnapshots));
 
     if (TransactionIdPrecedes(MyPgXact->xmin, minSnapshot->xmin))
+    {
         MyPgXact->xmin = minSnapshot->xmin;
+    }
 }
 
 /*
@@ -2081,6 +2086,8 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
     serialized_snapshot.curcid = snapshot->curcid;
     serialized_snapshot.whenTaken = snapshot->whenTaken;
     serialized_snapshot.lsn = snapshot->lsn;
+    serialized_snapshot.global_csn = snapshot->global_csn;
+    serialized_snapshot.imported_global_csn = snapshot->imported_global_csn;
 
     /*
      * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2155,6 +2162,8 @@ RestoreSnapshot(char *start_address)
     snapshot->curcid = serialized_snapshot.curcid;
     snapshot->whenTaken = serialized_snapshot.whenTaken;
     snapshot->lsn = serialized_snapshot.lsn;
+    snapshot->global_csn = serialized_snapshot.global_csn;
+    snapshot->imported_global_csn = serialized_snapshot.imported_global_csn;
 
     /* Copy XIDs, if present. */
     if (serialized_snapshot.xcnt > 0)
@@ -2192,3 +2201,97 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *master_pgproc)
 {
     SetTransactionSnapshot(snapshot, NULL, InvalidPid, master_pgproc);
 }
+
+/*
+ * ExportGlobalSnapshot
+ *
+ * Export global_csn so that the caller can expand this transaction to other
+ * nodes.
+ *
+ * TODO: it would be better to do this through EXPORT/IMPORT SNAPSHOT syntax
+ * and add checks that the transaction has not yet acquired an xid, but for
+ * the current iteration of this patch I don't want to hack on the parser.
+ */
+GlobalCSN
+ExportGlobalSnapshot()
+{
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+             errmsg("could not export global snapshot"),
+             errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                     "track_global_snapshots")));
+
+    return CurrentSnapshot->global_csn;
+}
+
+/* SQL accessor to ExportGlobalSnapshot() */
+Datum
+pg_global_snapshot_export(PG_FUNCTION_ARGS)
+{
+    GlobalCSN    global_csn = ExportGlobalSnapshot();
+    PG_RETURN_UINT64(global_csn);
+}
+
+/*
+ * ImportGlobalSnapshot
+ *
+ * Import global_csn and retract this backend's xmin to the value that was
+ * in effect when this global_csn was taken.
+ *
+ * TODO: it would be better to do this through EXPORT/IMPORT SNAPSHOT syntax
+ * and add checks that the transaction has not yet acquired an xid, but for
+ * the current iteration of this patch I don't want to hack on the parser.
+ */
+void
+ImportGlobalSnapshot(GlobalCSN snap_global_csn)
+{
+    volatile TransactionId xmin;
+
+    if (!track_global_snapshots)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+             errmsg("could not import global snapshot"),
+             errhint("Make sure the configuration parameter \"%s\" is enabled.",
+                     "track_global_snapshots")));
+
+    if (global_snapshot_defer_time <= 0)
+        ereport(ERROR,
+            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+             errmsg("could not import global snapshot"),
+             errhint("Make sure the configuration parameter \"%s\" is positive.",
+                     "global_snapshot_defer_time")));
+
+    /*
+     * Call GlobalSnapshotToXmin under ProcArrayLock to avoid a situation
+     * where the resulting xmin is evicted from the map before we set it as
+     * our backend's xmin.
+     */
+    LWLockAcquire(ProcArrayLock, LW_SHARED);
+    xmin = GlobalSnapshotToXmin(snap_global_csn);
+    if (!TransactionIdIsValid(xmin))
+    {
+        LWLockRelease(ProcArrayLock);
+        elog(ERROR, "GlobalSnapshotToXmin: global snapshot too old");
+    }
+    MyProc->originalXmin = MyPgXact->xmin;
+    MyPgXact->xmin = TransactionXmin = xmin;
+    LWLockRelease(ProcArrayLock);
+
+    CurrentSnapshot->xmin = xmin; /* defuse SnapshotResetXmin() */
+    CurrentSnapshot->global_csn = snap_global_csn;
+    CurrentSnapshot->imported_global_csn = true;
+    GlobalSnapshotSync(snap_global_csn);
+
+    Assert(TransactionIdPrecedesOrEquals(RecentGlobalXmin, xmin));
+    Assert(TransactionIdPrecedesOrEquals(RecentGlobalDataXmin, xmin));
+}
+
+/* SQL accessor to ImportGlobalSnapshot() */
+Datum
+pg_global_snapshot_import(PG_FUNCTION_ARGS)
+{
+    GlobalCSN    global_csn = PG_GETARG_UINT64(0);
+    ImportGlobalSnapshot(global_csn);
+    PG_RETURN_VOID();
+}
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f7c4c9188c..f2fbc77fa8 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -63,6 +63,7 @@
 
 #include "postgres.h"
 
+#include "access/global_snapshot.h"
 #include "access/htup_details.h"
 #include "access/multixact.h"
 #include "access/subtrans.h"
@@ -1462,8 +1463,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
 }
 
 /*
- * XidInMVCCSnapshot
- *        Is the given XID still-in-progress according to the snapshot?
+ * XidInLocalMVCCSnapshot
+ *        Is the given XID still-in-progress according to the local snapshot?
  *
  * Note: GetSnapshotData never stores either top xid or subxids of our own
  * backend into a snapshot, so these xids will not be reported as "running"
@@ -1471,8 +1472,8 @@ HeapTupleIsSurelyDead(HeapTuple htup, TransactionId OldestXmin)
  * TransactionIdIsCurrentTransactionId first, except when it's known the
  * XID could not be ours anyway.
  */
-bool
-XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+static bool
+XidInLocalMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
     uint32        i;
 
@@ -1584,6 +1585,62 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 }
 
 /*
+ * XidInMVCCSnapshot
+ *
+ * Check whether this xid is in the snapshot, taking into account the fact
+ * that the snapshot can be global. When track_global_snapshots is switched
+ * off, just call XidInLocalMVCCSnapshot().
+ */
+bool
+XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
+{
+    bool in_snapshot;
+
+    if (snapshot->imported_global_csn)
+    {
+        Assert(track_global_snapshots);
+        /* No point in using any snapshot info other than the CSN */
+        return XidInvisibleInGlobalSnapshot(xid, snapshot);
+    }
+
+    in_snapshot = XidInLocalMVCCSnapshot(xid, snapshot);
+
+    if (!track_global_snapshots)
+    {
+        Assert(GlobalCSNIsFrozen(snapshot->global_csn));
+        return in_snapshot;
+    }
+
+    if (in_snapshot)
+    {
+        /*
+         * This xid may already be in an unknown state, and in that case
+         * we must wait and recheck.
+         *
+         * TODO: this check can be skipped if we know for sure that there were
+         * no global transactions when this snapshot was taken. That requires
+         * some changes to the global snapshot export/import mechanisms (if a
+         * backend sets xmin, we should know a priori whether this transaction
+         * is going to be global or local -- right now this is not enforced).
+         * Leave that for the future and don't complicate this patch.
+         */
+        return XidInvisibleInGlobalSnapshot(xid, snapshot);
+    }
+    else
+    {
+#ifdef USE_ASSERT_CHECKING
+        /* Check that global snapshot gives the same results as local one */
+        if (XidInvisibleInGlobalSnapshot(xid, snapshot))
+        {
+            GlobalCSN gcsn = TransactionIdGetGlobalCSN(xid);
+            Assert(GlobalCSNIsAborted(gcsn));
+        }
+#endif
+        return false;
+    }
+}
+
+/*
  * Is the tuple really only locked?  That is, is it not updated?
  *
  * It's easy to check just infomask bits if the locker is not a multi; but
diff --git a/src/include/access/global_snapshot.h b/src/include/access/global_snapshot.h
new file mode 100644
index 0000000000..246b180cfd
--- /dev/null
+++ b/src/include/access/global_snapshot.h
@@ -0,0 +1,72 @@
+/*-------------------------------------------------------------------------
+ *
+ * global_snapshot.h
+ *      Support for cross-node snapshot isolation.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/global_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef GLOBAL_SNAPSHOT_H
+#define GLOBAL_SNAPSHOT_H
+
+#include "port/atomics.h"
+#include "storage/lock.h"
+#include "utils/snapshot.h"
+#include "utils/guc.h"
+
+/*
+ * snapshot.h is used in frontend code, so the atomic variant of the
+ * GlobalCSN type is defined here.
+ */
+typedef pg_atomic_uint64 GlobalCSN_atomic;
+
+#define InProgressGlobalCSN     UINT64CONST(0x0)
+#define AbortedGlobalCSN     UINT64CONST(0x1)
+#define FrozenGlobalCSN         UINT64CONST(0x2)
+#define InDoubtGlobalCSN     UINT64CONST(0x3)
+#define FirstNormalGlobalCSN UINT64CONST(0x4)
+
+#define GlobalCSNIsInProgress(csn)    ((csn) == InProgressGlobalCSN)
+#define GlobalCSNIsAborted(csn)        ((csn) == AbortedGlobalCSN)
+#define GlobalCSNIsFrozen(csn)        ((csn) == FrozenGlobalCSN)
+#define GlobalCSNIsInDoubt(csn)        ((csn) == InDoubtGlobalCSN)
+#define GlobalCSNIsNormal(csn)        ((csn) >= FirstNormalGlobalCSN)
+
+
+extern int global_snapshot_defer_time;
+
+
+extern Size GlobalSnapshotShmemSize(void);
+extern void GlobalSnapshotShmemInit(void);
+extern void GlobalSnapshotStartup(TransactionId oldestActiveXID);
+
+extern void GlobalSnapshotMapXmin(GlobalCSN snapshot_global_csn);
+extern TransactionId GlobalSnapshotToXmin(GlobalCSN snapshot_global_csn);
+
+extern GlobalCSN GlobalSnapshotGenerate(bool locked);
+
+extern bool XidInvisibleInGlobalSnapshot(TransactionId xid, Snapshot snapshot);
+
+extern void GlobalSnapshotSync(GlobalCSN remote_gcsn);
+
+extern GlobalCSN TransactionIdGetGlobalCSN(TransactionId xid);
+
+extern GlobalCSN GlobalSnapshotPrepareGlobal(const char *gid);
+extern void GlobalSnapshotAssignCsnGlobal(const char *gid,
+                                          GlobalCSN global_csn);
+
+extern GlobalCSN GlobalSnapshotPrepareCurrent(void);
+extern void GlobalSnapshotAssignCsnCurrent(GlobalCSN global_csn);
+
+extern void GlobalSnapshotAbort(PGPROC *proc, TransactionId xid, int nsubxids,
+                                TransactionId *subxids);
+extern void GlobalSnapshotPrecommit(PGPROC *proc, TransactionId xid, int nsubxids,
+                                    TransactionId *subxids);
+extern void GlobalSnapshotCommit(PGPROC *proc, TransactionId xid, int nsubxids,
+                                    TransactionId *subxids);
+
+#endif                            /* GLOBAL_SNAPSHOT_H */
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 0e932daa48..f8b774f393 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -18,6 +18,7 @@
 #include "access/xact.h"
 #include "datatype/timestamp.h"
 #include "storage/lock.h"
+#include "utils/snapshot.h"
 
 /*
  * GlobalTransactionData is defined in twophase.c; other places have no
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a14651010f..e0e20c2e6c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10206,4 +10206,18 @@
   proisstrict => 'f', prorettype => 'bool', proargtypes => 'oid int4 int4 any',
   proargmodes => '{i,i,i,v}', prosrc => 'satisfies_hash_partition' },
 
+# global transaction handling
+{ oid => '3430', descr => 'export global transaction snapshot',
+  proname => 'pg_global_snapshot_export', provolatile => 'v', proparallel => 'u',
+  prorettype => 'int8', proargtypes => '', prosrc => 'pg_global_snapshot_export' },
+{ oid => '3431', descr => 'import global transaction snapshot',
+  proname => 'pg_global_snapshot_import', provolatile => 'v', proparallel => 'u',
+  prorettype => 'void', proargtypes => 'int8', prosrc => 'pg_global_snapshot_import' },
+{ oid => '3432', descr => 'prepare distributed transaction for commit, get global_csn',
+  proname => 'pg_global_snapshot_prepare', provolatile => 'v', proparallel => 'u',
+  prorettype => 'int8', proargtypes => 'text', prosrc => 'pg_global_snapshot_prepare' },
+{ oid => '3433', descr => 'assign global_csn to distributed transaction',
+  proname => 'pg_global_snapshot_assign', provolatile => 'v', proparallel => 'u',
+  prorettype => 'void', proargtypes => 'text int8', prosrc => 'pg_global_snapshot_assign' },
+
 ]
diff --git a/src/include/datatype/timestamp.h b/src/include/datatype/timestamp.h
index f5b6026ef5..75ec93b46b 100644
--- a/src/include/datatype/timestamp.h
+++ b/src/include/datatype/timestamp.h
@@ -93,6 +93,9 @@ typedef struct
 #define USECS_PER_MINUTE INT64CONST(60000000)
 #define USECS_PER_SEC    INT64CONST(1000000)
 
+#define NSECS_PER_SEC    INT64CONST(1000000000)
+#define NSECS_PER_USEC    INT64CONST(1000)
+
 /*
  * We allow numeric timezone offsets up to 15:59:59 either way from Greenwich.
  * Currently, the record holders for wackiest offsets in actual use are zones
diff --git a/src/include/fmgr.h b/src/include/fmgr.h
index 101f513ba6..3026e71f83 100644
--- a/src/include/fmgr.h
+++ b/src/include/fmgr.h
@@ -250,6 +250,7 @@ extern struct varlena *pg_detoast_datum_packed(struct varlena *datum);
 #define PG_GETARG_FLOAT4(n)  DatumGetFloat4(PG_GETARG_DATUM(n))
 #define PG_GETARG_FLOAT8(n)  DatumGetFloat8(PG_GETARG_DATUM(n))
 #define PG_GETARG_INT64(n)     DatumGetInt64(PG_GETARG_DATUM(n))
+#define PG_GETARG_UINT64(n)     DatumGetUInt64(PG_GETARG_DATUM(n))
 /* use this if you want the raw, possibly-toasted input datum: */
 #define PG_GETARG_RAW_VARLENA_P(n)    ((struct varlena *) PG_GETARG_POINTER(n))
 /* use this if you want the input datum de-toasted: */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f968444671..ebc3836c9c 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -138,6 +138,9 @@ typedef struct timespec instr_time;
 #define INSTR_TIME_GET_MICROSEC(t) \
     (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
 
+#define INSTR_TIME_GET_NANOSEC(t) \
+    (((uint64) (t).tv_sec * (uint64) 1000000000) + (uint64) ((t).tv_nsec))
+
 #else                            /* !HAVE_CLOCK_GETTIME */
 
 /* Use gettimeofday() */
@@ -202,6 +205,10 @@ typedef struct timeval instr_time;
 #define INSTR_TIME_GET_MICROSEC(t) \
     (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) (t).tv_usec)
 
+#define INSTR_TIME_GET_NANOSEC(t) \
+    (((uint64) (t).tv_sec * (uint64) 1000000000) + \
+        (uint64) (t).tv_usec * (uint64) 1000)
+
 #endif                            /* HAVE_CLOCK_GETTIME */
 
 #else                            /* WIN32 */
@@ -234,6 +241,9 @@ typedef LARGE_INTEGER instr_time;
 #define INSTR_TIME_GET_MICROSEC(t) \
     ((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
 
+#define INSTR_TIME_GET_NANOSEC(t) \
+    ((uint64) (((double) (t).QuadPart * 1000000000.0) / GetTimerFrequency()))
+
 static inline double
 GetTimerFrequency(void)
 {
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61dcf..7e24d539cc 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -15,8 +15,10 @@
 #define _PROC_H_
 
 #include "access/clog.h"
+#include "access/global_snapshot.h"
 #include "access/xlogdefs.h"
 #include "lib/ilist.h"
+#include "utils/snapshot.h"
 #include "storage/latch.h"
 #include "storage/lock.h"
 #include "storage/pg_sema.h"
@@ -57,6 +59,7 @@ struct XidCache
 #define        PROC_IN_LOGICAL_DECODING    0x10    /* currently doing logical
                                                  * decoding outside xact */
 #define        PROC_RESERVED                0x20    /* reserved for procarray */
+#define        PROC_RESERVED2                0x40    /* reserved for procarray */
 
 /* flags reset at EOXact */
 #define        PROC_VACUUM_STATE_MASK \
@@ -200,6 +203,18 @@ struct PGPROC
     PGPROC       *lockGroupLeader;    /* lock group leader, if I'm a member */
     dlist_head    lockGroupMembers;    /* list of members, if I'm a leader */
     dlist_node    lockGroupLink;    /* my member link, if I'm a member */
+
+    /*
+     * assignedGlobalCsn holds the GlobalCSN for this transaction.  It is
+     * generated under ProcArrayLock and later written to the GlobalCSNLog.
+     * This variable is defined as atomic only for the group commit case; in
+     * all other scenarios only the backend responsible for this proc entry
+     * works with this variable.
+     */
+    GlobalCSN_atomic assignedGlobalCsn;
+
+    /* Original xmin of this backend before a global snapshot was imported */
+    TransactionId originalXmin;
 };
 
 /* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index 75bab2985f..e68a87575e 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -36,6 +36,10 @@
 
 #define        PROCARRAY_SLOTS_XMIN            0x20    /* replication slot xmin,
                                                      * catalog_xmin */
+
+#define        PROCARRAY_NON_IMPORTED_XMIN        0x40    /* use originalXmin instead
+                                                     * of xmin to properly
+                                                     * maintain gsXidMap */
 /*
  * Only flags in PROCARRAY_PROC_FLAGS_MASK are considered when matching
  * PGXACT->vacuumFlags. Other flags are used for different purposes and
@@ -124,4 +128,8 @@ extern void ProcArraySetReplicationSlotXmin(TransactionId xmin,
 extern void ProcArrayGetReplicationSlotXmin(TransactionId *xmin,
                                 TransactionId *catalog_xmin);
 
+extern void ProcArraySetGlobalSnapshotXmin(TransactionId xmin);
+
+extern TransactionId ProcArrayGetGlobalSnapshotXmin(void);
+
 #endif                            /* PROCARRAY_H */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 83806f3040..1a066fd8d8 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -87,6 +87,9 @@ extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
 
+extern GlobalCSN ExportGlobalSnapshot(void);
+extern void ImportGlobalSnapshot(GlobalCSN snap_global_csn);
+
 extern void ImportSnapshot(const char *idstr);
 extern bool XactHasExportedSnapshots(void);
 extern void DeleteAllExportedSnapshotFiles(void);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 318d41e6f7..1563407824 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -115,6 +115,14 @@ typedef struct SnapshotData
 
     TimestampTz whenTaken;        /* timestamp when snapshot was taken */
     XLogRecPtr    lsn;            /* position in the WAL stream when taken */
+
+    /*
+     * GlobalCSN for cross-node snapshot isolation support.
+     * Will be used only if track_global_snapshots is enabled.
+     */
+    GlobalCSN    global_csn;
+    /* Did we generate our own global_csn or import one from a different node? */
+    bool        imported_global_csn;
 } SnapshotData;
 
 /*
-- 
2.11.0
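Before the postgres_fdw patch, here is a minimal sketch of how an external coordinator is expected to drive the pg_global_snapshot_prepare()/pg_global_snapshot_assign() functions added to pg_proc.dat above. The Node class and its integer "clock" are illustrative stand-ins for real Postgres instances, not anything from the patch; only the control flow (broadcast PREPARE, take the maximum proposed CSN, assign it everywhere) mirrors the Clock-SI-style commit implemented in pgfdw_xact_callback of the next patch.

```python
INVALID = 0  # corresponds to InProgressGlobalCSN: no CSN assigned yet

class Node:
    def __init__(self, clock):
        self.clock = clock   # node-local CSN source (a physical clock in Clock-SI)
        self.csn = {}        # gid -> assigned commit CSN

    def prepare(self, gid):
        """PREPARE TRANSACTION + pg_global_snapshot_prepare():
        mark the transaction in-doubt and return a locally generated CSN."""
        self.clock += 1
        self.csn[gid] = INVALID
        return self.clock

    def assign(self, gid, csn):
        """pg_global_snapshot_assign() + COMMIT PREPARED:
        install the globally agreed CSN, advancing the local clock so it
        never lags behind a CSN this node has acknowledged."""
        self.clock = max(self.clock, csn)
        self.csn[gid] = csn

def global_commit(gid, nodes):
    # Step 1: broadcast PREPARE, collecting each node's proposed CSN
    proposals = [n.prepare(gid) for n in nodes]
    # Step 2: the commit CSN is the maximum proposal (cf. MaxCsnCB below)
    commit_csn = max(proposals)
    # Step 3: broadcast the agreed CSN, then COMMIT PREPARED everywhere
    for n in nodes:
        n.assign(gid, commit_csn)
    return commit_csn

nodes = [Node(clock=10), Node(clock=25), Node(clock=17)]
csn = global_commit("pgfdw-demo-gid", nodes)
# All nodes now agree on one commit CSN that is >= every proposal.
```

The postgres_fdw patch that follows packages roughly this sequence into pgfdw_xact_callback, so that multi-node transactions going through the FDW get it transparently.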

From 8881170d9a7e01cb1ace337e3392ad7a74725211 Mon Sep 17 00:00:00 2001
From: Stas Kelvich <stanconn@gmail.com>
Date: Wed, 25 Apr 2018 16:39:09 +0300
Subject: [PATCH 3/3] postgres_fdw support for global snapshots

---
 contrib/postgres_fdw/Makefile                  |   9 +
 contrib/postgres_fdw/connection.c              | 292 ++++++++++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.c            |  12 +
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/t/001_bank_coordinator.pl | 264 ++++++++++++++++++++++
 contrib/postgres_fdw/t/002_bank_participant.pl | 240 ++++++++++++++++++++
 src/test/perl/PostgresNode.pm                  |  31 +++
 7 files changed, 823 insertions(+), 27 deletions(-)
 create mode 100644 contrib/postgres_fdw/t/001_bank_coordinator.pl
 create mode 100644 contrib/postgres_fdw/t/002_bank_participant.pl

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 85394b4f1f..02ae067cd0 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -23,3 +23,12 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# The global makefile will do temp-install for 'check'. Since REGRESS is
+# defined, PGXS (included from contrib-global.mk or directly) will take care
+# of adding postgres_fdw as EXTRA_INSTALL and building pg_regress. It will
+# also actually run pg_regress, so the only thing left is the TAP tests.
+check: tapcheck
+
+tapcheck: temp-install
+    $(prove_check)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..3759eaa8e1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,9 +14,11 @@
 
 #include "postgres_fdw.h"
 
+#include "access/global_snapshot.h"
 #include "access/htup_details.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
+#include "access/xlog.h" /* GetSystemIdentifier() */
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -24,6 +26,8 @@
 #include "utils/hsearch.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
+#include "utils/snapmgr.h"
+#include "utils/snapshot.h"
 #include "utils/syscache.h"
 
 
@@ -65,6 +69,21 @@ typedef struct ConnCacheEntry
  */
 static HTAB *ConnectionHash = NULL;
 
+/*
+ * FdwTransactionState
+ *
+ * Holds the number of open remote transactions and the shared state
+ * needed by all connection entries.
+ */
+typedef struct FdwTransactionState
+{
+    char        *gid;
+    int            nparticipants;
+    GlobalCSN    global_csn;
+    bool        two_phase_commit;
+} FdwTransactionState;
+static FdwTransactionState *fdwTransState;
+
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
@@ -72,6 +91,9 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
+/* counter of prepared transactions made by this backend */
+static int two_phase_xact_count = 0;
+
 /* prototypes of private functions */
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void disconnect_pg_server(ConnCacheEntry *entry);
@@ -80,6 +102,7 @@ static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
 static void begin_remote_xact(ConnCacheEntry *entry);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void deallocate_prepared_stmts(ConnCacheEntry *entry);
 static void pgfdw_subxact_callback(SubXactEvent event,
                        SubTransactionId mySubid,
                        SubTransactionId parentSubid,
@@ -136,6 +159,15 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
                                       pgfdw_inval_callback, (Datum) 0);
     }
 
+    /* allocate FdwTransactionState */
+    if (fdwTransState == NULL)
+    {
+        MemoryContext oldcxt;
+        oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+        fdwTransState = palloc0(sizeof(FdwTransactionState));
+        MemoryContextSwitchTo(oldcxt);
+    }
+
     /* Set flag that we did GetConnection during the current transaction */
     xact_got_connection = true;
 
@@ -388,7 +420,8 @@ configure_remote_session(PGconn *conn)
 }
 
 /*
- * Convenience subroutine to issue a non-data-returning SQL command to remote
+ * Convenience subroutine to issue a SQL command or a row-returning
+ * statement to the remote node.
  */
 static void
 do_sql_command(PGconn *conn, const char *sql)
@@ -398,7 +431,8 @@ do_sql_command(PGconn *conn, const char *sql)
     if (!PQsendQuery(conn, sql))
         pgfdw_report_error(ERROR, NULL, conn, false, sql);
     res = pgfdw_get_result(conn, sql);
-    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+    if (PQresultStatus(res) != PGRES_COMMAND_OK &&
+            PQresultStatus(res) != PGRES_TUPLES_OK)
         pgfdw_report_error(ERROR, res, conn, true, sql);
     PQclear(res);
 }
@@ -426,6 +460,10 @@ begin_remote_xact(ConnCacheEntry *entry)
         elog(DEBUG3, "starting remote transaction on connection %p",
              entry->conn);
 
+        if (UseGlobalSnapshots && (!IsolationUsesXactSnapshot() ||
+                                   IsolationIsSerializable()))
+            elog(ERROR, "Global snapshots support only REPEATABLE READ");
+
         if (IsolationIsSerializable())
             sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
         else
@@ -434,6 +472,23 @@ begin_remote_xact(ConnCacheEntry *entry)
         do_sql_command(entry->conn, sql);
         entry->xact_depth = 1;
         entry->changing_xact_state = false;
+
+        if (UseGlobalSnapshots)
+        {
+            char import_sql[128];
+
+            /* Export our snapshot */
+            if (fdwTransState->global_csn == 0)
+                fdwTransState->global_csn = ExportGlobalSnapshot();
+
+            snprintf(import_sql, sizeof(import_sql),
+                "SELECT pg_global_snapshot_import("UINT64_FORMAT")",
+                fdwTransState->global_csn);
+
+            do_sql_command(entry->conn, import_sql);
+        }
+
+        fdwTransState->nparticipants += 1;
     }
 
     /*
@@ -643,6 +698,94 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
         PQclear(res);
 }
 
+/* Callback typedef for BroadcastStmt */
+typedef bool (*BroadcastCmdResHandler) (PGresult *result, void *arg);
+
+/* Broadcast sql in parallel to all ConnectionHash entries */
+static bool
+BroadcastStmt(char const * sql, unsigned expectedStatus,
+                BroadcastCmdResHandler handler, void *arg)
+{
+    HASH_SEQ_STATUS scan;
+    ConnCacheEntry *entry;
+    bool        allOk = true;
+
+    /* Broadcast sql */
+    hash_seq_init(&scan, ConnectionHash);
+    while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+    {
+        pgfdw_reject_incomplete_xact_state_change(entry);
+
+        if (entry->xact_depth > 0 && entry->conn != NULL)
+        {
+            if (!PQsendQuery(entry->conn, sql))
+            {
+                PGresult   *res = PQgetResult(entry->conn);
+
+                elog(WARNING, "Failed to send command %s", sql);
+                pgfdw_report_error(WARNING, res, entry->conn, true, sql);
+                PQclear(res);
+            }
+        }
+    }
+
+    /* Collect responses */
+    hash_seq_init(&scan, ConnectionHash);
+    while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+    {
+        if (entry->xact_depth > 0 && entry->conn != NULL)
+        {
+            PGresult   *result = PQgetResult(entry->conn);
+
+            if (PQresultStatus(result) != expectedStatus ||
+                (handler && !handler(result, arg)))
+            {
+                elog(WARNING, "Failed command %s: status=%d, expected status=%d",
+                     sql, PQresultStatus(result), expectedStatus);
+                pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+                allOk = false;
+            }
+            PQclear(result);
+            PQgetResult(entry->conn);    /* consume NULL result */
+        }
+    }
+
+    return allOk;
+}
+
+/* Wrapper for broadcasting commands */
+static bool
+BroadcastCmd(char const *sql)
+{
+    return BroadcastStmt(sql, PGRES_COMMAND_OK, NULL, NULL);
+}
+
+/* Wrapper for broadcasting statements */
+static bool
+BroadcastFunc(char const *sql)
+{
+    return BroadcastStmt(sql, PGRES_TUPLES_OK, NULL, NULL);
+}
+
+/* Callback for selecting maximal csn */
+static bool
+MaxCsnCB(PGresult *result, void *arg)
+{
+    char           *resp;
+    GlobalCSN       *max_csn = (GlobalCSN *) arg;
+    GlobalCSN        csn = 0;
+
+    resp = PQgetvalue(result, 0, 0);
+
+    if (resp == NULL || (*resp) == '\0' ||
+            sscanf(resp, UINT64_FORMAT, &csn) != 1)
+        return false;
+
+    if (*max_csn < csn)
+        *max_csn = csn;
+
+    return true;
+}
+
 /*
  * pgfdw_xact_callback --- cleanup at main-transaction end.
  */
@@ -656,6 +799,86 @@ pgfdw_xact_callback(XactEvent event, void *arg)
     if (!xact_got_connection)
         return;
 
+    /* Handle possible two-phase commit */
+    if (event == XACT_EVENT_PARALLEL_PRE_COMMIT || event == XACT_EVENT_PRE_COMMIT)
+    {
+        bool include_local_tx = false;
+
+        /* Should we take into account this node? */
+        if (TransactionIdIsValid(GetCurrentTransactionIdIfAny()))
+        {
+            include_local_tx = true;
+            fdwTransState->nparticipants += 1;
+        }
+
+        /* Switch to 2PC mode if there was more than one participant */
+        if (UseGlobalSnapshots && fdwTransState->nparticipants > 1)
+            fdwTransState->two_phase_commit = true;
+
+        if (fdwTransState->two_phase_commit)
+        {
+            GlobalCSN    max_csn = InProgressGlobalCSN,
+                        my_csn = InProgressGlobalCSN;
+            bool    res;
+            char   *sql;
+
+            fdwTransState->gid = psprintf("pgfdw:%lld:%llu:%d:%u:%d:%d",
+                                          (long long) GetCurrentTimestamp(),
+                                          (long long) GetSystemIdentifier(),
+                                          MyProcPid,
+                                          GetCurrentTransactionIdIfAny(),
+                                          ++two_phase_xact_count,
+                                          fdwTransState->nparticipants);
+
+            /* Broadcast PREPARE */
+            sql = psprintf("PREPARE TRANSACTION '%s'", fdwTransState->gid);
+            res = BroadcastCmd(sql);
+            if (!res)
+                goto error;
+
+            /* Broadcast pg_global_snapshot_prepare() */
+            if (include_local_tx)
+                my_csn = GlobalSnapshotPrepareCurrent();
+
+            sql = psprintf("SELECT pg_global_snapshot_prepare('%s')",
+                                                        fdwTransState->gid);
+            res = BroadcastStmt(sql, PGRES_TUPLES_OK, MaxCsnCB, &max_csn);
+            if (!res)
+                goto error;
+
+            /* select maximal global csn */
+            if (include_local_tx && my_csn > max_csn)
+                max_csn = my_csn;
+
+            /* Broadcast pg_global_snapshot_assign() */
+            if (include_local_tx)
+                GlobalSnapshotAssignCsnCurrent(max_csn);
+            sql = psprintf("SELECT pg_global_snapshot_assign('%s',"UINT64_FORMAT")",
+                            fdwTransState->gid, max_csn);
+            res = BroadcastFunc(sql);
+
+error:
+            if (!res)
+            {
+                sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+                BroadcastCmd(sql);
+                elog(ERROR, "Failed to PREPARE transaction on remote node");
+            }
+
+            /*
+             * Do not fall through. The subsequent COMMIT event will clean
+             * things up.
+             */
+            return;
+        }
+    }
+
+    /* COMMIT the open transaction if we were doing 2PC */
+    if (fdwTransState->two_phase_commit &&
+        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+    {
+        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+    }
+
     /*
      * Scan all connection cache entries to find open remote transactions, and
      * close them.
@@ -663,8 +886,6 @@ pgfdw_xact_callback(XactEvent event, void *arg)
     hash_seq_init(&scan, ConnectionHash);
     while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
     {
-        PGresult   *res;
-
         /* Ignore cache entry if no open connection right now */
         if (entry->conn == NULL)
             continue;
@@ -681,6 +902,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
             {
                 case XACT_EVENT_PARALLEL_PRE_COMMIT:
                 case XACT_EVENT_PRE_COMMIT:
+                    Assert(!fdwTransState->two_phase_commit);
 
                     /*
                      * If abort cleanup previously failed for this connection,
@@ -693,28 +915,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
                     do_sql_command(entry->conn, "COMMIT TRANSACTION");
                     entry->changing_xact_state = false;
 
-                    /*
-                     * If there were any errors in subtransactions, and we
-                     * made prepared statements, do a DEALLOCATE ALL to make
-                     * sure we get rid of all prepared statements. This is
-                     * annoying and not terribly bulletproof, but it's
-                     * probably not worth trying harder.
-                     *
-                     * DEALLOCATE ALL only exists in 8.3 and later, so this
-                     * constrains how old a server postgres_fdw can
-                     * communicate with.  We intentionally ignore errors in
-                     * the DEALLOCATE, so that we can hobble along to some
-                     * extent with older servers (leaking prepared statements
-                     * as we go; but we don't really support update operations
-                     * pre-8.3 anyway).
-                     */
-                    if (entry->have_prep_stmt && entry->have_error)
-                    {
-                        res = PQexec(entry->conn, "DEALLOCATE ALL");
-                        PQclear(res);
-                    }
-                    entry->have_prep_stmt = false;
-                    entry->have_error = false;
+                    deallocate_prepared_stmts(entry);
                     break;
                 case XACT_EVENT_PRE_PREPARE:
 
@@ -729,10 +930,15 @@ pgfdw_xact_callback(XactEvent event, void *arg)
                      */
                     ereport(ERROR,
                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-                             errmsg("cannot prepare a transaction that modified remote tables")));
+                                errmsg("cannot prepare a transaction that modified remote tables")));
                     break;
                 case XACT_EVENT_PARALLEL_COMMIT:
                 case XACT_EVENT_COMMIT:
+                    if (fdwTransState->two_phase_commit)
+                        deallocate_prepared_stmts(entry);
+                    else /* Pre-commit should have closed the open transaction */
+                        elog(ERROR, "missed cleaning up connection during pre-commit");
+                    break;
                 case XACT_EVENT_PREPARE:
                     /* Pre-commit should have closed the open transaction */
                     elog(ERROR, "missed cleaning up connection during pre-commit");
@@ -828,6 +1034,38 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 
     /* Also reset cursor numbering for next transaction */
     cursor_number = 0;
+
+    /* Reset fdwTransState */
+    memset(fdwTransState, '\0', sizeof(FdwTransactionState));
+}
+
+/*
+ * If there were any errors in subtransactions, and we
+ * made prepared statements, do a DEALLOCATE ALL to make
+ * sure we get rid of all prepared statements. This is
+ * annoying and not terribly bulletproof, but it's
+ * probably not worth trying harder.
+ *
+ * DEALLOCATE ALL only exists in 8.3 and later, so this
+ * constrains how old a server postgres_fdw can
+ * communicate with.  We intentionally ignore errors in
+ * the DEALLOCATE, so that we can hobble along to some
+ * extent with older servers (leaking prepared statements
+ * as we go; but we don't really support update operations
+ * pre-8.3 anyway).
+ */
+static void
+deallocate_prepared_stmts(ConnCacheEntry *entry)
+{
+    PGresult   *res;
+
+    if (entry->have_prep_stmt && entry->have_error)
+    {
+        res = PQexec(entry->conn, "DEALLOCATE ALL");
+        PQclear(res);
+    }
+    entry->have_prep_stmt = false;
+    entry->have_error = false;
 }
 
 /*
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5699252091..64279c8664 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -268,6 +268,9 @@ typedef struct
     List       *already_used;    /* expressions already dealt with */
 } ec_member_foreign_arg;
 
+bool        UseGlobalSnapshots;
+void        _PG_init(void);
+
 /*
  * SQL functions
  */
@@ -5806,3 +5809,12 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
     /* We didn't find any suitable equivalence class expression */
     return NULL;
 }
+
+void
+_PG_init(void)
+{
+    DefineCustomBoolVariable("postgres_fdw.use_global_snapshots",
+                             "Use global snapshots for FDW transactions", NULL,
+                             &UseGlobalSnapshots, false, PGC_USERSET, 0, NULL,
+                             NULL, NULL);
+}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 70b538e2f9..8cf5b12798 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -186,4 +186,6 @@ extern const char *get_jointype_name(JoinType jointype);
 extern bool is_builtin(Oid objectId);
 extern bool is_shippable(Oid objectId, Oid classId, PgFdwRelationInfo *fpinfo);
 
+extern bool UseGlobalSnapshots;
+
 #endif                            /* POSTGRES_FDW_H */
diff --git a/contrib/postgres_fdw/t/001_bank_coordinator.pl b/contrib/postgres_fdw/t/001_bank_coordinator.pl
new file mode 100644
index 0000000000..1e31f33349
--- /dev/null
+++ b/contrib/postgres_fdw/t/001_bank_coordinator.pl
@@ -0,0 +1,264 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+my $master = get_new_node("master");
+$master->init;
+$master->append_conf('postgresql.conf', qq(
+    max_prepared_transactions = 30
+    log_checkpoints = true
+    postgres_fdw.use_global_snapshots = on
+    track_global_snapshots = on
+    default_transaction_isolation = 'REPEATABLE READ'
+));
+$master->start;
+
+my $shard1 = get_new_node("shard1");
+$shard1->init;
+$shard1->append_conf('postgresql.conf', qq(
+    max_prepared_transactions = 30
+    global_snapshot_defer_time = 15
+    track_global_snapshots = on
+));
+$shard1->start;
+
+my $shard2 = get_new_node("shard2");
+$shard2->init;
+$shard2->append_conf('postgresql.conf', qq(
+    max_prepared_transactions = 30
+    global_snapshot_defer_time = 15
+    track_global_snapshots = on
+));
+$shard2->start;
+
+###############################################################################
+# Prepare nodes
+###############################################################################
+
+$master->safe_psql('postgres', qq[
+    CREATE EXTENSION postgres_fdw;
+    CREATE TABLE accounts(id integer primary key, amount integer);
+    CREATE TABLE global_transactions(tx_time timestamp);
+]);
+
+foreach my $node ($shard1, $shard2)
+{
+    my $port = $node->port;
+    my $host = $node->host;
+
+    $node->safe_psql('postgres',
+            "CREATE TABLE accounts(id integer primary key, amount integer)");
+
+    $master->safe_psql('postgres', qq[
+        CREATE SERVER shard_$port FOREIGN DATA WRAPPER postgres_fdw options(dbname 'postgres', host '$host', port '$port');
+        CREATE FOREIGN TABLE accounts_fdw_$port() inherits (accounts) server shard_$port options(table_name 'accounts');
+        CREATE USER MAPPING for CURRENT_USER SERVER shard_$port;
+    ])
+}
+
+$shard1->safe_psql('postgres', qq[
+    insert into accounts select 2*id-1, 0 from generate_series(1, 10010) as id;
+    CREATE TABLE local_transactions(tx_time timestamp);
+]);
+
+$shard2->safe_psql('postgres', qq[
+    insert into accounts select 2*id, 0 from generate_series(1, 10010) as id;
+    CREATE TABLE local_transactions(tx_time timestamp);
+]);
+
+diag("master: @{[$master->connstr('postgres')]}");
+diag("shard1: @{[$shard1->connstr('postgres')]}");
+diag("shard2: @{[$shard2->connstr('postgres')]}");
+
+###############################################################################
+# pgbench scripts
+###############################################################################
+
+my $bank = File::Temp->new();
+append_to_file($bank, q{
+    \set id random(1, 20000)
+    BEGIN;
+    WITH upd AS (UPDATE accounts SET amount = amount - 1 WHERE id = :id RETURNING *)
+        INSERT into global_transactions SELECT now() FROM upd;
+    UPDATE accounts SET amount = amount + 1 WHERE id = (:id + 1);
+    COMMIT;
+});
+
+my $bank1 = File::Temp->new();
+append_to_file($bank1, q{
+    \set id random(1, 10000)
+    BEGIN;
+    WITH upd AS (UPDATE accounts SET amount = amount - 1 WHERE id = (2*:id + 1) RETURNING *)
+        INSERT into local_transactions SELECT now() FROM upd;
+    UPDATE accounts SET amount = amount + 1 WHERE id = (2*:id + 3);
+    COMMIT;
+});
+
+my $bank2 = File::Temp->new();
+append_to_file($bank2, q{
+    \set id random(1, 10000)
+
+    BEGIN;
+    WITH upd AS (UPDATE accounts SET amount = amount - 1 WHERE id = 2*:id RETURNING *)
+        INSERT into local_transactions SELECT now() FROM upd;
+    UPDATE accounts SET amount = amount + 1 WHERE id = (2*:id + 2);
+    COMMIT;
+});
+
+###############################################################################
+# Helpers
+###############################################################################
+
+sub count_and_delete_rows
+{
+    my ($node, $table) = @_;
+    my $count;
+
+    $count = $node->safe_psql('postgres',"select count(*) from $table");
+    $node->safe_psql('postgres',"delete from $table");
+    diag($node->name, ": completed $count transactions");
+    return $count;
+}
+
+###############################################################################
+# Concurrent global transactions
+###############################################################################
+
+my ($err, $rc);
+my $started;
+my $seconds = 30;
+my $selects;
+my $total = '0';
+my $oldtotal = '0';
+my $isolation_errors = 0;
+
+
+my $pgb_handle;
+
+$pgb_handle = $master->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+
+$started = time();
+$selects = 0;
+while (time() - $started < $seconds)
+{
+    $total = $master->safe_psql('postgres', "select sum(amount) from accounts");
+    if ( ($total ne $oldtotal) and ($total ne '') )
+    {
+        $isolation_errors++;
+        $oldtotal = $total;
+        diag("Isolation error. Total = $total");
+    }
+    if ($total ne '') { $selects++; }
+}
+
+$master->pgbench_await($pgb_handle);
+
+# sanity check
+diag("completed $selects selects");
+die "no actual transactions happened" unless ( $selects > 0 &&
+    count_and_delete_rows($master, 'global_transactions') > 0);
+
+is($isolation_errors, 0, 'isolation between concurrent global transaction');
+
+###############################################################################
+# Concurrent global and local transactions
+###############################################################################
+
+my ($pgb_handle1, $pgb_handle2, $pgb_handle3);
+
+# global txses
+$pgb_handle1 = $master->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+
+# concurrent local
+$pgb_handle2 = $shard1->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank1, 'postgres' );
+$pgb_handle3 = $shard2->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank2, 'postgres' );
+
+$started = time();
+$selects = 0;
+$oldtotal = 0;
+while (time() - $started < $seconds)
+{
+    $total = $master->safe_psql('postgres', "select sum(amount) from accounts");
+    if ( ($total ne $oldtotal) and ($total ne '') )
+    {
+        $isolation_errors++;
+        $oldtotal = $total;
+        diag("Isolation error. Total = $total");
+    }
+    if ($total ne '') { $selects++; }
+}
+
+diag("selects = $selects");
+$master->pgbench_await($pgb_handle1);
+$shard1->pgbench_await($pgb_handle2);
+$shard2->pgbench_await($pgb_handle3);
+
+diag("completed $selects selects");
+die "" unless ( $selects > 0 &&
+    count_and_delete_rows($master, 'global_transactions') > 0 &&
+    count_and_delete_rows($shard1, 'local_transactions') > 0 &&
+    count_and_delete_rows($shard2, 'local_transactions') > 0);
+
+is($isolation_errors, 0, 'isolation between concurrent global and local transactions');
+
+
+###############################################################################
+# Snapshot stability
+###############################################################################
+
+my ($hashes, $hash1, $hash2);
+my $stability_errors = 0;
+
+# global txses
+$pgb_handle1 = $master->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+# concurrent local
+$pgb_handle2 = $shard1->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank1, 'postgres' );
+$pgb_handle3 = $shard2->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank2, 'postgres' );
+
+$selects = 0;
+$started = time();
+while (time() - $started < $seconds)
+{
+    foreach my $node ($master, $shard1, $shard2)
+    {
+        ($hash1, $_, $hash2) = split "\n", $node->safe_psql('postgres', qq[
+            begin isolation level repeatable read;
+            select md5(array_agg((t.*)::text)::text) from (select * from accounts order by id) as t;
+            select pg_sleep(3);
+            select md5(array_agg((t.*)::text)::text) from (select * from accounts order by id) as t;
+            commit;
+        ]);
+
+        if ($hash1 ne $hash2)
+        {
+            diag("oops");
+            $stability_errors++;
+        }
+        elsif ($hash1 eq '' or $hash2 eq '')
+        {
+            die;
+        }
+        else
+        {
+            $selects++;
+        }
+    }
+}
+
+$master->pgbench_await($pgb_handle1);
+$shard1->pgbench_await($pgb_handle2);
+$shard2->pgbench_await($pgb_handle3);
+
+die "" unless ( $selects > 0 &&
+    count_and_delete_rows($master, 'global_transactions') > 0 &&
+    count_and_delete_rows($shard1, 'local_transactions') > 0 &&
+    count_and_delete_rows($shard2, 'local_transactions') > 0);
+
+is($stability_errors, 0, 'snapshot is stable during concurrent global and local transactions');
+
+$master->stop;
+$shard1->stop;
+$shard2->stop;
diff --git a/contrib/postgres_fdw/t/002_bank_participant.pl b/contrib/postgres_fdw/t/002_bank_participant.pl
new file mode 100644
index 0000000000..bf80d21d7a
--- /dev/null
+++ b/contrib/postgres_fdw/t/002_bank_participant.pl
@@ -0,0 +1,240 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 3;
+
+my $shard1 = get_new_node("shard1");
+$shard1->init;
+$shard1->append_conf('postgresql.conf', qq(
+    max_prepared_transactions = 30
+    postgres_fdw.use_global_snapshots = on
+    global_snapshot_defer_time = 15
+    track_global_snapshots = on
+    default_transaction_isolation = 'REPEATABLE READ'
+));
+$shard1->start;
+
+my $shard2 = get_new_node("shard2");
+$shard2->init;
+$shard2->append_conf('postgresql.conf', qq(
+    max_prepared_transactions = 30
+    postgres_fdw.use_global_snapshots = on
+    global_snapshot_defer_time = 15
+    track_global_snapshots = on
+    default_transaction_isolation = 'REPEATABLE READ'
+));
+$shard2->start;
+
+###############################################################################
+# Prepare nodes
+###############################################################################
+
+my @shards = ($shard1, $shard2);
+
+foreach my $node (@shards)
+{
+    $node->safe_psql('postgres', qq[
+        CREATE EXTENSION postgres_fdw;
+        CREATE TABLE accounts(id integer primary key, amount integer);
+        CREATE TABLE accounts_local() inherits(accounts);
+        CREATE TABLE global_transactions(tx_time timestamp);
+        CREATE TABLE local_transactions(tx_time timestamp);
+    ]);
+
+    foreach my $neighbor (@shards)
+    {
+        next if ($neighbor eq $node);
+
+        my $port = $neighbor->port;
+        my $host = $neighbor->host;
+
+        $node->safe_psql('postgres', qq[
+            CREATE SERVER shard_$port FOREIGN DATA WRAPPER postgres_fdw
+                    options(dbname 'postgres', host '$host', port '$port');
+            CREATE FOREIGN TABLE accounts_fdw_$port() inherits (accounts)
+                    server shard_$port options(table_name 'accounts_local');
+            CREATE USER MAPPING for CURRENT_USER SERVER shard_$port;
+        ]);
+    }
+}
+
+$shard1->psql('postgres', "insert into accounts_local select 2*id-1, 0 from generate_series(1, 10010) as id;");
+$shard2->psql('postgres', "insert into accounts_local select 2*id,   0 from generate_series(1, 10010) as id;");
+
+###############################################################################
+# pgbench scripts
+###############################################################################
+
+my $bank = File::Temp->new();
+append_to_file($bank, q{
+    \set id random(1, 20000)
+    BEGIN;
+    WITH upd AS (UPDATE accounts SET amount = amount - 1 WHERE id = :id RETURNING *)
+        INSERT into global_transactions SELECT now() FROM upd;
+    UPDATE accounts SET amount = amount + 1 WHERE id = (:id + 1);
+    COMMIT;
+});
+
+###############################################################################
+# Helpers
+###############################################################################
+
+sub count_and_delete_rows
+{
+    my ($node, $table) = @_;
+    my $count;
+
+    $count = $node->safe_psql('postgres',"select count(*) from $table");
+    $node->safe_psql('postgres',"delete from $table");
+    diag($node->name, ": completed $count transactions");
+    return $count;
+}
+
+###############################################################################
+# Concurrent global transactions
+###############################################################################
+
+my ($err, $rc);
+my $started;
+my $seconds = 30;
+my $selects;
+my $total = '0';
+my $oldtotal = '0';
+my $isolation_errors = 0;
+my $i;
+
+
+my ($pgb_handle1, $pgb_handle2);
+
+$pgb_handle1 = $shard1->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+$pgb_handle2 = $shard2->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+
+$started = time();
+$selects = 0;
+$i = 0;
+while (time() - $started < $seconds)
+{
+    my $shard = $shard1;
+    foreach my $shard (@shards)
+    {
+        $total = $shard->safe_psql('postgres', "select sum(amount) from accounts");
+        if ( ($total ne $oldtotal) and ($total ne '') )
+        {
+            $isolation_errors++;
+            $oldtotal = $total;
+            diag("$i: Isolation error. Total = $total");
+        }
+        if ($total ne '') { $selects++; }
+    }
+    $i++;
+}
+
+$shard1->pgbench_await($pgb_handle1);
+$shard2->pgbench_await($pgb_handle2);
+
+# sanity check
+diag("completed $selects selects");
+die "no actual transactions happened" unless ( $selects > 0 &&
+    count_and_delete_rows($shard1, 'global_transactions') > 0 &&
+    count_and_delete_rows($shard2, 'global_transactions') > 0);
+
+is($isolation_errors, 0, 'isolation between concurrent global transaction');
+
+###############################################################################
+# And do the same after soft restart
+###############################################################################
+
+$shard1->restart;
+$shard2->restart;
+$shard1->poll_query_until('postgres', "select 't'")
+    or die "Timed out waiting for shard1 to become online";
+$shard2->poll_query_until('postgres', "select 't'")
+    or die "Timed out waiting for shard2 to become online";
+
+$seconds = 15;
+$pgb_handle1 = $shard1->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+$pgb_handle2 = $shard2->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+
+$started = time();
+$selects = 0;
+$i = 0;
+
+while (time() - $started < $seconds)
+{
+    my $shard = $shard1;
+    foreach my $shard (@shards)
+    {
+        $total = $shard->safe_psql('postgres', "select sum(amount) from accounts");
+        if ( ($total ne $oldtotal) and ($total ne '') )
+        {
+            $isolation_errors++;
+            $oldtotal = $total;
+            diag("$i: Isolation error. Total = $total");
+        }
+        if ($total ne '') { $selects++; }
+    }
+    $i++;
+}
+
+$shard1->pgbench_await($pgb_handle1);
+$shard2->pgbench_await($pgb_handle2);
+
+# sanity check
+diag("completed $selects selects");
+die "no actual transactions happened" unless ( $selects > 0 &&
+    count_and_delete_rows($shard1, 'global_transactions') > 0 &&
+    count_and_delete_rows($shard2, 'global_transactions') > 0);
+
+is($isolation_errors, 0, 'isolation between concurrent global transaction after restart');
+
+###############################################################################
+# And do the same after hard restart
+###############################################################################
+
+$shard1->teardown_node;
+$shard2->teardown_node;
+$shard1->start;
+$shard2->start;
+$shard1->poll_query_until('postgres', "select 't'")
+    or die "Timed out waiting for shard1 to became online";
+$shard2->poll_query_until('postgres', "select 't'")
+    or die "Timed out waiting for shard2 to became online";
+
+
+$seconds = 15;
+$pgb_handle1 = $shard1->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+$pgb_handle2 = $shard2->pgbench_async(-n, -c => 5, -T => $seconds, -f => $bank, 'postgres' );
+
+$started = time();
+$selects = 0;
+$i = 0;
+
+while (time() - $started < $seconds)
+{
+    my $shard = $shard1;
+    foreach my $shard (@shards)
+    {
+        $total = $shard->safe_psql('postgres', "select sum(amount) from accounts");
+        if ( ($total ne $oldtotal) and ($total ne '') )
+        {
+            $isolation_errors++;
+            $oldtotal = $total;
+            diag("$i: Isolation error. Total = $total");
+        }
+        if ($total ne '') { $selects++; }
+    }
+    $i++;
+}
+
+$shard1->pgbench_await($pgb_handle1);
+$shard2->pgbench_await($pgb_handle2);
+
+# sanity check
+diag("completed $selects selects");
+die "no actual transactions happened" unless ( $selects > 0 &&
+    count_and_delete_rows($shard1, 'global_transactions') > 0 &&
+    count_and_delete_rows($shard2, 'global_transactions') > 0);
+
+is($isolation_errors, 0, 'isolation between concurrent global transaction after hard restart');
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 79fb457075..d72a28f661 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1796,6 +1796,37 @@ sub pg_recvlogical_upto
     }
 }
 
+sub pgbench
+{
+    my ($self, @args) = @_;
+    my $pgbench_handle = $self->pgbench_async(@args);
+    $self->pgbench_await($pgbench_handle);
+}
+
+sub pgbench_async
+{
+    my ($self, @args) = @_;
+
+    my ($in, $out, $err, $rc);
+    $in = '';
+    $out = '';
+
+    my @pgbench_command = (
+        'pgbench',
+        -h => $self->host,
+        -p => $self->port,
+        @args
+    );
+    my $handle = IPC::Run::start(\@pgbench_command, \$in, \$out);
+    return $handle;
+}
+
+sub pgbench_await
+{
+    my ($self, $pgbench_handle) = @_;
+    IPC::Run::finish($pgbench_handle) || BAIL_OUT("pgbench exited with $?");
+}
+
 =pod
 
 =back
-- 
2.11.0


Re: Global snapshots

From
Andrey Borodin
Date:
Hi!

I want to review this patch set, though I understand that it will probably be quite a long process.

I like the idea that with this patch set universally all postgres instances are bound into a single distributed DB, even
if they never heard about each other before :) This is just amazing. Or do I get something wrong?

I've got few questions:
1. If we coordinate HA-clusters with replicas, can replicas participate if their part of the transaction is read-only?
2. How does InDoubt transaction behave when we add or subtract leap seconds?

Also, I could not understand some notes from Arseny:

> 25 июля 2018 г., в 16:35, Arseny Sher <a.sher@postgrespro.ru> написал(а):
>
> * One drawback of these patches is that only REPEATABLE READ is
>   supported. For READ COMMITTED, we must export every new snapshot
>   generated on coordinator to all nodes, which is fairly easy to
>   do. SERIALIZABLE will definitely require chattering between nodes,
>   but that's much less demanded isolevel (e.g. we still don't support
>   it on replicas).

If all shards are executing the transaction in SERIALIZABLE, what anomalies does it permit?

If you have transactions on server A and server B, there are transactions 1 and 2, transaction A1 is serialized before
A2, but B1 is after B2, right?

Maybe we can somehow abort 1 or 2?

>
> * Another somewhat serious issue is that there is a risk of recency
>   guarantee violation. If client starts transaction at node with
>   lagging clocks, its snapshot might not include some recently
>   committed transactions; if client works with different nodes, she
>   might not even see her own changes. CockroachDB describes at [1] how
>   they and Google Spanner overcome this problem. In short, both set
>   hard limit on maximum allowed clock skew.  Spanner uses atomic
>   clocks, so this skew is small and they just wait it at the end of
>   each transaction before acknowledging the client. In CockroachDB, if
>   tuple is not visible but we are unsure whether it is truly invisible
>   or it's just the skew (the difference between snapshot and tuple's
>   csn is less than the skew), transaction is restarted with advanced
>   snapshot. This process is not infinite because the upper border
>   (initial snapshot + max skew) stays the same; this is correct as we
>   just want to ensure that our xact sees all the committed ones before
>   it started. We can implement the same thing.
I think that this situation is also covered in Clock-SI, since transactions will not exit the InDoubt state before we
can see them. But I'm not sure; chances are that I get something wrong, I'll think more about it. I'd be happy to hear
comments from Stas about this.
>
>
> * 003_bank_shared.pl test is removed. In current shape (loading one
>   node) it is useless, and if we bombard both nodes, deadlock surely
>   appears. In general, global snapshots are not needed for such
>   multimaster-like setup -- either there are no conflicts and we are
>   fine, or there is a conflict, in which case we get a deadlock.
Can we do something with this deadlock? Will placing an upper limit on the time spent in the InDoubt state fix the
issue? I understand that aborting automatically is kind of dangerous...

Also, currently a hanging 2PC transaction can cause a lot of headaches for the DBA. Can we have some kind of protection
for the case when one node is gone permanently during a transaction?

Thanks!

Best regards, Andrey Borodin.

Re: Global snapshots

From
Arseny Sher
Date:
Hello,

Andrey Borodin <x4mmm@yandex-team.ru> writes:

> I like the idea that with this patch set universally all postgres
> instances are bound into single distributed DB, even if they never
> heard about each other before :) This is just amazing. Or do I get
> something wrong?

Yeah, in a sense of xact visibility we can view it like this.

> I've got few questions:
> 1. If we coordinate HA-clusters with replicas, can replicas
> participate if their part of transaction is read-only?

Ok, there are several things to consider. Clock-SI as described in the
paper technically boils down to three things. The first two assume a
CSN-based implementation of MVCC where local time acts as the
CSN/snapshot source, and they impose the following additional rules:

1) When a xact expands on some node and imports its snapshot, it must
  be blocked until the clocks on this node show time >= the snapshot
  being imported: a node never processes xacts with a snapshot 'from
  the future'.
2) When we choose the CSN to commit a xact with, we must read the
  clocks on all the nodes that participated in it and set the CSN to
  the max among the read values.

These rules ensure *integrity* of the snapshot in the face of clock
skews regardless of which node we access: that is, snapshots are stable
(no non-repeatable reads) and no xact is considered half-committed: they
prevent the situation where the snapshot sees some xact on one node as
committed and on another node as still running.
(Actually, this is only true under the assumption that any distributed
xact is committed at all nodes instantly at the same time; this is
obviously not true, see the 3rd point below.)
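The two rules can be condensed into a minimal single-process C sketch.
All names here (Node, import_snapshot, commit_csn) are invented for
illustration and do not correspond to symbols in the patch; the real
implementation waits on a physical clock rather than bumping a counter:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t csn_t;

/* Per-node logical clock; in the real system this is physical time. */
typedef struct { csn_t clock; } Node;

/*
 * Rule 1: a node never processes a xact whose snapshot is "from the
 * future".  The real backend sleeps until the local clock catches up;
 * in this simulation we simply advance the simulated clock.
 */
static void import_snapshot(Node *n, csn_t snap)
{
    if (n->clock < snap)
        n->clock = snap;        /* stand-in for waiting */
}

/*
 * Rule 2: the commit CSN is the maximum of the clocks read on all
 * participant nodes, so no participant considers the xact committed
 * "before" its own notion of now.
 */
static csn_t commit_csn(Node *participants[], int nparticipants)
{
    csn_t max = 0;
    for (int i = 0; i < nparticipants; i++)
        if (participants[i]->clock > max)
            max = participants[i]->clock;
    return max;
}
```

A snapshot from the past needs no wait at all; only future snapshots
block the importing node.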

If we are speaking about e.g. traditional usage of a hot standby, where
a client in one xact accesses either the primary or some standby, but
not several nodes at once, we just don't need this stuff because the
usual MVCC in Postgres already provides a consistent snapshot. The same
is true for multimaster-like setups, where each node accepts writes but
a client still accesses a single node; if there is a write conflict
(e.g. the same row updated on different nodes), one of the xacts must be
aborted; the snapshot is still good.

However, if you really intend to read data from multiple nodes in one
xact (e.g. read the primary and then a replica), then yes, these
problems arise and Clock-SI helps with them. Honestly, though, it is
hard for me to make up a reason why you would want to do that: reading
local data is always more efficient than visiting several nodes. It
would make sense if we could read the primary and a replica in parallel,
but that is currently impossible in core Postgres. A more
straightforward application of the patchset is sharding, where data is
split and you might need to go to several nodes in one xact to collect
the needed data.

Also, this patchset adds the core algorithm and makes use of it only in
postgres_fdw; you would need to adapt replication (using the
import/export global snapshot API) to make it work there.

3) The third rule of Clock-SI deals with the following problem.
  A distributed (writing to several nodes) xact doesn't commit (i.e.
  become visible) instantly at all nodes. That means there is a time
  hole in which we can see the xact as committed on some node and still
  running on another. To mitigate this, Clock-SI adds a kind of
  two-phase commit on visibility: an additional InDoubt state which
  blocks all attempts to read this xact's changes until the xact's
  fate (commit/abort) is determined.

Unlike the previous problem, this issue exists in all replicated
setups. Even if we just have a primary streaming data to one hot
standby, xacts are not committed on both instantly, and we might observe
a xact as committed on the primary, then quickly switch to the standby
and find that the data we have just seen has disappeared. remote_apply
mode partially alleviates this problem (apparently to the degree
comfortable for most application developers) by switching positions:
with it a xact always commits on replicas earlier than on the master. At
least this guarantees that the one who wrote the xact will definitely
see it on the replica, unless the connection to the master is dropped
before the commit ack. Still, the problem is not fully solved: only the
addition of an InDoubt state can fix this.
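As a rough sketch of this two-phase commit on visibility: a reader that
hits a tuple written by an InDoubt xact cannot decide visibility yet and
must block until the xact's fate is known. The state names and the
xact_visible helper below are invented for illustration; the real patch
does this inside its CSN-based visibility checks:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    XACT_RUNNING,
    XACT_IN_DOUBT,      /* prepared, fate not yet determined */
    XACT_COMMITTED,
    XACT_ABORTED
} XactState;

/*
 * Decide whether a writer xact's changes are visible to a reader.
 * For an InDoubt xact the answer is "not yet": the real code blocks
 * here until the state resolves; this simulation just reports that a
 * wait is required via *must_wait.
 */
static bool xact_visible(XactState state, bool *must_wait)
{
    *must_wait = false;
    switch (state)
    {
        case XACT_COMMITTED:
            return true;
        case XACT_IN_DOUBT:
            *must_wait = true;  /* fate unknown: block until resolved */
            return false;
        default:                /* running or aborted: invisible */
            return false;
    }
}
```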

While Clock-SI (and this patchset) certainly addresses the issue, which
becomes even more serious in sharded setups (it makes it possible to see
/parts/ of transactions), there is nothing CSN- or clock-specific
here. In theory, we could implement the same two-phase commit on
visibility without switching to timestamp-based CSN MVCC.

Aside from the paper, you can have a look at Clock-SI explanation in
these slides [1] from PGCon.

> 2. How does InDoubt transaction behave when we add or subtract leap seconds?

Good question! In Clock-SI, time can be arbitrarily desynchronized and
may move forward at arbitrary speed (e.g. clocks can be stopped), but
it must never go backwards. So if leap second correction is implemented
by doubling the duration of a certain second (as it usually seems to
be), we are fine.
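The "never goes backwards" requirement can be sketched as follows (an
illustrative Python fragment, not from the patch; the class and method
names are made up):

```python
# A CSN source that tracks the last value it handed out and clamps any
# backward jump of the wall clock (e.g. an NTP step or a misapplied
# leap-second correction), so CSNs stay strictly monotonic.

import threading
import time

class MonotonicCSN:
    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def next_csn(self):
        now = time.time_ns()            # wall clock; may jump backwards
        with self._lock:
            if now <= self._last:
                now = self._last + 1    # never reuse a value or go back
            self._last = now
            return now
```

With such a clamp, clocks stopping or slewing only delays CSN progress;
it never produces a CSN smaller than one already issued.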

> Also, I could not understand some notes from Arseny:
>
>> 25 июля 2018 г., в 16:35, Arseny Sher <a.sher@postgrespro.ru> написал(а):
>>
>> * One drawback of these patches is that only REPEATABLE READ is
>>   supported. For READ COMMITTED, we must export every new snapshot
>>   generated on coordinator to all nodes, which is fairly easy to
>>   do. SERIALIZABLE will definitely require chattering between nodes,
>>   but that's much less demanded isolevel (e.g. we still don't support
>>   it on replicas).
>
> If all shards are executing transaction in SERIALIZABLE, what anomalies does it permit?
>
> If you have transactions on server A and server B, there are
> transactions 1 and 2, transaction A1 is serialized before A2, but B1
> is after B2, right?
>
> Maybe we can somehow abort 1 or 2?

Yes, your explanation is concise and correct.
To put it another way, ensuring SERIALIZABLE in MVCC requires tracking
reads, and there is no infrastructure for doing that globally.
Classical write skew is possible: you have node A holding x and node B
holding y, initially x = y = 30, and there is a constraint x + y > 0.
Two concurrent xacts start:
T1: x = x - 42;
T2: y = y - 42;
They don't see each other's changes, so both commit successfully and
the constraint is violated. We need to transfer information about reads
between the nodes to know when someone must be aborted.
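The scenario above can be played out in a few lines (a toy simulation,
not Postgres code):

```python
# Write skew under per-node snapshot isolation: each xact checks the
# constraint x + y > 0 against its own snapshot, neither sees the
# other's write, both commit, and the global constraint is violated.

def run_write_skew():
    x, y = 30, 30                       # node A holds x, node B holds y

    # Both xacts take their snapshots before either commits.
    snap_t1 = {"x": x, "y": y}
    snap_t2 = {"x": x, "y": y}

    # T1: x = x - 42; constraint checked against T1's snapshot.
    new_x = snap_t1["x"] - 42
    assert new_x + snap_t1["y"] > 0     # -12 + 30 = 18 > 0: passes

    # T2: y = y - 42; constraint checked against T2's snapshot.
    new_y = snap_t2["y"] - 42
    assert snap_t2["x"] + new_y > 0     # 30 - 12 = 18 > 0: passes

    x, y = new_x, new_y                 # both commits succeed
    return x + y                        # -24: constraint violated

# run_write_skew() returns -24 even though each xact saw the
# constraint hold in its own snapshot.
```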

>>
>> * Another somewhat serious issue is that there is a risk of recency
>>   guarantee violation. If client starts transaction at node with
>>   lagging clocks, its snapshot might not include some recently
>>   committed transactions; if client works with different nodes, she
>>   might not even see her own changes. CockroachDB describes at [1] how
>>   they and Google Spanner overcome this problem. In short, both set
>>   hard limit on maximum allowed clock skew.  Spanner uses atomic
>>   clocks, so this skew is small and they just wait it at the end of
>>   each transaction before acknowledging the client. In CockroachDB, if
>>   tuple is not visible but we are unsure whether it is truly invisible
>>   or it's just the skew (the difference between snapshot and tuple's
>>   csn is less than the skew), transaction is restarted with advanced
>>   snapshot. This process is not infinite because the upper border
>>   (initial snapshot + max skew) stays the same; this is correct as we
>>   just want to ensure that our xact sees all the committed ones before
>>   it started. We can implement the same thing.
> I think that this situation is also covered in Clock-SI since
> transactions will not exit InDoubt state before we can see them. But
> I'm not sure, chances are that I get something wrong, I'll think more
> about it. I'd be happy to hear comments from Stas about this.

The InDoubt state protects us from seeing an xact that is not yet
committed everywhere, but it doesn't protect against starting an xact
on a node with lagging clocks and obtaining a plainly old snapshot.
With such a snapshot we won't see any half-committed data (InDoubt
covers us here), but some recently committed xacts might be missing
from it entirely.
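The CockroachDB-style restart rule quoted above could be sketched like
this (a hypothetical simplification; CSNs are plain integers and the
function name is invented):

```python
# If a tuple's CSN lies above our snapshot but within max_skew, we
# cannot distinguish clock skew from genuine invisibility, so we
# restart with an advanced snapshot. The upper bound (initial snapshot
# + max_skew) is fixed across restarts, so the loop terminates.

def read_with_restart(tuple_csns, snapshot, max_skew):
    upper = snapshot + max_skew          # fixed; never advances
    while True:
        restart_at = None
        for csn in tuple_csns:
            if snapshot < csn <= upper:
                # Ambiguous: might just be skew, not a newer xact.
                restart_at = max(restart_at or 0, csn)
        if restart_at is None:
            # No ambiguity left; everything <= snapshot is visible.
            return [csn for csn in tuple_csns if csn <= snapshot]
        snapshot = restart_at            # restart with advanced snapshot
```

Tuples with CSNs beyond the upper bound stay invisible, which is
correct: we only need to guarantee the xact sees everything that
committed before it started.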

>> * 003_bank_shared.pl test is removed. In current shape (loading one
>>   node) it is useless, and if we bombard both nodes, deadlock surely
>>   appears. In general, global snaphots are not needed for such
>>   multimaster-like setup -- either there are no conflicts and we are
>>   fine, or there is a conflict, in which case we get a deadlock.
> Can we do something with this deadlock? Will placing an upper limit on
> time of InDoubt state fix the issue? I understand that aborting
> automatically is kind of dangerous...

Sure, this is just a generalization of the basic deadlock problem to a
distributed system. To deal with it, someone must periodically collect
locks across the nodes, build a wait-for graph, check it for cycles,
and punish (abort) one of a cycle's creators if a cycle exists. InDoubt
(and global snapshots in general) is unrelated to this; we hung on a
usual row-level lock in this test. BTW, our pg_shardman extension has a
primitive deadlock detector [2]; I suppose Citus [3] also has one.
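The procedure can be sketched as follows (an illustrative fragment;
this is not pg_shardman's actual detector, and all names are invented):

```python
# Periodically collect per-node wait-for edges, merge them into one
# global graph, and pick a victim from any cycle found.

def find_deadlock_victim(per_node_edges):
    # per_node_edges: one list of (waiter_xact, holder_xact) pairs
    # per node, as reported by each node's lock manager.
    graph = {}
    for edges in per_node_edges:
        for waiter, holder in edges:
            graph.setdefault(waiter, set()).add(holder)

    # Depth-first search for a cycle in the global wait-for graph.
    visiting, done = set(), set()

    def dfs(node, path):
        visiting.add(node)
        path.append(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting:
                return path[path.index(nxt):]   # cycle found
            if nxt not in done:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for start in list(graph):
        if start not in done:
            cycle = dfs(start, [])
            if cycle:
                return cycle[0]          # abort one cycle participant
    return None
```

A real detector also has to cope with stale edge snapshots (an edge may
disappear between collection and analysis), which is why collection is
usually repeated before actually aborting anyone.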

> Also, currently hanging 2pc transaction can cause a lot of headache
> for DBA. Can we have some kind of protection for the case when one
> node is gone permanently during transaction?

Oh, the subject of automatic resolution of 2PC xacts is a matter for
another (probably many-miles-long) thread, largely unrelated to global
snapshots/visibility. In general, the problem of distributed
transaction commit that doesn't block while a majority of nodes is
alive requires implementing a distributed consensus algorithm like
Paxos or Raft. You might also find thread [4] interesting.

Thank you for your interest in the topic!

[1] https://yadi.sk/i/qgmFeICvuRwYNA
[2] https://github.com/postgrespro/pg_shardman/blob/broadcast/pg_shardman--0.0.3.sql#L2497
[3] https://www.citusdata.com/
[4]
https://www.postgresql.org/message-id/flat/CAFjFpRc5Eo%3DGqgQBa1F%2B_VQ-q_76B-d5-Pt0DWANT2QS24WE7w%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Global snapshots

From
Dmitry Dolgov
Date:
> On Wed, Jul 25, 2018 at 1:35 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> Hello,
>
> I have looked through the patches and found them pretty accurate. I'd
> fixed a lot of small issues here and there; updated patchset is
> attached.

Hi,

Thank you for working on this patch. Unfortunately, the patch has some
conflicts, could you please rebase it? Also I wonder if you or Stas can shed
some light about this:

> On Wed, May 16, 2018 at 2:02 PM Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> > On 15 May 2018, at 15:53, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I guess it seems to me that you
> > have some further research to do along the lines you've described:
> >
> > 1. Can we hold back xmin only when necessary and to the extent
> > necessary instead of all the time?
> > 2. Can we use something like an STO analog, maybe as an optional
> > feature, rather than actually holding back xmin?
>
> Yes, to both questions. I'll implement that and share results.

Is there any resulting patch where the ideas how to implement this are outlined?


Re: Global snapshots

From
Stas Kelvich
Date:

> On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
>> On Wed, Jul 25, 2018 at 1:35 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>>
>> Hello,
>>
>> I have looked through the patches and found them pretty accurate. I'd
>> fixed a lot of small issues here and there; updated patchset is
>> attached.
>
> Hi,
>
> Thank you for working on this patch. Unfortunately, the patch has some
> conflicts, could you please rebase it?

Rebased onto current master (dcfdf56e89a). I also corrected a few
formatting issues and worked around the new pgbench return codes policy
in the tests.

> Also I wonder if you or Stas can shed
> some light about this:
>
>> On Wed, May 16, 2018 at 2:02 PM Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>>> On 15 May 2018, at 15:53, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> I guess it seems to me that you
>>> have some further research to do along the lines you've described:
>>>
>>> 1. Can we hold back xmin only when necessary and to the extent
>>> necessary instead of all the time?
>>> 2. Can we use something like an STO analog, maybe as an optional
>>> feature, rather than actually holding back xmin?
>>
>> Yes, to both questions. I'll implement that and share results.
>
> Is there any resulting patch where the ideas how to implement this are outlined?

Not yet. I'm going to continue work on this in January, and will
probably try to press some nearby committers into a line-by-line
review.



-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Attachment

Re: Global snapshots

From
Andres Freund
Date:
Hi,

On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
> > On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Is there any resulting patch where the ideas how to implement this are outlined?
> 
> Not yet. I’m going to continue work on this in January. And probably try to
> force some of nearby committers to make a line by line review.

This hasn't happened yet, so I think this ought to be marked as
returned with feedback?

- Andres


Re: Global snapshots

From
Stas Kelvich
Date:

> On 31 Jan 2019, at 18:42, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-11-30 16:00:17 +0300, Stas Kelvich wrote:
>>> On 29 Nov 2018, at 18:21, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>>> Is there any resulting patch where the ideas how to implement this are outlined?
>>
>> Not yet. I’m going to continue work on this in January. And probably try to
>> force some of nearby committers to make a line by line review.
>
> This hasn't happened yet, so I think this ought to be marked as
> returned with feedback?
>

No objections. I don't think this will realistically go in during the
last CF, so I will open it in the next release cycle.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Global snapshots

From
Andrey Borodin
Date:
Hi!

> 30 нояб. 2018 г., в 18:00, Stas Kelvich <s.kelvich@postgrespro.ru> написал(а):
>
>
>
<0001-GlobalCSNLog-SLRU-v3.patch><0002-Global-snapshots-v3.patch><0003-postgres_fdw-support-for-global-snapshots-v3.patch>

Amid the recent backup discussions, I realized that we need to back up
clusters even if they provide global snapshot capabilities.

I think we can have a pretty elegant Point-in-CSN-Recovery here, right?
That would let a group of clusters recover to a globally consistent
state.

Best regards, Andrey Borodin.


Re: Global snapshots

From
Andrey Lepikhov
Date:
Rebased onto current master (fb544735f1).

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachment

Re: Global snapshots

From
Fujii Masao
Date:

On 2020/05/12 19:24, Andrey Lepikhov wrote:
> Rebased onto current master (fb544735f1).

Thanks for the patches!

These patches no longer apply cleanly and cause a compilation failure,
so could you rebase and update them?

The patches seem not to be registered in CommitFest yet.
Are you planning to do that?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
On 09.06.2020 11:41, Fujii Masao wrote:
>
>
> On 2020/05/12 19:24, Andrey Lepikhov wrote:
>> Rebased onto current master (fb544735f1).
>
> Thanks for the patches!
>
> These patches are no longer applied cleanly and caused the compilation 
> failure.
> So could you rebase and update them?
Rebased onto 57cb806308 (see attachment).
>
> The patches seem not to be registered in CommitFest yet.
> Are you planning to do that?
Not now. It is a sharding-related feature. I'm not sure that this 
approach is fully consistent with the sharding way now.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com


Attachment

Re: Global snapshots

From
Amit Kapila
Date:
On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
>
> On 09.06.2020 11:41, Fujii Masao wrote:
> >
> >
> > The patches seem not to be registered in CommitFest yet.
> > Are you planning to do that?
> Not now. It is a sharding-related feature. I'm not sure that this
> approach is fully consistent with the sharding way now.
>

Can you please explain in detail why you think so?  There is no commit
message explaining what each patch does, so it is difficult to
understand why you said so.  Also, can you let us know whether this
supports 2PC in some way and, if so, how it differs from what the
other thread on the same topic [1] is trying to achieve?  I would also
like to know whether the patch related to CSN-based snapshots [2] is a
precursor for this and, if not, whether it is in any way related to
this patch: I see the latest reply on that thread [2] saying it is
infrastructure for the sharding feature, but I don't completely
understand whether these patches are related.

Basically, there seem to be three threads: first this one, and then
[1] and [2], which seem to be doing work for the sharding feature, but
there is no clear explanation anywhere of whether they are related in
any way, or whether by combining all three we are aiming for a solution
for atomic commit and atomic visibility.

I am not sure if you know the answers to all these questions, so I
have added the people who seem to be working on the other two patches.
I am also afraid that there may be duplicate or conflicting work going
on in these threads, so we should try to find that out as well.


[1] - https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
On 6/19/20 11:48 AM, Amit Kapila wrote:
> On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> On 09.06.2020 11:41, Fujii Masao wrote:
>>> The patches seem not to be registered in CommitFest yet.
>>> Are you planning to do that?
>> Not now. It is a sharding-related feature. I'm not sure that this
>> approach is fully consistent with the sharding way now.
> Can you please explain in detail, why you think so?  There is no
> commit message explaining what each patch does so it is difficult to
> understand why you said so?
For now, I have used this patch set to provide correct visibility when
a table with foreign partitions is accessed from many nodes in
parallel. So I saw this patch set as a sharding-related feature, but
[1] shows another useful application.
The CSN-based approach has weak points such as:
1. Dependency on clock synchronization.
2. The need to guarantee a monotonically increasing CSN across an
instance restart/crash, etc.
3. The need to delay advancing OldestXmin, because it can be needed for
a transaction snapshot at another node.
So I am not fully convinced that it will be better than a single
distributed transaction manager.
> Also, can you let us know if this
> supports 2PC in some way and if so how is it different from what the
> other thread on the same topic [1] is trying to achieve?
Yes, the patch '0003-postgres_fdw-support-for-global-snapshots'
contains the 2PC machinery. For now I'd not judge which approach is
better.
> Also, I
> would like to know if the patch related to CSN based snapshot [2] is a
> precursor for this, if not, then is it any way related to this patch
> because I see the latest reply on that thread [2] which says it is an
> infrastructure of sharding feature but I don't understand completely
> whether these patches are related?
I need some time to study this patch. At first sight it is different.
> 
> Basically, there seem to be three threads, first, this one and then
> [1] and [2] which seems to be doing the work for sharding feature but
> there is no clear explanation anywhere if these are anyway related or
> whether combining all these three we are aiming for a solution for
> atomic commit and atomic visibility.
It can be useful to study all approaches.
> 
> I am not sure if you know answers to all these questions so I added
> the people who seem to be working on the other two patches.  I am also
> afraid that if there is any duplicate or conflicting work going on in
> these threads so we should try to find that as well.
Ok
> 
> 
> [1] -
https://www.postgresql.org/message-id/CA%2Bfd4k4v%2BKdofMyN%2BjnOia8-7rto8tsh9Zs3dd7kncvHp12WYw%40mail.gmail.com
> [2] - https://www.postgresql.org/message-id/2020061911294657960322%40highgo.ca
> 

[1] 
https://www.postgresql.org/message-id/flat/20200301083601.ews6hz5dduc3w2se%40alap3.anarazel.de

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com



Re: Global snapshots

From
"movead.li@highgo.ca"
Date:

>> would like to know if the patch related to CSN based snapshot [2] is a
>> precursor for this, if not, then is it any way related to this patch
>> because I see the latest reply on that thread [2] which says it is an
>> infrastructure of sharding feature but I don't understand completely
>> whether these patches are related?
> I need some time to study this patch. At first sight it is different.

This patch [2] is largely based on [3]. Because [1] is mostly about 2PC
and FDW, this patch focuses on CSN only; I detached the global snapshot
part and the FDW part from the [1] patch.

I noticed that the CSN does not survive a restart in the [1] patch,
which I think may not be the right way; perhaps this is what the last
mail meant by "needs guarantees of monotonically increasing of the CSN
in the case of an instance restart/crash etc.", so I tried to add WAL
support for the CSN in this patch.

That's why this thread exists.


Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: Global snapshots

From
Bruce Momjian
Date:
On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead.li@highgo.ca wrote:
> 
> >> would like to know if the patch related to CSN based snapshot [2] is a
> >> precursor for this, if not, then is it any way related to this patch
> >> because I see the latest reply on that thread [2] which says it is an
> >> infrastructure of sharding feature but I don't understand completely
> >> whether these patches are related?
> >I need some time to study this patch.. At first sight it is different.
> 
> This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> and FDW, so this patch focus on CSN only and I detach the global snapshot
> part and FDW part from the [1] patch. 
> 
> I notice CSN will not survival after a restart in [1] patch, I think it may not
> the
> right way, may be it is what in last mail "Needs guarantees of monotonically
> increasing of the CSN in the case of an instance restart/crash etc" so I try to
> add wal support for CSN on this patch.
> 
> That's why this thread exist.

I was certainly missing how these items fit together.  Sharding needs
parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
snapshots, we need CSNs.  This new sharding wiki page has more details:

    https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

After all that is done, we will need optimizer improvements and shard
management tooling.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Global snapshots

From
Amit Kapila
Date:
On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
>
> On 6/19/20 11:48 AM, Amit Kapila wrote:
> > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> > <a.lepikhov@postgrespro.ru> wrote:
> >> On 09.06.2020 11:41, Fujii Masao wrote:
> >>> The patches seem not to be registered in CommitFest yet.
> >>> Are you planning to do that?
> >> Not now. It is a sharding-related feature. I'm not sure that this
> >> approach is fully consistent with the sharding way now.
> > Can you please explain in detail, why you think so?  There is no
> > commit message explaining what each patch does so it is difficult to
> > understand why you said so?
> For now I used this patch set for providing correct visibility in the
> case of access to the table with foreign partitions from many nodes in
> parallel. So I saw at this patch set as a sharding-related feature, but
> [1] shows another useful application.
> CSN-based approach has weak points such as:
> 1. Dependency on clocks synchronization
> 2. Needs guarantees of monotonically increasing of the CSN in the case
> of an instance restart/crash etc.
> 3. We need to delay increasing of OldestXmin because it can be needed
> for a transaction snapshot at another node.
>

So, is anyone working on improving these parts of the patch?  AFAICS
from what Bruce has shared [1], some people from HighGo are working on
it, but I don't see any discussion of that yet.

> So I do not have full conviction that it will be better than a single
> distributed transaction manager.
>

When you say "single distributed transaction manager", do you mean
something like pg_dtm, which is inspired by Postgres-XL?

>    Also, can you let us know if this
> > supports 2PC in some way and if so how is it different from what the
> > other thread on the same topic [1] is trying to achieve?
> Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> 2PC machinery. Now I'd not judge which approach is better.
>

Yeah, I have studied both the approaches a little and I feel the main
difference seems to be that in this patch atomicity is tightly coupled
with how we achieve global visibility, basically in this patch "all
running transactions are marked as InDoubt on all nodes in prepare
phase, and after that, each node commit it and stamps each xid with a
given GlobalCSN.".  There are no separate APIs for
prepare/commit/rollback exposed by postgres_fdw as we do it in the
approach followed by Sawada-San's patch.  It seems to me in the patch
in this email one of postgres_fdw node can be a sort of coordinator
which prepares and commit the transaction on all other nodes whereas
that is not true in Sawada-San's patch (where the coordinator is a
local Postgres node, am I right Sawada-San?).  OTOH, Sawada-San's
patch has advanced concepts like a resolver process that can
commit/abort the transactions later.  I still couldn't get a complete
grip on both patches, so it is difficult to say which approach is
better, but I think at the least we should have some discussion.

I feel that if Sawada-San or someone involved in the other patch also
studies this approach and tries to come up with some form of
comparison, then we might be able to make a better decision.  It is
possible that there are a few good things in each approach which we
can use.

>   Also, I
> > would like to know if the patch related to CSN based snapshot [2] is a
> > precursor for this, if not, then is it any way related to this patch
> > because I see the latest reply on that thread [2] which says it is an
> > infrastructure of sharding feature but I don't understand completely
> > whether these patches are related?
> I need some time to study this patch. At first sight it is different.
>

I feel the opposite.  I think it has extracted some stuff from this
patch series and extended it.

Thanks for the inputs.  I feel inputs from you and others who were
involved in this project will be really helpful to move this project
forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Amit Kapila
Date:
On Sat, Jun 20, 2020 at 5:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> So, is anyone working on improving these parts of the patch.  AFAICS
> from what Bruce has shared [1],
>

oops, forgot to share the link [1] -
https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Amit Kapila
Date:
On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead.li@highgo.ca wrote:
> >
> > >> would like to know if the patch related to CSN based snapshot [2] is a
> > >> precursor for this, if not, then is it any way related to this patch
> > >> because I see the latest reply on that thread [2] which says it is an
> > >> infrastructure of sharding feature but I don't understand completely
> > >> whether these patches are related?
> > >I need some time to study this patch.. At first sight it is different.
> >
> > This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> > and FDW, so this patch focus on CSN only and I detach the global snapshot
> > part and FDW part from the [1] patch.
> >
> > I notice CSN will not survival after a restart in [1] patch, I think it may not
> > the
> > right way, may be it is what in last mail "Needs guarantees of monotonically
> > increasing of the CSN in the case of an instance restart/crash etc" so I try to
> > add wal support for CSN on this patch.
> >
> > That's why this thread exist.
>
> I was certainly missing how these items fit together.  Sharding needs
> parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
> snapshots, we need CSN.  This new sharding wiki pages has more details:
>
>         https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
>

Thanks for maintaining this page.  It is quite helpful!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Bruce Momjian
Date:
On Sat, Jun 20, 2020 at 05:54:18PM +0530, Amit Kapila wrote:
> On Fri, Jun 19, 2020 at 6:33 PM Bruce Momjian <bruce@momjian.us> wrote:
> >
> > On Fri, Jun 19, 2020 at 05:03:20PM +0800, movead.li@highgo.ca wrote:
> > >
> > > >> would like to know if the patch related to CSN based snapshot [2] is a
> > > >> precursor for this, if not, then is it any way related to this patch
> > > >> because I see the latest reply on that thread [2] which says it is an
> > > >> infrastructure of sharding feature but I don't understand completely
> > > >> whether these patches are related?
> > > >I need some time to study this patch.. At first sight it is different.
> > >
> > > This patch[2] is almost base on [3], because I think [1] is talking about 2PC
> > > and FDW, so this patch focus on CSN only and I detach the global snapshot
> > > part and FDW part from the [1] patch.
> > >
> > > I notice CSN will not survival after a restart in [1] patch, I think it may not
> > > the
> > > right way, may be it is what in last mail "Needs guarantees of monotonically
> > > increasing of the CSN in the case of an instance restart/crash etc" so I try to
> > > add wal support for CSN on this patch.
> > >
> > > That's why this thread exist.
> >
> > I was certainly missing how these items fit together.  Sharding needs
> > parallel FDWs, atomic commits, and atomic snapshots.  To get atomic
> > snapshots, we need CSN.  This new sharding wiki pages has more details:
> >
> >         https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding
> >
> 
> Thanks for maintaining this page.  It is quite helpful!

Ahsan Hadi <ahsan.hadi@highgo.ca> created that page, and I just made a
few wording edits.  Ahsan is copying information from this older
sharding wiki page:

    https://wiki.postgresql.org/wiki/Built-in_Sharding

to the new one you listed above.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Global snapshots

From
Bruce Momjian
Date:
On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> I feel if Sawada-San or someone involved in another patch also once
> studies this approach and try to come up with some form of comparison
> then we might be able to make better decision.  It is possible that
> there are few good things in each approach which we can use.

Agreed. Postgres-XL code is under the Postgres license:

    Postgres-XL is released under the PostgreSQL License, a liberal Open
    Source license, similar to the BSD or MIT licenses.

and even says they want it moved into Postgres core:

    https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/

    Postgres-XL is a massively parallel database built on top of,
    and very closely compatible with PostgreSQL 9.5 and its set of advanced
    features. Postgres-XL is fully open source and many parts of it will
    feed back directly or indirectly into later releases of PostgreSQL, as
    we begin to move towards a fully parallel sharded version of core PostgreSQL.

so we should understand what can be used from it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Global snapshots

From
Amit Kapila
Date:
On Mon, Jun 22, 2020 at 8:36 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Sat, Jun 20, 2020 at 05:51:21PM +0530, Amit Kapila wrote:
> > I feel if Sawada-San or someone involved in another patch also once
> > studies this approach and try to come up with some form of comparison
> > then we might be able to make better decision.  It is possible that
> > there are few good things in each approach which we can use.
>
> Agreed. Postgres-XL code is under the Postgres license:
>
>         Postgres-XL is released under the PostgreSQL License, a liberal Open
>         Source license, similar to the BSD or MIT licenses.
>
> and even says they want it moved into Postgres core:
>
>         https://www.postgres-xl.org/2017/08/postgres-xl-9-5-r1-6-announced/
>
>         Postgres-XL is a massively parallel database built on top of,
>         and very closely compatible with PostgreSQL 9.5 and its set of advanced
>         features. Postgres-XL is fully open source and many parts of it will
>         feed back directly or indirectly into later releases of PostgreSQL, as
>         we begin to move towards a fully parallel sharded version of core PostgreSQL.
>
> so we should understand what can be used from it.
>

+1.  I think that will be quite useful.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Masahiko Sawada
Date:
On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
> >
> > On 6/19/20 11:48 AM, Amit Kapila wrote:
> > > On Wed, Jun 10, 2020 at 8:36 AM Andrey V. Lepikhov
> > > <a.lepikhov@postgrespro.ru> wrote:
> > >> On 09.06.2020 11:41, Fujii Masao wrote:
> > >>> The patches seem not to be registered in CommitFest yet.
> > >>> Are you planning to do that?
> > >> Not now. It is a sharding-related feature. I'm not sure that this
> > >> approach is fully consistent with the sharding way now.
> > > Can you please explain in detail, why you think so?  There is no
> > > commit message explaining what each patch does so it is difficult to
> > > understand why you said so?
> > For now I used this patch set for providing correct visibility in the
> > case of access to the table with foreign partitions from many nodes in
> > parallel. So I saw at this patch set as a sharding-related feature, but
> > [1] shows another useful application.
> > CSN-based approach has weak points such as:
> > 1. Dependency on clocks synchronization
> > 2. Needs guarantees of monotonically increasing of the CSN in the case
> > of an instance restart/crash etc.
> > 3. We need to delay increasing of OldestXmin because it can be needed
> > for a transaction snapshot at another node.
> >
>
> So, is anyone working on improving these parts of the patch.  AFAICS
> from what Bruce has shared [1], some people from HighGo are working on
> it but I don't see any discussion of that yet.
>
> > So I do not have full conviction that it will be better than a single
> > distributed transaction manager.
> >
>
> When you say "single distributed transaction manager"  do you mean
> something like pg_dtm which is inspired by Postgres-XL?
>
> >    Also, can you let us know if this
> > > supports 2PC in some way and if so how is it different from what the
> > > other thread on the same topic [1] is trying to achieve?
> > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > 2PC machinery. Now I'd not judge which approach is better.
> >
>

Sorry for being late.

> Yeah, I have studied both the approaches a little and I feel the main
> difference seems to be that in this patch atomicity is tightly coupled
> with how we achieve global visibility, basically in this patch "all
> running transactions are marked as InDoubt on all nodes in prepare
> phase, and after that, each node commit it and stamps each xid with a
> given GlobalCSN.".  There are no separate APIs for
> prepare/commit/rollback exposed by postgres_fdw as we do it in the
> approach followed by Sawada-San's patch.  It seems to me in the patch
> in this email one of postgres_fdw node can be a sort of coordinator
> which prepares and commit the transaction on all other nodes whereas
> that is not true in Sawada-San's patch (where the coordinator is a
> local Postgres node, am I right Sawada-San?).

Yeah, the difference is where foreign transactions are managed:
postgres_fdw manages them in this patch, whereas the PostgreSQL core
does so in that 2PC patch.

>
> I feel if Sawada-San or someone involved in another patch also once
> studies this approach and try to come up with some form of comparison
> then we might be able to make better decision.  It is possible that
> there are few good things in each approach which we can use.
>

I studied this patch and did a simple comparison between this patch
(0002 patch) and my 2PC patch.

In terms of atomic commit, the features that are not implemented in
this patch but are in the 2PC patch are:

* Crash safety.
* PREPARE TRANSACTION command support.
* Query cancellation while waiting for the commit.
* Automatic in-doubt transaction resolution.

On the other hand, the feature that is implemented in this patch but
not in the 2PC patch is:

* Executing PREPARE TRANSACTION (and other commands) in parallel

When the 2PC patch was proposed, IIRC it was like this patch (0002
patch). I mean, it changed only postgres_fdw to support 2PC. But after
discussion, we changed the approach so that the core manages foreign
transactions, for crash safety. From my perspective, this patch has a
minimum implementation of 2PC to make the global snapshot feature work
and is missing some features important for supporting crash-safe atomic
commit. So I personally think we should consider how to integrate this
global snapshot feature with the 2PC patch, rather than improving this
patch, if we want crash-safe atomic commit.

Looking at the commit procedure with this patch:

When starting a new transaction on a foreign server, postgres_fdw
executes pg_global_snapshot_import() to import the global snapshot.
After some work, in the pre-commit phase we do:

1. generate a global transaction id, say 'gid'
2. execute PREPARE TRANSACTION 'gid' on all participants
3. prepare the global snapshot locally, if the local node is also
involved in the transaction
4. execute pg_global_snapshot_prepare('gid') on all participants

During steps 2 to 4, we calculate the maximum CSN from the CSNs
returned from each pg_global_snapshot_prepare() execution.

5. assign the global snapshot locally, if the local node is also
involved in the transaction
6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.

Then, we commit locally (i.e., mark the current transaction as
committed in clog).

After that, in the post-commit phase, we execute COMMIT PREPARED 'gid'
on all participants.
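The procedure above can be sketched as follows. This is a minimal
Python illustration, not the patch's C implementation: the
pg_global_snapshot_* SQL function names come from the patch, while the
Participant class, its run() method, and the gid-generation detail are
hypothetical stand-ins for real connections to foreign servers.

```python
# Sketch of the coordinator-side commit sequence described above.
import uuid

class Participant:
    """Hypothetical stand-in for a connection to one foreign server."""
    def __init__(self, node_csn):
        self._node_csn = node_csn
        self.log = []

    def run(self, sql):
        self.log.append(sql)
        # pg_global_snapshot_prepare() returns the node's CSN.
        if sql.startswith("SELECT pg_global_snapshot_prepare"):
            return self._node_csn
        return None

def global_commit(participants):
    gid = "fx_%s" % uuid.uuid4().hex           # 1. generate a gid
    for p in participants:                     # 2. PREPARE on all nodes
        p.run("PREPARE TRANSACTION '%s'" % gid)
    # 3-4. collect each node's CSN and compute the maximum
    max_csn = max(p.run("SELECT pg_global_snapshot_prepare('%s')" % gid)
                  for p in participants)
    for p in participants:                     # 5-6. assign the max CSN
        p.run("SELECT pg_global_snapshot_assign('%s', %d)" % (gid, max_csn))
    # ... commit locally (mark the xact committed in clog) ...
    for p in participants:                     # post-commit phase
        p.run("COMMIT PREPARED '%s'" % gid)
    return max_csn
```

Note that the two extra round trips (prepare and assign) are in addition
to the usual PREPARE TRANSACTION / COMMIT PREPARED exchange.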

Considering how to integrate this global snapshot feature with the 2PC
patch, what the 2PC patch needs to change, at a minimum, is to allow
the FDW to store FDW-private data that is passed to subsequent FDW
transaction API calls. Currently, in the 2PC patch, we call the
Prepare API for each participant server one by one, and the core
passes only metadata such as ForeignServer, UserMapping, and the
global transaction identifier. So it's not easy to calculate the
maximum CSN across multiple transaction API calls. I think we can
change the 2PC patch to add a void pointer to FdwXactRslvState, the
struct passed from the core, in order to store FDW-private data; it
would be the maximum CSN in this case. That way, at the first Prepare
API call postgres_fdw allocates the space and stores the CSN there. At
subsequent Prepare API calls it can update the maximum CSN, and is
then able to do steps 3 to 6 when preparing the transaction on the
last participant. Another idea would be to change the 2PC patch so that
the core passes a bunch of participants grouped by FDW.
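The proposed API change can be illustrated roughly as follows. This is
a hypothetical Python model, assuming the void-pointer field described
above; FdwXactRslvState here is merely a stand-in for the C struct of
the same name in the 2PC patch, and fdw_prepare stands in for the FDW's
Prepare callback.

```python
# Illustration of carrying FDW-private state (the running maximum CSN)
# across successive Prepare API calls via an opaque pointer.
class FdwXactRslvState:
    def __init__(self):
        self.fdw_private = None   # proposed void * field, owned by the FDW

def fdw_prepare(state, node_csn, is_last):
    if state.fdw_private is None:
        # First call: allocate the private space and store the CSN.
        state.fdw_private = node_csn
    else:
        # Subsequent calls: keep the running maximum.
        state.fdw_private = max(state.fdw_private, node_csn)
    if is_last:
        # On the last participant, steps 3 to 6 (assigning the maximum
        # CSN) can run using the accumulated value.
        return state.fdw_private
    return None
```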

I’ve not read this patch deeply yet and have considered it without any
coding, but my first impression is that it would not be hard to
integrate this feature with the 2PC patch.

Regards,


--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

From
Amit Kapila
Date:
On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > <a.lepikhov@postgrespro.ru> wrote:
> >
> > >    Also, can you let us know if this
> > > > supports 2PC in some way and if so how is it different from what the
> > > > other thread on the same topic [1] is trying to achieve?
> > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > 2PC machinery. Now I'd not judge which approach is better.
> > >
> >
>
> Sorry for being late.
>

No problem; your summarization and comparison of both approaches are
quite helpful.

>
> I studied this patch and did a simple comparison between this patch
> (0002 patch) and my 2PC patch.
>
> In terms of atomic commit, the features that are not implemented in
> this patch but in the 2PC patch are:
>
> * Crash safe.
> * PREPARE TRANSACTION command support.
> * Query cancel during waiting for the commit.
> * Automatically in-doubt transaction resolution.
>
> On the other hand, the feature that is implemented in this patch but
> not in the 2PC patch is:
>
> * Executing PREPARE TRANSACTION (and other commands) in parallel
>
> When the 2PC patch was proposed, IIRC it was like this patch (0002
> patch). I mean, it changed only postgres_fdw to support 2PC. But after
> discussion, we changed the approach to have the core manage foreign
> transaction for crash-safe. From my perspective, this patch has a
> minimum implementation of 2PC to work the global snapshot feature and
> has some missing features important for supporting crash-safe atomic
> commit. So I personally think we should consider how to integrate this
> global snapshot feature with the 2PC patch, rather than improving this
> patch if we want crash-safe atomic commit.
>

Okay, but isn't there also some advantage with this approach (managing
2PC at the postgres_fdw level), namely that any node will be capable of
handling global transactions rather than doing them via a central
coordinator?  I mean any node can do writes or reads rather than
routing them (at least writes) via a coordinator node.  Now, I agree
that even if this advantage is there in the current approach, we can't
lose the crash-safety aspect of the other approach.  Will you be able
to summarize what the problem was w.r.t. crash safety and how your
patch has dealt with it?

> Looking at the commit procedure with this patch:
>
> When starting a new transaction on a foreign server, postgres_fdw
> executes pg_global_snapshot_import() to import the global snapshot.
> After some work, in pre-commit phase we do:
>
> 1. generate global transaction id, say 'gid'
> 2. execute PREPARE TRANSACTION 'gid' on all participants.
> 3. prepare global snapshot locally, if the local node also involves
> the transaction
> 4. execute pg_global_snapshot_prepare('gid') for all participants
>
> During step 2 to 4, we calculate the maximum CSN from the CSNs
> returned from each pg_global_snapshot_prepare() executions.
>
> 5. assign global snapshot locally, if the local node also involves the
> transaction
> 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
>
> Then, we commit locally (i.g. mark the current transaction as
> committed in clog).
>
> After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> participants.
>

As per my current understanding, the overall idea is as follows.  For
global transactions, pg_global_snapshot_prepare('gid') will set the
transaction status to InDoubt and generate a CSN (let's call it
NodeCSN) at the node where that function is executed; it also returns
the NodeCSN to the coordinator.  Then the coordinator (the current
postgres_fdw node on which the write transaction is being executed)
computes MaxCSN from the return values (NodeCSNs) of
pg_global_snapshot_prepare from all nodes.  It then assigns MaxCSN to
each node.  Finally, when COMMIT PREPARED is issued on each node, that
MaxCSN is written on each node, including the current one.  So, with
this idea, each node will have the same view of the CSN value
corresponding to any particular transaction.

For snapshot management, the node which receives the query generates a
CSN (CurrentCSN) and follows the simple rule that a tuple written by
an xid whose CSN is less than CurrentCSN is visible.  Now, it is
possible that when we are examining a tuple, the CSN corresponding to
the xid that wrote the tuple has the value InDoubt, which indicates
that the transaction is not yet committed on all nodes.  In that case
we wait until we get a valid CSN value for the xid and then use it to
check if the tuple is visible.

Now, one thing to note here is that for global transactions we
primarily rely on the CSN value corresponding to a transaction for its
visibility, even though we still maintain CLOG for local transaction
status.
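That visibility rule can be sketched in a few lines. This is a
deliberately simplified Python model, not the patch's tuple-visibility
code: INDOUBT, the csn_of_xid mapping, and the wait_for_csn callback
are all hypothetical stand-ins for the CSN log and the wait machinery.

```python
# Minimal sketch of CSN-based visibility with an in-doubt wait.
INDOUBT = object()  # sentinel: global commit still in progress

def tuple_is_visible(xid, snapshot_csn, csn_of_xid, wait_for_csn):
    csn = csn_of_xid[xid]
    while csn is INDOUBT:
        # The writing transaction is prepared but not yet stamped with
        # its final CSN on all nodes: block until a valid CSN appears.
        csn = wait_for_csn(xid)
    # A tuple is visible iff its writer committed before the snapshot.
    return csn < snapshot_csn
```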

Leaving aside the incomplete parts and/or flaws of the current patch,
does the above match the top-level idea of this patch?  I am not sure
if my understanding of this patch at this stage is completely correct,
or whether we want to follow the approach of this patch, but I think at
least let's first be sure whether such a top-level idea can achieve
what we want to do here.

> Considering how to integrate this global snapshot feature with the 2PC
> patch, what the 2PC patch needs to at least change is to allow FDW to
> store an FDW-private data that is passed to subsequent FDW transaction
> API calls. Currently, in the current 2PC patch, we call Prepare API
> for each participant servers one by one, and the core pass only
> metadata such as ForeignServer, UserMapping, and global transaction
> identifier. So it's not easy to calculate the maximum CSN across
> multiple transaction API calls. I think we can change the 2PC patch to
> add a void pointer into FdwXactRslvState, struct passed from the core,
> in order to store FDW-private data. It's going to be the maximum CSN
> in this case. That way, at the first Prepare API calls postgres_fdw
> allocates the space and stores CSN to that space. And at subsequent
> Prepare API calls it can calculate the maximum of csn, and then is
> able to the step 3 to 6 when preparing the transaction on the last
> participant. Another idea would be to change 2PC patch so that the
> core passes a bunch of participants grouped by FDW.
>

IIUC, with this the coordinator needs to communicate with the nodes
twice at the prepare stage: once to prepare the transaction on each
node and get a CSN from each node, and then to communicate the MaxCSN
to each node?  Also, we probably need an InDoubt CSN status at the
prepare phase to make snapshots and global visibility work.

> I’ve not read this patch deeply yet and have considered it without any
> coding but my first feeling is not hard to integrate this feature with
> the 2PC patch.
>

Okay.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Masahiko Sawada
Date:
On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 3, 2020 at 12:18 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Sat, 20 Jun 2020 at 21:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jun 19, 2020 at 1:42 PM Andrey V. Lepikhov
> > > <a.lepikhov@postgrespro.ru> wrote:
> > >
> > > >    Also, can you let us know if this
> > > > > supports 2PC in some way and if so how is it different from what the
> > > > > other thread on the same topic [1] is trying to achieve?
> > > > Yes, the patch '0003-postgres_fdw-support-for-global-snapshots' contains
> > > > 2PC machinery. Now I'd not judge which approach is better.
> > > >
> > >
> >
> > Sorry for being late.
> >
>
> No problem, your summarization, and comparisons of both approaches are
> quite helpful.
>
> >
> > I studied this patch and did a simple comparison between this patch
> > (0002 patch) and my 2PC patch.
> >
> > In terms of atomic commit, the features that are not implemented in
> > this patch but in the 2PC patch are:
> >
> > * Crash safe.
> > * PREPARE TRANSACTION command support.
> > * Query cancel during waiting for the commit.
> > * Automatically in-doubt transaction resolution.
> >
> > On the other hand, the feature that is implemented in this patch but
> > not in the 2PC patch is:
> >
> > * Executing PREPARE TRANSACTION (and other commands) in parallel
> >
> > When the 2PC patch was proposed, IIRC it was like this patch (0002
> > patch). I mean, it changed only postgres_fdw to support 2PC. But after
> > discussion, we changed the approach to have the core manage foreign
> > transaction for crash-safe. From my perspective, this patch has a
> > minimum implementation of 2PC to work the global snapshot feature and
> > has some missing features important for supporting crash-safe atomic
> > commit. So I personally think we should consider how to integrate this
> > global snapshot feature with the 2PC patch, rather than improving this
> > patch if we want crash-safe atomic commit.
> >
>
> Okay, but isn't there some advantage with this approach (manage 2PC at
> postgres_fdw level) as well which is that any node will be capable of
> handling global transactions rather than doing them via central
> coordinator?  I mean any node can do writes or reads rather than
> probably routing them (at least writes) via coordinator node.

The postgres server where the client started the transaction works as
the coordinator node. I think this is true for both this patch and
that 2PC patch. From the perspective of atomic commit, any node will
be capable of handling global transactions in both approaches.

>  Now, I
> agree that even if this advantage is there in the current approach, we
> can't lose the crash-safety aspect of other approach.  Will you be
> able to summarize what was the problem w.r.t crash-safety and how your
> patch has dealt it?

Since this patch performs 2PC without any logging, foreign
transactions prepared on foreign servers are left over without any
clues if the coordinator crashes during commit. Therefore, after
restart, the user will need to find and resolve in-doubt foreign
transactions manually.

In that 2PC patch, the information about foreign transactions is
WAL-logged before PREPARE TRANSACTION. So even if the coordinator
crashes after preparing some foreign transactions, the prepared
foreign transactions are recovered during crash recovery, and then the
transaction resolver resolves them automatically (or the user can also
resolve them). The user doesn't need to check other participant nodes
to resolve in-doubt foreign transactions. Also, since the foreign
transaction information is replicated to physical standbys, the new
master can take over resolving in-doubt transactions.
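The log-before-prepare idea can be sketched as follows. This is a
hypothetical Python simplification of the 2PC patch's behavior: the
wal list stands in for real WAL records, and the send_prepare and
resolve callbacks stand in for sending PREPARE TRANSACTION and for the
transaction resolver issuing COMMIT PREPARED or ROLLBACK PREPARED.

```python
# Sketch of crash-safe 2PC: durably log each foreign transaction
# *before* sending PREPARE, so recovery knows what may exist remotely.
def prepare_all(participants, gid, wal, send_prepare):
    for server in participants:
        wal.append(("FDWXACT", server, gid))  # logged before PREPARE
        send_prepare(server, gid)

def recover(wal, resolve):
    # After a crash, every logged entry is treated as possibly prepared
    # on the foreign server and handed to the resolver.
    for rec_type, server, gid in wal:
        if rec_type == "FDWXACT":
            resolve(server, gid)
```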

>
> > Looking at the commit procedure with this patch:
> >
> > When starting a new transaction on a foreign server, postgres_fdw
> > executes pg_global_snapshot_import() to import the global snapshot.
> > After some work, in pre-commit phase we do:
> >
> > 1. generate global transaction id, say 'gid'
> > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > 3. prepare global snapshot locally, if the local node also involves
> > the transaction
> > 4. execute pg_global_snapshot_prepare('gid') for all participants
> >
> > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > returned from each pg_global_snapshot_prepare() executions.
> >
> > 5. assign global snapshot locally, if the local node also involves the
> > transaction
> > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> >
> > Then, we commit locally (i.g. mark the current transaction as
> > committed in clog).
> >
> > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > participants.
> >
>
> As per my current understanding, the overall idea is as follows.  For
> global transactions, pg_global_snapshot_prepare('gid') will set the
> transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> at the node where that function is executed, it also returns the
> NodeCSN to the coordinator.  Then the coordinator (the current
> postgres_fdw node on which write transaction is being executed)
> computes MaxCSN based on the return value (NodeCSN) of prepare
> (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> to each node.  Finally, when Commit Prepared is issued for each node
> that MaxCSN will be written to each node including the current node.
> So, with this idea, each node will have the same view of CSN value
> corresponding to any particular transaction.
>
> For Snapshot management, the node which receives the query generates a
> CSN (CurrentCSN) and follows the simple rule that the tuple having a
> xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> possible that when we are examining a tuple, the CSN corresponding to
> xid that has written the tuple has a value as INDOUBT which will
> indicate that the transaction is yet not committed on all nodes.  And
> we wait till we get the valid CSN value corresponding to xid and then
> use it to check if the tuple is visible.
>
> Now, one thing to note here is that for global transactions we
> primarily rely on CSN value corresponding to a transaction for its
> visibility even though we still maintain CLOG for local transaction
> status.
>
> Leaving aside the incomplete parts and or flaws of the current patch,
> does the above match the top-level idea of this patch?

I'm still studying this patch but your understanding seems right to me.

> I am not sure
> if my understanding of this patch at this stage is completely correct
> or whether we want to follow the approach of this patch but I think at
> least lets first be sure if such a top-level idea can achieve what we
> want to do here.
>
> > Considering how to integrate this global snapshot feature with the 2PC
> > patch, what the 2PC patch needs to at least change is to allow FDW to
> > store an FDW-private data that is passed to subsequent FDW transaction
> > API calls. Currently, in the current 2PC patch, we call Prepare API
> > for each participant servers one by one, and the core pass only
> > metadata such as ForeignServer, UserMapping, and global transaction
> > identifier. So it's not easy to calculate the maximum CSN across
> > multiple transaction API calls. I think we can change the 2PC patch to
> > add a void pointer into FdwXactRslvState, struct passed from the core,
> > in order to store FDW-private data. It's going to be the maximum CSN
> > in this case. That way, at the first Prepare API calls postgres_fdw
> > allocates the space and stores CSN to that space. And at subsequent
> > Prepare API calls it can calculate the maximum of csn, and then is
> > able to the step 3 to 6 when preparing the transaction on the last
> > participant. Another idea would be to change 2PC patch so that the
> > core passes a bunch of participants grouped by FDW.
> >
>
> IIUC with this the coordinator needs the communication with the nodes
> twice at the prepare stage, once to prepare the transaction in each
> node and get CSN from each node and then to communicate MaxCSN to each
> node?

Yes, I think so too.

> Also, we probably need InDoubt CSN status at prepare phase to
> make snapshots and global visibility work.

I think it depends on how the global CSN feature works.

For instance, in that 2PC patch, if the coordinator crashes while
preparing a foreign transaction, the global transaction manager
recovers and regards it as "prepared" regardless of whether the
foreign transaction has actually been prepared, and sends ROLLBACK
PREPARED after recovery completes. With the global CSN patch, as you
mentioned, at the prepare phase the coordinator needs to communicate
with participants twice besides sending PREPARE TRANSACTION:
pg_global_snapshot_prepare() and pg_global_snapshot_assign().

If the global CSN patch needs different cleanup work depending on the
CSN status, we will need an InDoubt CSN status so that the global
transaction manager can distinguish between a foreign transaction that
has executed pg_global_snapshot_prepare() and one that has executed
pg_global_snapshot_assign().

On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
PREPARED in that case, I think we don't need an InDoubt CSN status.
There is no difference between those foreign transactions from the
global transaction manager's perspective.

As far as I read the patch, on failure postgres_fdw simply sends
ROLLBACK PREPARED to participants, and there seems to be no additional
work other than that. I might be missing something.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

From
Amit Kapila
Date:
On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Okay, but isn't there some advantage with this approach (manage 2PC at
> > postgres_fdw level) as well which is that any node will be capable of
> > handling global transactions rather than doing them via central
> > coordinator?  I mean any node can do writes or reads rather than
> > probably routing them (at least writes) via coordinator node.
>
> The postgres server where the client started the transaction works as
> the coordinator node. I think this is true for both this patch and
> that 2PC patch. From the perspective of atomic commit, any node will
> be capable of handling global transactions in both approaches.
>

Okay, but then we probably need to ensure that the GID is unique even
if it gets generated on different nodes?  I don't know if that is
ensured.

> >  Now, I
> > agree that even if this advantage is there in the current approach, we
> > can't lose the crash-safety aspect of other approach.  Will you be
> > able to summarize what was the problem w.r.t crash-safety and how your
> > patch has dealt it?
>
> Since this patch proceeds 2PC without any logging, foreign
> transactions prepared on foreign servers are left over without any
> clues if the coordinator crashes during commit. Therefore, after
> restart, the user will need to find and resolve in-doubt foreign
> transactions manually.
>

Okay, but is it because we can't directly WAL-log in postgres_fdw, or
is there some other reason for not doing so?

>
> >
> > > Looking at the commit procedure with this patch:
> > >
> > > When starting a new transaction on a foreign server, postgres_fdw
> > > executes pg_global_snapshot_import() to import the global snapshot.
> > > After some work, in pre-commit phase we do:
> > >
> > > 1. generate global transaction id, say 'gid'
> > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > 3. prepare global snapshot locally, if the local node also involves
> > > the transaction
> > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > >
> > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > returned from each pg_global_snapshot_prepare() executions.
> > >
> > > 5. assign global snapshot locally, if the local node also involves the
> > > transaction
> > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > >
> > > Then, we commit locally (i.g. mark the current transaction as
> > > committed in clog).
> > >
> > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > participants.
> > >
> >
> > As per my current understanding, the overall idea is as follows.  For
> > global transactions, pg_global_snapshot_prepare('gid') will set the
> > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > at the node where that function is executed, it also returns the
> > NodeCSN to the coordinator.  Then the coordinator (the current
> > postgres_fdw node on which write transaction is being executed)
> > computes MaxCSN based on the return value (NodeCSN) of prepare
> > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > to each node.  Finally, when Commit Prepared is issued for each node
> > that MaxCSN will be written to each node including the current node.
> > So, with this idea, each node will have the same view of CSN value
> > corresponding to any particular transaction.
> >
> > For Snapshot management, the node which receives the query generates a
> > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > possible that when we are examining a tuple, the CSN corresponding to
> > xid that has written the tuple has a value as INDOUBT which will
> > indicate that the transaction is yet not committed on all nodes.  And
> > we wait till we get the valid CSN value corresponding to xid and then
> > use it to check if the tuple is visible.
> >
> > Now, one thing to note here is that for global transactions we
> > primarily rely on CSN value corresponding to a transaction for its
> > visibility even though we still maintain CLOG for local transaction
> > status.
> >
> > Leaving aside the incomplete parts and or flaws of the current patch,
> > does the above match the top-level idea of this patch?
>
> I'm still studying this patch but your understanding seems right to me.
>

Cool. While studying it, if you can try to think about whether this
approach is different from the global-coordinator-based approach, that
would be great.  Here is my initial thought: apart from other reasons,
the global-coordinator-based design can help us do global transaction
management and snapshots.  It can allocate xids for each transaction,
then collect the list of running xacts (or CSNs) from each node, and
then prepare a global snapshot that can be used to perform any
transaction.

OTOH, in the design proposed in this patch, we don't need any
coordinator to manage transactions and snapshots because each node's
current CSN will be sufficient for snapshots and visibility, as
explained above.  Now, sure, this assumes that there is no clock skew
between different nodes, or that we somehow take care of it (note that
in the proposed patch the CSN is a timestamp).

> > I am not sure
> > if my understanding of this patch at this stage is completely correct
> > or whether we want to follow the approach of this patch but I think at
> > least lets first be sure if such a top-level idea can achieve what we
> > want to do here.
> >
> > > Considering how to integrate this global snapshot feature with the 2PC
> > > patch, what the 2PC patch needs to at least change is to allow FDW to
> > > store an FDW-private data that is passed to subsequent FDW transaction
> > > API calls. Currently, in the current 2PC patch, we call Prepare API
> > > for each participant servers one by one, and the core pass only
> > > metadata such as ForeignServer, UserMapping, and global transaction
> > > identifier. So it's not easy to calculate the maximum CSN across
> > > multiple transaction API calls. I think we can change the 2PC patch to
> > > add a void pointer into FdwXactRslvState, struct passed from the core,
> > > in order to store FDW-private data. It's going to be the maximum CSN
> > > in this case. That way, at the first Prepare API calls postgres_fdw
> > > allocates the space and stores CSN to that space. And at subsequent
> > > Prepare API calls it can calculate the maximum of csn, and then is
> > > able to the step 3 to 6 when preparing the transaction on the last
> > > participant. Another idea would be to change 2PC patch so that the
> > > core passes a bunch of participants grouped by FDW.
> > >
> >
> > IIUC with this the coordinator needs the communication with the nodes
> > twice at the prepare stage, once to prepare the transaction in each
> > node and get CSN from each node and then to communicate MaxCSN to each
> > node?
>
> Yes, I think so too.
>
> > Also, we probably need InDoubt CSN status at prepare phase to
> > make snapshots and global visibility work.
>
> I think it depends on how global CSN feature works.
>
> For instance, in that 2PC patch, if the coordinator crashes during
> preparing a foreign transaction, the global transaction manager
> recovers and regards it as "prepared" regardless of the foreign
> transaction actually having been prepared. And it sends ROLLBACK
> PREPARED after recovery completed. With global CSN patch, as you
> mentioned, at prepare phase the coordinator needs to communicate
> participants twice other than sending PREPARE TRANSACTION:
> pg_global_snapshot_prepare() and pg_global_snapshot_assign().
>
> If global CSN patch needs different cleanup work depending on the CSN
> status, we will need InDoubt CSN status so that the global transaction
> manager can distinguish between a foreign transaction that has
> executed pg_global_snapshot_prepare() and the one that has executed
> pg_global_snapshot_assign().
>
> On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
> PREPARED in that case, I think we don't need InDoubt CSN status. There
> is no difference between those foreign transactions from the global
> transaction manager perspective.
>

I think the InDoubt status helps in checking visibility in the
proposed patch: if we find the status of a transaction as InDoubt, we
wait until we get some valid CSN for it, as explained in my previous
email.  So regardless of whether we use it for Rollback/Rollback
Prepared, it is required for this design.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Global snapshots

From
Masahiko Sawada
Date:
On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 8, 2020 at 11:16 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Tue, 7 Jul 2020 at 15:40, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Okay, but isn't there some advantage with this approach (manage 2PC at
> > > postgres_fdw level) as well which is that any node will be capable of
> > > handling global transactions rather than doing them via central
> > > coordinator?  I mean any node can do writes or reads rather than
> > > probably routing them (at least writes) via coordinator node.
> >
> > The postgres server where the client started the transaction works as
> > the coordinator node. I think this is true for both this patch and
> > that 2PC patch. From the perspective of atomic commit, any node will
> > be capable of handling global transactions in both approaches.
> >
>
> Okay, but then probably we need to ensure that GID has to be unique
> even if that gets generated on different nodes?  I don't know if that
> is ensured.

Yes, if by GID you mean the global transaction id specified in PREPARE
TRANSACTION, it has to be unique. In that 2PC patch, the GID is
generated in the form 'fx_<random string>_<server oid>_<user oid>'. I
believe it can ensure uniqueness in most cases. In addition, there is
an FDW API to generate an arbitrary identifier.
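A GID in that format could be generated along these lines. This is a
hedged Python sketch: the 'fx_<random string>_<server oid>_<user oid>'
shape comes from the description above, while the choice of
secrets.token_hex as the random-string source and the example oids are
assumptions for illustration only.

```python
# Sketch of GID generation in the 2PC patch's described format.
import secrets

def generate_gid(server_oid, user_oid):
    # Random component makes collisions across nodes very unlikely.
    return "fx_%s_%d_%d" % (secrets.token_hex(8), server_oid, user_oid)
```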

>
> > >  Now, I
> > > agree that even if this advantage is there in the current approach, we
> > > can't lose the crash-safety aspect of other approach.  Will you be
> > > able to summarize what was the problem w.r.t crash-safety and how your
> > > patch has dealt it?
> >
> > Since this patch proceeds 2PC without any logging, foreign
> > transactions prepared on foreign servers are left over without any
> > clues if the coordinator crashes during commit. Therefore, after
> > restart, the user will need to find and resolve in-doubt foreign
> > transactions manually.
> >
>
> Okay, but is it because we can't directly WAL log in postgres_fdw or
> there is some other reason for not doing so?

Yes, I think it is because we cannot WAL log in postgres_fdw. Maybe I
missed the point in your question. Please correct me if I missed
something.

>
> >
> > >
> > > > Looking at the commit procedure with this patch:
> > > >
> > > > When starting a new transaction on a foreign server, postgres_fdw
> > > > executes pg_global_snapshot_import() to import the global snapshot.
> > > > After some work, in pre-commit phase we do:
> > > >
> > > > 1. generate global transaction id, say 'gid'
> > > > 2. execute PREPARE TRANSACTION 'gid' on all participants.
> > > > 3. prepare global snapshot locally, if the local node also involves
> > > > the transaction
> > > > 4. execute pg_global_snapshot_prepare('gid') for all participants
> > > >
> > > > During step 2 to 4, we calculate the maximum CSN from the CSNs
> > > > returned from each pg_global_snapshot_prepare() executions.
> > > >
> > > > 5. assign global snapshot locally, if the local node also involves the
> > > > transaction
> > > > 6. execute pg_global_snapshot_assign('gid', max-csn) on all participants.
> > > >
> > > > Then, we commit locally (i.g. mark the current transaction as
> > > > committed in clog).
> > > >
> > > > After that, in post-commit phase, execute COMMIT PREPARED 'gid' on all
> > > > participants.
> > > >
> > >
> > > As per my current understanding, the overall idea is as follows.  For
> > > global transactions, pg_global_snapshot_prepare('gid') will set the
> > > transaction status as InDoubt and generate CSN (let's call it NodeCSN)
> > > at the node where that function is executed, it also returns the
> > > NodeCSN to the coordinator.  Then the coordinator (the current
> > > postgres_fdw node on which write transaction is being executed)
> > > computes MaxCSN based on the return value (NodeCSN) of prepare
> > > (pg_global_snapshot_prepare) from all nodes.  It then assigns MaxCSN
> > > to each node.  Finally, when Commit Prepared is issued for each node
> > > that MaxCSN will be written to each node including the current node.
> > > So, with this idea, each node will have the same view of CSN value
> > > corresponding to any particular transaction.
> > >
> > > For Snapshot management, the node which receives the query generates a
> > > CSN (CurrentCSN) and follows the simple rule that the tuple having a
> > > xid with CSN lesser than CurrentCSN will be visible.  Now, it is
> > > possible that when we are examining a tuple, the CSN corresponding to
> > > xid that has written the tuple has a value as INDOUBT which will
> > > indicate that the transaction is yet not committed on all nodes.  And
> > > we wait till we get the valid CSN value corresponding to xid and then
> > > use it to check if the tuple is visible.
> > >
> > > Now, one thing to note here is that for global transactions we
> > > primarily rely on CSN value corresponding to a transaction for its
> > > visibility even though we still maintain CLOG for local transaction
> > > status.
> > >
> > > Leaving aside the incomplete parts and or flaws of the current patch,
> > > does the above match the top-level idea of this patch?
> >
> > I'm still studying this patch but your understanding seems right to me.
> >
>
> Cool. While studying, if you can try to think whether this approach is
> different from the global coordinator based approach then it would be
> great.  Here is my initial thought apart from other reasons the global
> coordinator based design can help us to do the global transaction
> management and snapshots.  It can allocate xids for each transaction
> and then collect the list of running xacts (or CSN) from each node and
> then prepare a global snapshot that can be used to perform any
> transaction. OTOH, in the design proposed in this patch, we don't need any
> coordinator to manage transactions and snapshots because each node's
> current CSN will be sufficient for snapshot and visibility as
> explained above.

Yeah, my thought is the same as yours. Since both approaches have strong
points and weak points, I cannot say which is the better approach,
but that 2PC patch would go well together with the design proposed in
this patch.

> Now, sure this assumes that there is no clock skew
> on different nodes or somehow we take care of the same (Note that in
> the proposed patch the CSN is a timestamp.).

As far as I read the Clock-SI paper, the clock skew is taken care of by
putting some waits on the transaction start and on reading tuples on the
remote node.
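
As a rough sketch of that idea (my own simplified model of the Clock-SI
wait, not the patch's code): a node asked to serve a snapshot timestamp
that is ahead of its local clock simply waits until the clock catches
up, which bounds the effect of skew.

```python
def wait_for_snapshot(snapshot_csn, now, sleep):
    """Delay until the local clock reaches the snapshot timestamp.

    `now` and `sleep` are injected so the behaviour can be simulated
    deterministically; in a real system they would be the node's clock.
    """
    while now() < snapshot_csn:
        sleep(snapshot_csn - now())
    return now()
```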

>
> > > I am not sure
> > > if my understanding of this patch at this stage is completely correct
> > > or whether we want to follow the approach of this patch but I think at
> > > least lets first be sure if such a top-level idea can achieve what we
> > > want to do here.
> > >
> > > > Considering how to integrate this global snapshot feature with the 2PC
> > > > patch, what the 2PC patch needs to at least change is to allow FDW to
> > > > store an FDW-private data that is passed to subsequent FDW transaction
> > > > API calls. Currently, in the current 2PC patch, we call Prepare API
> > > > for each participant servers one by one, and the core pass only
> > > > metadata such as ForeignServer, UserMapping, and global transaction
> > > > identifier. So it's not easy to calculate the maximum CSN across
> > > > multiple transaction API calls. I think we can change the 2PC patch to
> > > > add a void pointer into FdwXactRslvState, struct passed from the core,
> > > > in order to store FDW-private data. It's going to be the maximum CSN
> > > > in this case. That way, at the first Prepare API calls postgres_fdw
> > > > allocates the space and stores CSN to that space. And at subsequent
> > > > Prepare API calls it can calculate the maximum of csn, and then is
> > > > able to the step 3 to 6 when preparing the transaction on the last
> > > > participant. Another idea would be to change 2PC patch so that the
> > > > core passes a bunch of participants grouped by FDW.
> > > >
> > >
> > > IIUC with this the coordinator needs the communication with the nodes
> > > twice at the prepare stage, once to prepare the transaction in each
> > > node and get CSN from each node and then to communicate MaxCSN to each
> > > node?
> >
> > Yes, I think so too.
> >
> > > Also, we probably need InDoubt CSN status at prepare phase to
> > > make snapshots and global visibility work.
> >
> > I think it depends on how global CSN feature works.
> >
> > For instance, in that 2PC patch, if the coordinator crashes during
> > preparing a foreign transaction, the global transaction manager
> > recovers and regards it as "prepared" regardless of the foreign
> > transaction actually having been prepared. And it sends ROLLBACK
> > PREPARED after recovery completed. With global CSN patch, as you
> > mentioned, at prepare phase the coordinator needs to communicate
> > participants twice other than sending PREPARE TRANSACTION:
> > pg_global_snapshot_prepare() and pg_global_snapshot_assign().
> >
> > If global CSN patch needs different cleanup work depending on the CSN
> > status, we will need InDoubt CSN status so that the global transaction
> > manager can distinguish between a foreign transaction that has
> > executed pg_global_snapshot_prepare() and the one that has executed
> > pg_global_snapshot_assign().
> >
> > On the other hand, if it's enough to just send ROLLBACK or ROLLBACK
> > PREPARED in that case, I think we don't need InDoubt CSN status. There
> > is no difference between those foreign transactions from the global
> > transaction manager perspective.
> >
>
> I think InDoubt status helps in checking visibility in the proposed
> patch wherein if we find the status of the transaction as InDoubt, we
> wait till we get some valid CSN for it as explained in my previous
> email.  So whether we use it for Rollback/Rollback Prepared, it is
> required for this design.

Yes, InDoubt status is required for checking visibility. My comment
was it's not necessary from the perspective of atomic commit.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

From
Amit Kapila
Date:
On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > Cool. While studying, if you can try to think whether this approach is
> > different from the global coordinator based approach then it would be
> > great.  Here is my initial thought apart from other reasons the global
> > coordinator based design can help us to do the global transaction
> > management and snapshots.  It can allocate xids for each transaction
> > and then collect the list of running xacts (or CSN) from each node and
> > then prepare a global snapshot that can be used to perform any
> > transaction. OTOH, in the design proposed in this patch, we don't need any
> > coordinator to manage transactions and snapshots because each node's
> > current CSN will be sufficient for snapshot and visibility as
> > explained above.
>
> Yeah, my thought is the same as yours. Since both approaches have strong
> points and weak points, I cannot say which is the better approach,
> but that 2PC patch would go well together with the design proposed in
> this patch.
>

I also think with some modifications we might be able to integrate
your 2PC patch with the patches proposed here.  However, if we decide
not to pursue this approach then it is uncertain whether your proposed
patch can be further enhanced for global visibility.  Does it make
sense to dig the design of this approach a bit further so that we can
be somewhat more sure that pursuing your 2PC patch would be a good
idea and we can, in fact, enhance it later for global visibility?
AFAICS, Andrey has mentioned a couple of problems with this approach
[1], the details of which I am also not sure at this stage but if we
can dig those it would be really great.

> > Now, sure this assumes that there is no clock skew
> > on different nodes or somehow we take care of the same (Note that in
> > the proposed patch the CSN is a timestamp.).
>
> As far as I read the Clock-SI paper, the clock skew is taken care of by
> putting some waits on the transaction start and on reading tuples on the
> remote node.
>

Oh, but I am not sure if this patch is able to solve that, and if so, how?

> >
> > I think InDoubt status helps in checking visibility in the proposed
> > patch wherein if we find the status of the transaction as InDoubt, we
> > wait till we get some valid CSN for it as explained in my previous
> > email.  So whether we use it for Rollback/Rollback Prepared, it is
> > required for this design.
>
> Yes, InDoubt status is required for checking visibility. My comment
> was it's not necessary from the perspective of atomic commit.
>

True and probably we can enhance your patch for InDoubt status if required.

Thanks for moving this work forward.  I know the progress is a bit
slow due to various reasons but I think it is important to keep making
some progress.

[1] - https://www.postgresql.org/message-id/f23083b9-38d0-6126-eb6e-091816a78585%40postgrespro.ru

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello,

While I'm thinking of the following issues of the current approach Andrey raised, I'm getting puzzled and can't help
asking certain things.  Please forgive me if I'm missing some discussions in the past.

> 1. Dependency on clocks synchronization
> 2. Needs guarantees of monotonically increasing of the CSN in the case 
> of an instance restart/crash etc.
> 3. We need to delay increasing of OldestXmin because it can be needed 
> for a transaction snapshot at another node.

While Clock-SI seems to be considered the most promising approach for global serializability here,

* Why does Clock-SI get so much attention?  How did Clock-SI become the only choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other organization use Clock-SI?


Has anyone examined the following Multiversion Commitment Ordering (MVCO)?  Although I haven't understood this yet, it insists that no concurrency control information including timestamps needs to be exchanged among the cluster nodes.  I'd appreciate it if someone could give an opinion.

Commitment Ordering Based Distributed Concurrency Control for Bridging Single and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on Data Engineering: Interoperability in Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
https://ieeexplore.ieee.org/document/281924?arnumber=281924


The author of the above paper, Yoav Raz, seems to have had a strong passion, at least until 2011, for making people believe in the mightiness of Commitment Ordering (CO) for global serializability.  However, he complains (sadly) that almost all researchers ignore his theory, as written on his following site and the Wikipedia page for Commitment Ordering.  Does anyone know why CO is ignored?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


FWIW, some researchers including Michael Stonebraker evaluated the performance of various distributed concurrency control methods in 2017.  Has anyone looked at this?  (I don't mean there was some promising method that we might want to adopt.)

An Evaluation of Distributed Concurrency Control
Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017.
Proc. VLDB Endow. 10, 5 (January 2017), 553-564. 
https://doi.org/10.14778/3055540.3055548


Regards
Takayuki Tsunakawa


Re: Global snapshots

From
Masahiko Sawada
Date:
On Mon, 13 Jul 2020 at 20:18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 10, 2020 at 8:46 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Wed, 8 Jul 2020 at 21:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > Cool. While studying, if you can try to think whether this approach is
> > > different from the global coordinator based approach then it would be
> > > great.  Here is my initial thought apart from other reasons the global
> > > coordinator based design can help us to do the global transaction
> > > management and snapshots.  It can allocate xids for each transaction
> > > and then collect the list of running xacts (or CSN) from each node and
> > > then prepare a global snapshot that can be used to perform any
> > > transaction. OTOH, in the design proposed in this patch, we don't need any
> > > coordinator to manage transactions and snapshots because each node's
> > > current CSN will be sufficient for snapshot and visibility as
> > > explained above.
> >
> > Yeah, my thought is the same as yours. Since both approaches have strong
> > points and weak points, I cannot say which is the better approach,
> > but that 2PC patch would go well together with the design proposed in
> > this patch.
> >
>
> I also think with some modifications we might be able to integrate
> your 2PC patch with the patches proposed here.  However, if we decide
> not to pursue this approach then it is uncertain whether your proposed
> patch can be further enhanced for global visibility.

Yes. I think even if we decide not to pursue this approach, that is not
a reason for not pursuing the 2PC patch. If so, we would need to
reconsider the design of the 2PC patch so that it generically resolves
the atomic commit problem.

> Does it make
> sense to dig the design of this approach a bit further so that we can
> be somewhat more sure that pursuing your 2PC patch would be a good
> idea and we can, in fact, enhance it later for global visibility?

Agreed.

> AFAICS, Andrey has mentioned a couple of problems with this approach
> [1], the details of which I am also not sure at this stage but if we
> can dig those it would be really great.
>
> > > Now, sure this assumes that there is no clock skew
> > > on different nodes or somehow we take care of the same (Note that in
> > > the proposed patch the CSN is a timestamp.).
> >
> > As far as I read the Clock-SI paper, the clock skew is taken care of by
> > putting some waits on the transaction start and on reading tuples on the
> > remote node.
> >
>
> Oh, but I am not sure if this patch is able to solve that, and if so, how?

I'm not sure of the details but, as far as I read the patch, I guess the
transaction will sleep at GlobalSnapshotSync() when the received
global CSN is greater than the local global CSN.
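
If that reading is right, the synchronization step could be modelled
like this (a hypothetical sketch, not GlobalSnapshotSync()'s actual
logic), returning how long the transaction had to sleep:

```python
def global_snapshot_sync(received_csn, now, sleep):
    # Sleep while the received global CSN is ahead of the local clock;
    # returns 0.0 when the local clock was already ahead.
    slept = 0.0
    while now() < received_csn:
        delta = received_csn - now()
        sleep(delta)
        slept += delta
    return slept
```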

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hi Andrey san, Movead san,


From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
> While Clock-SI seems to be considered the most promising approach for global
> serializability here,
> 
> * Why does Clock-SI get so much attention?  How did Clock-SI become the
> only choice?
> 
> * Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.  Microsoft holds this until 2031.  I couldn't find this with the keyword "Clock-SI".


US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa



Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
On 7/27/20 11:22 AM, tsunakawa.takay@fujitsu.com wrote:
> Hi Andrey san, Movead san,
> 
> 
> From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
>> While Clock-SI seems to be considered the most promising approach for global
>> serializability here,
>>
>> * Why does Clock-SI get so much attention?  How did Clock-SI become the
>> only choice?
>>
>> * Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
>> organization use Clock-SI?
> 
> Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.  Microsoft holds this until 2031.  I couldn't find this with the keyword "Clock-SI".
> 
> 
> US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
> https://patents.google.com/patent/US8356007
> 
> 
> If it is, can we circumvent this patent?
> 
> 
> Regards
> Takayuki Tsunakawa
> 
> 

Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.

-- 
regards,
Andrey Lepikhov
Postgres Professional



Re: Global snapshots

From
Alexey Kondratov
Date:
Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:
> On 7/27/20 11:22 AM, tsunakawa.takay@fujitsu.com wrote:
>> 
>> US8356007B2 - Distributed transaction management for database systems 
>> with multiversioning - Google Patents
>> https://patents.google.com/patent/US8356007
>> 
>> 
>> If it is, can we circumvent this patent?
>> 
> 
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.

I had a look at the patch set, although it is quite outdated, especially 
0003.

Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in the postgres_fdw 
is achieved by the usage of 2PC. I think that this postgres_fdw 2PC 
support should be separated from global snapshots. It could be useful to 
have such atomic distributed transactions even without a proper 
visibility, which is guaranteed by the global snapshot. Especially 
taking into account the doubts about Clock-SI and general questions 
about algorithm choosing criteria above in the thread.

Thus, I propose to split 0003 into two parts and add a separate GUC 
'postgres_fdw.use_twophase', which could be turned on independently from 
'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, 
then 2PC should be forcibly turned on as well.

Second, there are some problems with errors handling in the 0003 (thanks 
to Arseny Sher for review).

+error:
+            if (!res)
+            {
+                sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+                BroadcastCmd(sql);
+                elog(ERROR, "Failed to PREPARE transaction on remote node");
+            }

It seems that we should never reach this point, just because 
BroadcastStmt will throw an ERROR if it fails to prepare transaction on 
the foreign server:

+            if (PQresultStatus(result) != expectedStatus ||
+                (handler && !handler(result, arg)))
+            {
+                elog(WARNING, "Failed command %s: status=%d, expected status=%d", sql, PQresultStatus(result), expectedStatus);
+                pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+                allOk = false;
+            }

Moreover, it doesn't make much sense to try to abort prepared xacts, 
since if we failed to prepare it somewhere, then some foreign servers 
may become unavailable already and this doesn't provide us a 100% 
guarantee of clean up.

+    /* COMMIT open transaction of we were doing 2PC */
+    if (fdwTransState->two_phase_commit &&
+        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+    {
+        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+    }

At this point, the host (local) transaction is already committed and 
there is no way to abort it gracefully. However, BroadcastCmd may raise 
an ERROR that will cause a PANIC, since it is a non-recoverable state:

PANIC:  cannot abort transaction 487, it was already committed

Attached is a patch, which implements a plain 2PC in the postgres_fdw 
and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these error 
handling issues above and tries to add proper comments everywhere. I 
think that 0003 should be rebased on top of it, or it could be a 
first patch in the set, since it may be used independently. What do you 
think?
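
For reference, the overall flow under discussion can be sketched as a
toy model (with hypothetical participant objects, not the patch's API);
it also shows why a failure while broadcasting COMMIT PREPARED is so
awkward: the local commit has already happened by then.

```python
def broadcast_2pc(participants, commit_local):
    """Toy model of the postgres_fdw-side 2PC flow discussed above."""
    for p in participants:
        p.prepare()          # PREPARE TRANSACTION 'gid'; an ERROR here
                             # can still abort the whole transaction
    commit_local()           # local commit: the point of no return
    for p in participants:
        p.commit_prepared()  # COMMIT PREPARED 'gid'; a failure here
                             # leaves prepared xacts to resolve later
```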


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/05 3:31, Alexey Kondratov wrote:
> Hi,
> 
> On 2020-07-27 09:44, Andrey V. Lepikhov wrote:
>> On 7/27/20 11:22 AM, tsunakawa.takay@fujitsu.com wrote:
>>>
>>> US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
>>> https://patents.google.com/patent/US8356007
>>>
>>>
>>> If it is, can we circumvent this patent?
>>>
>>
>> Thank you for the research (and previous links too).
>> I haven't seen this patent before. This should be carefully studied.
> 
> I had a look at the patch set, although it is quite outdated, especially 0003.
> 
> Two thoughts about 0003:
> 
> First, IIUC atomicity of the distributed transaction in the postgres_fdw is achieved by the usage of 2PC. I think that this postgres_fdw 2PC support should be separated from global snapshots.

Agreed.


> It could be useful to have such atomic distributed transactions even without a proper visibility, which is guaranteed by the global snapshot. Especially taking into account the doubts about Clock-SI and general questions about algorithm choosing criteria above in the thread.
> 
> Thus, I propose to split 0003 into two parts and add a separate GUC 'postgres_fdw.use_twophase', which could be turned on independently from 'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, then 2PC should be forcibly turned on as well.
> 
> Second, there are some problems with errors handling in the 0003 (thanks to Arseny Sher for review).
> 
> +error:
> +            if (!res)
> +            {
> +                sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
> +                BroadcastCmd(sql);
> +                elog(ERROR, "Failed to PREPARE transaction on remote node");
> +            }
> 
> It seems that we should never reach this point, just because BroadcastStmt will throw an ERROR if it fails to prepare transaction on the foreign server:
> 
> +            if (PQresultStatus(result) != expectedStatus ||
> +                (handler && !handler(result, arg)))
> +            {
> +                elog(WARNING, "Failed command %s: status=%d, expected status=%d", sql, PQresultStatus(result), expectedStatus);
> +                pgfdw_report_error(ERROR, result, entry->conn, true, sql);
> +                allOk = false;
> +            }
> 
> Moreover, it doesn't make much sense to try to abort prepared xacts, since if we failed to prepare it somewhere, then some foreign servers may become unavailable already and this doesn't provide us a 100% guarantee of clean up.
> 
> +    /* COMMIT open transaction of we were doing 2PC */
> +    if (fdwTransState->two_phase_commit &&
> +        (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
> +    {
> +        BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
> +    }
> 
> At this point, the host (local) transaction is already committed and there is no way to abort it gracefully. However, BroadcastCmd may raise an ERROR that will cause a PANIC, since it is a non-recoverable state:
> 
> PANIC:  cannot abort transaction 487, it was already committed
> 
> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these error handling issues above and tries to add proper comments everywhere. I think that 0003 should be rebased on top of it, or it could be a first patch in the set, since it may be used independently. What do you think?

Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

Regards,

[1]
https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Global snapshots

From
Alexey Kondratov
Date:
On 2020-09-08 05:49, Fujii Masao wrote:
> On 2020/09/05 3:31, Alexey Kondratov wrote:
>> 
>> Attached is a patch, which implements a plain 2PC in the postgres_fdw 
>> and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these 
>> error handling issues above and tries to add proper comments 
>> everywhere. I think that 0003 should be rebased on top of it, or 
>> it could be a first patch in the set, since it may be used 
>> independently. What do you think?
> 
> Thanks for the patch!
> 
> Sawada-san was proposing another 2PC patch at [1]. Do you have any 
> thoughts
> about pros and cons between your patch and Sawada-san's?
> 
> [1]
> https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com

Thank you for the link!

After a quick look at Sawada-san's patch set, I think that there are 
two major differences:

1. There is a built-in foreign xacts resolver in the [1], which should 
be much more convenient from the end-user perspective. It involves huge 
in-core changes and additional complexity that is of course worth it.

However, it's still not clear for me that it is possible to resolve all 
foreign prepared xacts on the Postgres' own side with a 100% guarantee. 
Imagine a situation when the coordinator node is actually an HA cluster 
group (primary + sync + async replica) and it failed just after the 
PREPARE stage or after local COMMIT. In that case all foreign xacts 
will be left in the prepared state. After the failover process 
completes, the synchronous replica will become the new primary. Would 
it have all the required info to properly resolve orphan prepared xacts?

Probably, this situation is handled properly in the [1], but I've not 
yet finished a thorough reading of the patch set, though it has a great 
doc!

On the other hand, previous 0003 and my proposed patch rely on either 
manual resolution of hung prepared xacts or usage of external 
monitor/resolver. This approach is much simpler from the in-core 
perspective, but doesn't look as complete as [1] though.

2. In the patch from this thread all 2PC logic sits in the postgres_fdw, 
while [1] tries to put it into the generic fdw core, which also feels 
like a more general and architecturally correct way. However, how many 
of the currently available dozens of various FDWs are capable of 
performing 2PC? And how many of them are maintained well enough to 
adopt this new API? This is not an argument against [1] actually, since 
postgres_fdw is known to be the most advanced FDW and an early adopter 
of new features, just a little doubt about the usefulness of this 
preliminary generalisation.

Anyway, I think that [1] is a great work and really hope to find more 
time to investigate it deeper later this year.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/08 19:36, Alexey Kondratov wrote:
> On 2020-09-08 05:49, Fujii Masao wrote:
>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>
>>> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these error handling issues above and tries to add proper comments everywhere. I think that 0003 should be rebased on top of it, or it could be a first patch in the set, since it may be used independently. What do you think?
>>
>> Thanks for the patch!
>>
>> Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
>> about pros and cons between your patch and Sawada-san's?
>>
>> [1]
>> https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
> 
> Thank you for the link!
> 
> After a quick look at Sawada-san's patch set, I think that there are two major differences:

Thanks for sharing your thoughts! As far as I've quickly read your patch,
I basically agree with this view.


> 
> 1. There is a built-in foreign xacts resolver in the [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth it.
> 
> However, it's still not clear for me that it is possible to resolve all foreign prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually an HA cluster group (primary + sync + async replica) and it failed just after the PREPARE stage or after local COMMIT. In that case all foreign xacts will be left in the prepared state. After the failover process completes, the synchronous replica will become the new primary. Would it have all the required info to properly resolve orphan prepared xacts?

IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
a special function for manual resolution, so there may be some cases
where manual resolution is necessary.


> 
> Probably, this situation is handled properly in the [1], but I've not yet finished a thorough reading of the patch
set, though it has a great doc!
 
> 
> On the other hand, previous 0003 and my proposed patch rely on either manual resolution of hung prepared xacts or
usage of external monitor/resolver. This approach is much simpler from the in-core perspective, but doesn't look as
complete as [1] though.
 
> 
> 2. In the patch from this thread all 2PC logic sit in the postgres_fdw, while [1] tries to put it into the generic
fdw core, which also feels like a more general and architecturally correct way. However, how many from the currently
available dozens of various FDWs are capable to perform 2PC? And how many of them are maintained well enough to adopt
this new API? This is not an argument against [1] actually, since postgres_fdw is known to be the most advanced FDW and
an early adopter of new feature, just a little doubt about a usefulness of this preliminary generalisation.
 

If we implement the 2PC feature only for PostgreSQL sharding using
postgres_fdw, IMO it's ok to support only postgres_fdw.
But if we implement 2PC as an improvement to FDWs in general, independently
of PostgreSQL sharding and global visibility, I think that it's
necessary to support other FDWs. I'm not sure how many FDWs
will actually support this new 2PC interface. But if the interface is
not so complicated, I *guess* some FDWs will support it in the near future.

Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()). Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().
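
To make the ordering concern concrete, here is a toy Python model of the commit sequence being discussed (purely illustrative; the class and function names are invented, and this is of course not Postgres code): once the local commit becomes visible, a failure while issuing COMMIT PREPARED to a remote can no longer roll anything back.

```python
# Toy model of the commit sequence under discussion (not Postgres code).
# Remote and commit_sequence are hypothetical names used for illustration.

class Remote:
    def __init__(self, name, fail_on_commit=False):
        self.name = name
        self.fail_on_commit = fail_on_commit
        self.state = "active"

    def prepare(self):
        self.state = "prepared"          # models PREPARE TRANSACTION

    def commit_prepared(self):
        if self.fail_on_commit:
            raise RuntimeError("could not commit on " + self.name)
        self.state = "committed"         # models COMMIT PREPARED


def commit_sequence(remotes):
    """Mimics: prepare remotes -> local commit -> commit prepared on remotes."""
    for r in remotes:                    # 1st phase
        r.prepare()
    local_committed = True               # analogue of ProcArrayEndTransaction():
                                         # the local commit is now visible
    failed = []
    for r in remotes:                    # 2nd phase, analogue of the
        try:                             # XACT_EVENT_COMMIT callback
            r.commit_prepared()
        except RuntimeError:
            failed.append(r.name)        # too late to roll back locally
    return local_committed, failed


remotes = [Remote("shard1"), Remote("shard2", fail_on_commit=True)]
ok, failed = commit_sequence(remotes)
# The local commit stands even though shard2 is stuck in "prepared".
```

The point of the sketch is only the ordering: any error raised in the second loop happens after the local transaction is already visible as committed.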


> 
> Anyway, I think that [1] is a great work and really hope to find more time to investigate it deeper later this year.

I'm sure your work is also great! I hope we can discuss the design
of 2PC feature together!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Global snapshots

From
Alexey Kondratov
Date:
On 2020-09-08 14:48, Fujii Masao wrote:
> On 2020/09/08 19:36, Alexey Kondratov wrote:
>> On 2020-09-08 05:49, Fujii Masao wrote:
>>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>> 
>>>> Attached is a patch, which implements a plain 2PC in the 
>>>> postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it 
>>>> solves these errors handling issues above and tries to add proper 
>>>> comments everywhere. I think, that 0003 should be rebased on the top 
>>>> of it, or it could be a first patch in the set, since it may be used 
>>>> independently. What do you think?
>>> 
>>> Thanks for the patch!
>>> 
>>> Sawada-san was proposing another 2PC patch at [1]. Do you have any 
>>> thoughts
>>> about pros and cons between your patch and Sawada-san's?
>>> 
>>> [1]
>>> https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
>> 
>> Thank you for the link!
>> 
>> After a quick look on the Sawada-san's patch set I think that there 
>> are two major differences:
> 
> Thanks for sharing your thought! As far as I read your patch quickly,
> I basically agree with your this view.
> 
> 
>> 
>> 1. There is a built-in foreign xacts resolver in the [1], which should 
>> be much more convenient from the end-user perspective. It involves 
>> huge in-core changes and additional complexity that is of course worth 
>> of.
>> 
>> However, it's still not clear for me that it is possible to resolve 
>> all foreign prepared xacts on the Postgres' own side with a 100% 
>> guarantee. Imagine a situation when the coordinator node is actually a 
>> HA cluster group (primary + sync + async replica) and it failed just 
>> after PREPARE stage of after local COMMIT. In that case all foreign 
>> xacts will be left in the prepared state. After failover process 
>> complete synchronous replica will become a new primary. Would it have 
>> all required info to properly resolve orphan prepared xacts?
> 
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
> 

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change to
+rollback, therefore at this time some participants might be prepared whereas
+some are not prepared. The former foreign transactions need to be resolved
+using pg_resolve_foreign_xact() manually and the latter ends transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.

> 
> Implementing 2PC feature only inside postgres_fdw seems to cause
> another issue; COMMIT PREPARED is issued to the remote servers
> after marking the local transaction as committed
> (i.e., ProcArrayEndTransaction()).
> 

According to the Sawada-san's v25 0002 the logic is pretty much the same 
there:

+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() 
in the CommitTransaction(). Thus, I don't see much difference between 
this approach and CallXactCallbacks() usage regarding this point.

> Is this safe? This issue happens
> because COMMIT PREPARED is issued via
> CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> is called after ProcArrayEndTransaction().
> 

Once the transaction is committed locally any ERROR (or higher level 
message) will be escalated to PANIC. And I do see possible ERROR level 
messages in the postgresCommitForeignTransaction() for example:

+    if (PQresultStatus(res) != PGRES_COMMAND_OK)
+        ereport(ERROR, (errmsg("could not commit transaction on server %s",
+                               frstate->server->servername)));

I don't think that it's very convenient to get a PANIC every time we 
fail to commit one of the prepared foreign xacts, since that may be 
not so rare in a distributed system. That's why I tried to get rid of 
possible ERRORs as far as possible in my proposed patch.
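
The error-demotion idea can be sketched roughly like this (a hypothetical Python model, not the actual patch): once the local transaction has committed, failures in the second phase are downgraded to warnings and the affected servers are recorded for later resolution, instead of raising an error that would escalate to PANIC.

```python
# Sketch of demoting post-commit errors to warnings (illustrative only;
# the function and server names below are invented for this example).
import logging

def commit_prepared_all(servers, commit_fn):
    """Try COMMIT PREPARED on every server; never raise after local commit.

    servers:   list of server names (hypothetical)
    commit_fn: callable(server) that raises on failure
    Returns the servers left for manual/external resolution.
    """
    unresolved = []
    for s in servers:
        try:
            commit_fn(s)
        except Exception as e:
            # WARNING instead of ERROR: an ERROR at this point would be
            # escalated because the local transaction is already committed.
            logging.warning("could not commit prepared xact on %s: %s", s, e)
            unresolved.append(s)
    return unresolved

def flaky(server):
    # Models one unreachable participant.
    if server == "nodeB":
        raise ConnectionError("server unreachable")

left = commit_prepared_all(["nodeA", "nodeB", "nodeC"], flaky)
# nodeA and nodeC committed; nodeB is left prepared and needs resolution later.
```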


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Global snapshots

From
Masahiko Sawada
Date:
On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:
>
> On 2020-09-08 14:48, Fujii Masao wrote:
> > On 2020/09/08 19:36, Alexey Kondratov wrote:
> >> On 2020-09-08 05:49, Fujii Masao wrote:
> >>> On 2020/09/05 3:31, Alexey Kondratov wrote:
> >>>>
> >>>> Attached is a patch, which implements a plain 2PC in the
> >>>> postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it
> >>>> solves these errors handling issues above and tries to add proper
> >>>> comments everywhere. I think, that 0003 should be rebased on the top
> >>>> of it, or it could be a first patch in the set, since it may be used
> >>>> independently. What do you think?
> >>>
> >>> Thanks for the patch!
> >>>
> >>> Sawada-san was proposing another 2PC patch at [1]. Do you have any
> >>> thoughts
> >>> about pros and cons between your patch and Sawada-san's?
> >>>
> >>> [1]
> >>> https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
> >>
> >> Thank you for the link!
> >>
> >> After a quick look on the Sawada-san's patch set I think that there
> >> are two major differences:
> >
> > Thanks for sharing your thought! As far as I read your patch quickly,
> > I basically agree with your this view.
> >
> >
> >>
> >> 1. There is a built-in foreign xacts resolver in the [1], which should
> >> be much more convenient from the end-user perspective. It involves
> >> huge in-core changes and additional complexity that is of course worth
> >> of.
> >>
> >> However, it's still not clear for me that it is possible to resolve
> >> all foreign prepared xacts on the Postgres' own side with a 100%
> >> guarantee. Imagine a situation when the coordinator node is actually a
> >> HA cluster group (primary + sync + async replica) and it failed just
> >> after PREPARE stage of after local COMMIT. In that case all foreign
> >> xacts will be left in the prepared state. After failover process
> >> complete synchronous replica will become a new primary. Would it have
> >> all required info to properly resolve orphan prepared xacts?
> >
> > IIUC, yes, the information required for automatic resolution is
> > WAL-logged and the standby tries to resolve those orphan transactions
> > from WAL after the failover. But Sawada-san's patch provides
> > the special function for manual resolution, so there may be some cases
> > where manual resolution is necessary.
> >
>
> I've found a note about manual resolution in the v25 0002:
>
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change
> to
> +rollback, therefore at this time some participants might be prepared
> whereas
> +some are not prepared. The former foreign transactions need to be
> resolved
> +using pg_resolve_foreign_xact() manually and the latter ends
> transaction
> +in one-phase by calling RollbackForeignTransaction() API.
>
> but it's not yet clear for me.

Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reason, e.g. bugs.

>
> >
> > Implementing 2PC feature only inside postgres_fdw seems to cause
> > another issue; COMMIT PREPARED is issued to the remote servers
> > after marking the local transaction as committed
> > (i.e., ProcArrayEndTransaction()).
> >
>
> According to the Sawada-san's v25 0002 the logic is pretty much the same
> there:
>
> +2. Pre-Commit phase (1st phase of two-phase commit)
>
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
>
> +4. Post-Commit Phase (2nd phase of two-phase commit)
>
> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact /
> FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction()
> in the CommitTransaction(). Thus, I don't see many difference between
> these approach and CallXactCallbacks() usage regarding this point.
>
> > Is this safe? This issue happens
> > because COMMIT PREPARED is issued via
> > CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
> > is called after ProcArrayEndTransaction().
> >
>
> Once the transaction is committed locally any ERROR (or higher level
> message) will be escalated to PANIC.

I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?

> And I do see possible ERROR level
> messages in the postgresCommitForeignTransaction() for example:
>
> +       if (PQresultStatus(res) != PGRES_COMMAND_OK)
> +               ereport(ERROR, (errmsg("could not commit transaction on server %s",
> +                                                          frstate->server->servername)));
>
> I don't think that it's very convenient to get a PANIC every time we
> failed to commit one of the prepared foreign xacts, since it could be
> not so rare in the distributed system. That's why I tried to get rid of
> possible ERRORs as far as possible in my proposed patch.
>

In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error happens during committing a
foreign prepared transaction, we just need to relaunch the resolver
process and try again. During that, the backend process will be
just waiting. If a backend process raises an error after the local
commit, the client will see transaction failure despite the local
transaction having been committed. An error could happen even by
palloc. So the patch uses a background worker to commit prepared
foreign transactions, not by backend itself.
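
That division of labor might look roughly like this (a hypothetical Python sketch; the real patch implements this as a background worker in C): the backend only waits, while a resolver loop retries the second phase until every prepared foreign transaction is resolved, so errors there never abort the already-committed local transaction.

```python
# Illustrative model of a resolver process retrying COMMIT PREPARED.
# resolve_all and flaky_commit are invented names for this sketch.

def resolve_all(pending, try_commit, max_rounds=10):
    """Retry committing prepared foreign xacts until none remain.

    pending:    set of (server, xact_id) pairs awaiting COMMIT PREPARED
    try_commit: callable returning True on success, False on failure
    """
    for _ in range(max_rounds):
        # Keep only the participants that still failed this round.
        pending = {p for p in pending if not try_commit(p)}
        if not pending:
            return True     # fully resolved; the waiting backend is released
    return False            # still unresolved; the resolver keeps running

attempts = {}
def flaky_commit(p):
    # Succeeds on the second attempt, modeling a transient network failure.
    attempts[p] = attempts.get(p, 0) + 1
    return attempts[p] >= 2

done = resolve_all({("node1", 42), ("node2", 42)}, flaky_commit)
# done is True: the transient failures were simply retried.
```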

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

From
Alexey Kondratov
Date:
On 2020-09-09 08:35, Masahiko Sawada wrote:
> On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
> <a.kondratov@postgrespro.ru> wrote:
>> 
>> On 2020-09-08 14:48, Fujii Masao wrote:
>> >
>> > IIUC, yes, the information required for automatic resolution is
>> > WAL-logged and the standby tries to resolve those orphan transactions
>> > from WAL after the failover. But Sawada-san's patch provides
>> > the special function for manual resolution, so there may be some cases
>> > where manual resolution is necessary.
>> >
>> 
>> I've found a note about manual resolution in the v25 0002:
>> 
>> +After that we prepare all foreign transactions by calling
>> +PrepareForeignTransaction() API. If we failed on any of them we 
>> change
>> to
>> +rollback, therefore at this time some participants might be prepared
>> whereas
>> +some are not prepared. The former foreign transactions need to be
>> resolved
>> +using pg_resolve_foreign_xact() manually and the latter ends
>> transaction
>> +in one-phase by calling RollbackForeignTransaction() API.
>> 
>> but it's not yet clear for me.
> 
> Sorry, the above description in README is out of date. In the v25
> patch, it's true that if a backend fails to prepare a transaction on a
> foreign server, it’s possible that some foreign transactions are
> prepared whereas others are not. But at the end of the transaction
> after changing to rollback, the process does rollback (or rollback
> prepared) all of them. So the use case of pg_resolve_foreign_xact() is
> to resolve orphaned foreign prepared transactions or to resolve a
> foreign transaction that is not resolved for some reasons, bugs etc.
> 

OK, thank you for the explanation!

>> 
>> Once the transaction is committed locally any ERROR (or higher level
>> message) will be escalated to PANIC.
> 
> I think this is true only inside the critical section and it's not
> necessarily true for all errors happening after the local commit,
> right?
> 

It's not actually related to critical-section error escalation. Any 
error in the backend after the local commit and 
ProcArrayEndTransaction() will try to abort the current transaction and 
do RecordTransactionAbort(), but it's too late to do so, and a PANIC will 
be raised:

    /*
     * Check that we haven't aborted halfway through RecordTransactionCommit.
     */
    if (TransactionIdDidCommit(xid))
        elog(PANIC, "cannot abort transaction %u, it was already committed",
             xid);

At least that's how I understand it.


>> And I do see possible ERROR level
>> messages in the postgresCommitForeignTransaction() for example:
>> 
>> +       if (PQresultStatus(res) != PGRES_COMMAND_OK)
>> +               ereport(ERROR, (errmsg("could not commit transaction 
>> on server %s",
>> +                                                          
>> frstate->server->servername)));
>> 
>> I don't think that it's very convenient to get a PANIC every time we
>> failed to commit one of the prepared foreign xacts, since it could be
>> not so rare in the distributed system. That's why I tried to get rid 
>> of
>> possible ERRORs as far as possible in my proposed patch.
>> 
> 
> In my patch, the second phase of 2PC is executed only by the resolver
> process. Therefore, even if an error would happen during committing a
> foreign prepared transaction, we just need to relaunch the resolver
> process and trying again. During that, the backend process will be
> just waiting. If a backend process raises an error after the local
> commit, the client will see transaction failure despite the local
> transaction having been committed. An error could happen even by
> palloc. So the patch uses a background worker to commit prepared
> foreign transactions, not by backend itself.
> 

Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen a wrong thread for posting my patch and 
starting a discussion :) Activity from this thread moved to [1] and your 
solution with the built-in resolver is discussed at [2]. I'll try to take a 
closer look at v25 and write to [2] instead.


[1] 
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca

[2] 
https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com

Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/09 2:00, Alexey Kondratov wrote:
> On 2020-09-08 14:48, Fujii Masao wrote:
>> On 2020/09/08 19:36, Alexey Kondratov wrote:
>>> On 2020-09-08 05:49, Fujii Masao wrote:
>>>> On 2020/09/05 3:31, Alexey Kondratov wrote:
>>>>>
>>>>> Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'.
Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should
be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you
think?
>>>>
>>>> Thanks for the patch!
>>>>
>>>> Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
>>>> about pros and cons between your patch and Sawada-san's?
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/CA+fd4k4z6_B1ETEvQamwQhu4RX7XsrN5ORL7OhJ4B5B6sW-RgQ@mail.gmail.com
>>>
>>> Thank you for the link!
>>>
>>> After a quick look on the Sawada-san's patch set I think that there are two major differences:
>>
>> Thanks for sharing your thought! As far as I read your patch quickly,
>> I basically agree with your this view.
>>
>>
>>>
>>> 1. There is a built-in foreign xacts resolver in the [1], which should be much more convenient from the end-user
perspective. It involves huge in-core changes and additional complexity that is of course worth of.
 
>>>
>>> However, it's still not clear for me that it is possible to resolve all foreign prepared xacts on the Postgres' own
side with a 100% guarantee. Imagine a situation when the coordinator node is actually a HA cluster group (primary + sync
+ async replica) and it failed just after the PREPARE stage or after local COMMIT. In that case all foreign xacts will be
left in the prepared state. After the failover process completes, the synchronous replica will become the new primary. Would it have
all required info to properly resolve orphan prepared xacts?
 
>>
>> IIUC, yes, the information required for automatic resolution is
>> WAL-logged and the standby tries to resolve those orphan transactions
>> from WAL after the failover. But Sawada-san's patch provides
>> the special function for manual resolution, so there may be some cases
>> where manual resolution is necessary.
>>
> 
> I've found a note about manual resolution in the v25 0002:
> 
> +After that we prepare all foreign transactions by calling
> +PrepareForeignTransaction() API. If we failed on any of them we change to
> +rollback, therefore at this time some participants might be prepared whereas
> +some are not prepared. The former foreign transactions need to be resolved
> +using pg_resolve_foreign_xact() manually and the latter ends transaction
> +in one-phase by calling RollbackForeignTransaction() API.
> 
> but it's not yet clear for me.
> 
>>
>> Implementing 2PC feature only inside postgres_fdw seems to cause
>> another issue; COMMIT PREPARED is issued to the remote servers
>> after marking the local transaction as committed
>> (i.e., ProcArrayEndTransaction()).
>>
> 
> According to the Sawada-san's v25 0002 the logic is pretty much the same there:
> 
> +2. Pre-Commit phase (1st phase of two-phase commit)
> 
> +3. Commit locally
> +Once we've prepared all of them, commit the transaction locally.
> 
> +4. Post-Commit Phase (2nd phase of two-phase commit)
> 
> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / FdwXactParticipantEndTransaction happens after
ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't see many difference between these approach and
CallXactCallbacks() usage regarding this point.
 

IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
     PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
     2-1. WAL-log the commit record
     2-2. Update CLOG
     2-3. Wait for sync rep
     2-4. FdwXactWaitForResolution()
             Wait until COMMIT PREPARED commands are issued to the remote servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hi Andrey san,

From: Andrey V. Lepikhov <a.lepikhov@postgrespro.ru>
> > From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
> >> While Clock-SI seems to be considered the best promising for global
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.
> > Microsoft holds this until 2031.  I couldn't find this with the keyword
> > "Clock-SI."
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?
> >> 
> Thank you for the research (and previous links too).
> I haven't seen this patent before. This should be carefully studied.

I wanted to ask about this after I've published the revised scale-out design wiki, but I'm taking too long, so could
you share your study results?  I think we need to make it clear about the patent before discussing the code.  After we
hear your opinion, we also have to check to see if Clock-SI is patented or avoid it by modifying part of the algorithm.
Just in case we cannot use it, we have to proceed with thinking about alternatives.
 


Regards
Takayuki Tsunakawa



Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/10 10:38, tsunakawa.takay@fujitsu.com wrote:
> Hi Andrey san,
> 
> From: Andrey V. Lepikhov <a.lepikhov@postgrespro.ru>> > From: tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com>
>>>> While Clock-SI seems to be considered the best promising for global
>>>>>> Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.
>> Microsoft holds this until 2031.  I couldn't find this with the keyword
>> "Clock-SI.""
>>>
>>>
>>> US8356007B2 - Distributed transaction management for database systems
>> with multiversioning - Google Patents
>>> https://patents.google.com/patent/US8356007
>>>
>>>
>>> If it is, can we circumvent this patent?
>>>>
>> Thank you for the research (and previous links too).
>> I haven't seen this patent before. This should be carefully studied.
> 
> I wanted to ask about this after I've published the revised scale-out design wiki, but I'm taking too long, so could
you share your study results?  I think we need to make it clear about the patent before discussing the code.
 

Yes.

But I'm concerned that it's really hard to say there is no patent risk
around that. I'm not sure who in the community can judge that there is
no patent risk. Maybe no one? Anyway, I was thinking that Google Spanner,
YugabyteDB, etc. use a global transaction approach based on a clock
similar to Clock-SI. Since I've never heard that they have patent issues,
I was just thinking Clock-SI doesn't have any. No? This type of *guess* is not
safe, though...


>  After we hear your opinion, we also have to check to see if Clock-SI is patented or avoid it by modifying part of
the algorithm.  Just in case we cannot use it, we have to proceed with thinking about alternatives.
 

One alternative is to add only hooks into PostgreSQL core so that we can
implement the global transaction management outside. This idea was
discussed before as the title "eXtensible Transaction Manager API".

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Fujii Masao <masao.fujii@oss.nttdata.com>
> But I'm concerned about that it's really hard to say there is no patent risk
> around that. I'm not sure who can judge there is no patent risk,
> in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
> YugabyteDB, etc use the global transaction approach based on the clock
> similar to Clock-SI. Since I've never heard they have the patent issues,
> I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
> safe, though...

Hm, it may be difficult to be sure that the algorithm does not violate a patent.  But it may not be difficult to know
if the algorithm apparently violates a patent or is highly likely (for those who know Clock-SI well.)  At least,
Andrey-san seems to have felt that it needs careful study, so I guess he had some hunch.
 

I understand this community is sensitive to patents.  After the discussions at and after PGCon 2018, the community
concluded that it won't accept patented technology.  In the distant past, the community released Postgres 8.0 that
contains an IBM pending patent, ARC, and removed it in 8.0.2.  I wonder how this could be detected, and how hard it is
to cope with the patent issue.  Bruce warned that we should be careful not to violate Greenplum's patents.
 

E.25. Release 8.0.2
https://www.postgresql.org/docs/8.0/release-8-0-2.html
--------------------------------------------------
New cache management algorithm 2Q replaces ARC (Tom)
This was done to avoid a pending US patent on ARC. The 2Q code might be a few percentage points slower than ARC for
some work loads. A better cache management algorithm will appear in 8.1.
 
--------------------------------------------------


I think I'll try to contact the people listed in Clock-SI paper and the Microsoft patent to ask about this.  I'm going
to have a late summer vacation next week, so this is my summer homework?
 


> One alternative is to add only hooks into PostgreSQL core so that we can
> implement the global transaction management outside. This idea was
> discussed before as the title "eXtensible Transaction Manager API".

Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...


Regards
Takayuki Tsunakawa



Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/10 18:01, tsunakawa.takay@fujitsu.com wrote:
> From: Fujii Masao <masao.fujii@oss.nttdata.com>
>> But I'm concerned about that it's really hard to say there is no patent risk
>> around that. I'm not sure who can judge there is no patent risk,
>> in the community. Maybe no one? Anyway, I was thinking that Google Spanner,
>> YugabyteDB, etc use the global transaction approach based on the clock
>> similar to Clock-SI. Since I've never heard they have the patent issues,
>> I was just thinking Clock-SI doesn't have. No? This type of *guess* is not
>> safe, though...
> 
> Hm, it may be difficult to be sure that the algorithm does not violate a patent.  But it may not be difficult to know
if the algorithm apparently violates a patent or is highly likely (for those who know Clock-SI well.)  At least,
Andrey-san seems to have felt that it needs careful study, so I guess he had some hunch.
 
> 
> I understand this community is sensitive to patents.  After the discussions at and after PGCon 2018, the community
concluded that it won't accept patented technology.  In the distant past, the community released Postgres 8.0 that
contains an IBM's pending patent ARC, and removed it in 8.0.2.  I wonder how could this could be detected, and how hard
to cope with the patent issue.  Bruce warned that we should be careful not to violate Greenplum's patents.
 
> 
> E.25. Release 8.0.2
> https://www.postgresql.org/docs/8.0/release-8-0-2.html
> --------------------------------------------------
> New cache management algorithm 2Q replaces ARC (Tom)
> This was done to avoid a pending US patent on ARC. The 2Q code might be a few percentage points slower than ARC for
some work loads. A better cache management algorithm will appear in 8.1.
 
> --------------------------------------------------
> 
> 
> I think I'll try to contact the people listed in Clock-SI paper and the Microsoft patent to ask about this.

Thanks!


> I'm going to have a late summer vacation next week, so this is my summer homework?
> 
> 
>> One alternative is to add only hooks into PostgreSQL core so that we can
>> implement the global transaction management outside. This idea was
>> discussed before as the title "eXtensible Transaction Manager API".
> 
> Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...

But it may be worth revisiting this idea if we cannot avoid the patent issue.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Global snapshots

From
Alexey Kondratov
Date:
On 2020-09-09 20:29, Fujii Masao wrote:
> On 2020/09/09 2:00, Alexey Kondratov wrote:
>> 
>> According to the Sawada-san's v25 0002 the logic is pretty much the 
>> same there:
>> 
>> +2. Pre-Commit phase (1st phase of two-phase commit)
>> 
>> +3. Commit locally
>> +Once we've prepared all of them, commit the transaction locally.
>> 
>> +4. Post-Commit Phase (2nd phase of two-phase commit)
>> 
>> Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
>> FdwXactParticipantEndTransaction happens after 
>> ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't 
>> see many difference between these approach and CallXactCallbacks() 
>> usage regarding this point.
> 
> IIUC the commit logic in Sawada-san's patch looks like
> 
> 1. PreCommit_FdwXact()
>     PREPARE TRANSACTION command is issued
> 
> 2. RecordTransactionCommit()
>     2-1. WAL-log the commit record
>     2-2. Update CLOG
>     2-3. Wait for sync rep
>     2-4. FdwXactWaitForResolution()
>             Wait until COMMIT PREPARED commands are issued to the
> remote servers and completed.
> 
> 3. ProcArrayEndTransaction()
> 4. AtEOXact_FdwXact(true)
> 
> So ISTM that the timing of when COMMIT PREPARED is issued
> to the remote server is different between the patches.
> Am I missing something?
> 

No, you are right, sorry. At first glance I thought that 
AtEOXact_FdwXact was responsible for COMMIT PREPARED as well, but it is 
only calling FdwXactParticipantEndTransaction in the abort case.


Regards
-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



Re: Global snapshots

From
Amit Kapila
Date:
On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
> >> One alternative is to add only hooks into PostgreSQL core so that we can
> >> implement the global transaction management outside. This idea was
> >> discussed before as the title "eXtensible Transaction Manager API".
> >
> > Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...
>
> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>

It is not very clear what exactly we can do about the point raised by
Tsunakawa-San related to the patent in this technology, as I haven't seen
that discussed during other development, but maybe we can try to study
it a bit. One more thing I would like to bring up here is that there seem
to have been some concerns about this idea when it was originally
discussed [1]. It is not very clear to me whether all the concerns have
been addressed or not. If someone can summarize the concerns discussed and
how the latest patch is able to address them, that would be great.

Also, I am not sure, but maybe global deadlock detection also needs to
be considered, as that also seems related because it depends on
how we manage global transactions. We need to prevent deadlocks among
transaction operations spanning multiple nodes. Say a
transaction T-1 has updated row r-1 of tbl-1 on node-1 and tries to
update row r-1 of tbl-2 on node-2. Similarly, a transaction T-2
tries to perform those two operations in the reverse order. Now, this will
lead to a deadlock that spans multiple nodes, and our current
deadlock detector doesn't have that capability. Having some form of
global/distributed transaction id might help to resolve it, but I am not
sure how it can be solved with this Clock-SI-based algorithm.
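To illustrate what a global deadlock detector would have to do (this is my own sketch, not code from any patch; the node and transaction names are made up), one can union the per-node wait-for graphs on global transaction ids and search the merged graph for a cycle:

```python
# Sketch: global deadlock detection by merging per-node wait-for graphs.
# Each node reports edges (waiter_gxid, holder_gxid) for lock waits it
# sees locally; a cycle in the union of all edges is a global deadlock.
from collections import defaultdict

def find_global_deadlock(per_node_edges):
    """per_node_edges: {node_name: [(waiter, holder), ...]}.
    Returns a list of gxids forming a cycle, or None."""
    graph = defaultdict(set)
    for edges in per_node_edges.values():
        for waiter, holder in edges:
            graph[waiter].add(holder)

    # Depth-first search for a cycle in the merged graph.
    def dfs(gxid, path, on_path):
        for nxt in graph[gxid]:
            if nxt in on_path:                # found a cycle
                return path[path.index(nxt):]
            cycle = dfs(nxt, path + [nxt], on_path | {nxt})
            if cycle:
                return cycle
        return None

    for start in list(graph):
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

# The example above: T-1 waits for T-2 on node-2, T-2 waits for T-1 on node-1.
edges = {"node-1": [("T-2", "T-1")], "node-2": [("T-1", "T-2")]}
print(find_global_deadlock(edges))  # prints ['T-2', 'T-1']
```

The cycle search itself is the easy part; the hard part a real detector would face is collecting these edges from all nodes consistently enough to avoid false positives.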

As all these problems are related, I am insisting, on this
thread and the other thread "Transactions involving multiple postgres
foreign servers" [2], that we have a high-level idea of how distributed
transaction management will work before we decide on a particular
approach and commit one part of that patch.

[1] - https://www.postgresql.org/message-id/21BC916B-80A1-43BF-8650-3363CCDAE09C%40postgrespro.ru
[2] -
https://www.postgresql.org/message-id/CAA4eK1J86S%3DmeivVsH%2Boy%3DTwUC%2Byr9jj2VtmmqMfYRmgs2JzUA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.



Re: Global snapshots

From
Bruce Momjian
Date:
On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:
> Thank you for the link!
> 
> After a quick look on the Sawada-san's patch set I think that there are two
> major differences:
> 
> 1. There is a built-in foreign xacts resolver in [1], which should be
> much more convenient from the end-user perspective. It involves huge in-core
> changes and additional complexity that is, of course, worth it.
> 
> However, it's still not clear to me that it is possible to resolve all
> foreign prepared xacts on Postgres' own side with a 100% guarantee.
> Imagine a situation when the coordinator node is actually an HA cluster group
> (primary + sync + async replica) and it failed just after the PREPARE stage or
> after the local COMMIT. In that case all foreign xacts will be left in the
> prepared state. After the failover process completes, the synchronous replica
> will become the new primary. Would it have all the required info to properly
> resolve orphan prepared xacts?
> 
> Probably, this situation is handled properly in the [1], but I've not yet
> finished a thorough reading of the patch set, though it has a great doc!
> 
> On the other hand, previous 0003 and my proposed patch rely on either manual
> resolution of hung prepared xacts or usage of an external monitor/resolver.
> This approach is much simpler from the in-core perspective, but doesn't look
> as complete as [1].

Have we considered how someone would clean up foreign transactions if the
coordinating server dies?  Could it be done manually?  Would an external
resolver, rather than an internal one, make this easier?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Global snapshots

From
Alexey Kondratov
Date:
On 2020-09-18 00:54, Bruce Momjian wrote:
> On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:
>> Thank you for the link!
>> 
>> After a quick look on the Sawada-san's patch set I think that there 
>> are two
>> major differences:
>> 
>> 1. There is a built-in foreign xacts resolver in the [1], which should 
>> be
>> much more convenient from the end-user perspective. It involves huge 
>> in-core
>> changes and additional complexity that is of course worth of.
>> 
>> However, it's still not clear for me that it is possible to resolve 
>> all
>> foreign prepared xacts on the Postgres' own side with a 100% 
>> guarantee.
>> Imagine a situation when the coordinator node is actually a HA cluster 
>> group
>> (primary + sync + async replica) and it failed just after PREPARE 
>> stage of
>> after local COMMIT. In that case all foreign xacts will be left in the
>> prepared state. After failover process complete synchronous replica 
>> will
>> become a new primary. Would it have all required info to properly 
>> resolve
>> orphan prepared xacts?
>> 
>> Probably, this situation is handled properly in the [1], but I've not 
>> yet
>> finished a thorough reading of the patch set, though it has a great 
>> doc!
>> 
>> On the other hand, previous 0003 and my proposed patch rely on either 
>> manual
>> resolution of hung prepared xacts or usage of external 
>> monitor/resolver.
>> This approach is much simpler from the in-core perspective, but 
>> doesn't look
>> as complete as [1] though.
> 
> Have we considered how someone would clean up foreign transactions if 
> the
> coordinating server dies?  Could it be done manually?  Would an 
> external
> resolver, rather than an internal one, make this easier?

Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2]) 
use 2PC with a special gid format including the xid plus server 
identification info. Thus, one can select from pg_prepared_xacts, get the 
xid and coordinator info, then use txid_status() on the coordinator (or 
ex-coordinator) to get the transaction status, and finally either commit 
or abort these stale prepared xacts. Of course, this could be wrapped into 
some user-level support routines, as is done in [1].
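As a rough sketch of that manual resolution logic (the 'gtx:<coordinator>:<xid>' gid format below is hypothetical; the actual formats used in [1] and [2] differ), the decision procedure is just:

```python
# Sketch of resolving a stale prepared xact from its gid. The
# "gtx:<coordinator>:<xid>" format is hypothetical; the real patches
# encode this differently. txid_status() on the coordinator returns
# 'committed', 'aborted', or 'in progress'.

def parse_gid(gid):
    """Split a gid into (coordinator, xid)."""
    tag, coordinator, xid = gid.split(":")
    if tag != "gtx":
        raise ValueError("not a global transaction gid: %s" % gid)
    return coordinator, int(xid)

def resolution_command(gid, coordinator_status):
    """Decide what to run against the node holding the prepared xact."""
    if coordinator_status == "committed":
        return "COMMIT PREPARED '%s'" % gid
    if coordinator_status == "aborted":
        return "ROLLBACK PREPARED '%s'" % gid
    return None  # still in progress on the coordinator: leave it alone

coord, xid = parse_gid("gtx:node1:4242")
print(coord, xid)                               # node1 4242
print(resolution_command("gtx:node1:4242", "committed"))
# COMMIT PREPARED 'gtx:node1:4242'
```

The point is that everything the resolver needs is recoverable from the gid plus one txid_status() call, which is what makes a stateless external resolver possible.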

As for the benefits of using an external resolver, I think that there 
are some from the whole-system perspective:

1) If one follows the logic above, then this resolver could be 
stateless; it takes all the required info from the Postgres nodes 
themselves.

2) Then you can easily put it into a container, which makes it easier 
to deploy to all this 'cloud' stuff like Kubernetes.

3) Also, you can scale resolvers independently from Postgres nodes.

I do not think that either of these points is a game changer, but we use 
a very simple external resolver together with [2] in our sharding 
prototype and it works just fine so far.


[1] 
https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com

[2] 
https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru

-- 
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hi Andrey-san, all,

From: Andrey V. Lepikhov <a.lepikhov@postgrespro.ru>
> On 7/27/20 11:22 AM, tsunakawa.takay@fujitsu.com wrote:
> > Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.
> Microsoft holds this until 2031.  I couldn't find this with the keyword
> "Clock-SI.""
> >
> >
> > US8356007B2 - Distributed transaction management for database systems
> with multiversioning - Google Patents
> > https://patents.google.com/patent/US8356007
> >
> >
> > If it is, can we circumvent this patent?

> I haven't seen this patent before. This should be carefully studied.


I contacted 6 people individually: 3 holders of the patent and 3 different authors of the Clock-SI paper.  I got
replies from two people.  (It's a pity I couldn't get a reply from the main author of the Clock-SI paper.)

[Reply from the patent holder Per-Ake Larson]
--------------------------------------------------
Thanks for your interest in my patent. 

The answer to your question is: No, Clock-SI is not based on the patent - it was an entirely independent development.
The two approaches are similar in the sense that there is no global clock, the commit time of a distributed transaction
is the same in every partition where it modified data, and a transaction gets its snapshot timestamp from a local clock.
The difference is whether a distributed transaction gets its commit timestamp before or after the prepare phase in 2PC.

Hope this helpful.

Best regards,
Per-Ake
--------------------------------------------------


[Reply from the Clock-SI author Willy Zwaenepoel]
--------------------------------------------------
Thank you for your kind words about our work.

I was unaware of this patent at the time I wrote the paper. The two came out more or less at the same time.

I am not a lawyer, so I cannot tell you if something based on Clock-SI would infringe on the Microsoft patent. The main
distinction to me seems to be that Clock-SI is based on physical clocks, while the Microsoft patent talks about logical
clocks, but again I am not a lawyer.

Best regards,

Willy.
--------------------------------------------------


Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the patent?
According to the patent holder, the differences between Clock-SI and the patent seem to be fewer than the similarities.


Regards
Takayuki Tsunakawa


Re: Global snapshots

From
Andrey Lepikhov
Date:

22.09.2020 03:47, tsunakawa.takay@fujitsu.com wrote:
> Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the
patent? According to the patent holder, the differences between Clock-SI and the patent seem to be fewer than the
similarities.
Thank you for this work!
As I can see, the main development difficulties lie in other areas: CSN, 
resolver, global deadlocks, 2PC commit... I'm not a lawyer either. But if 
we get remarks from the patent holders, we can rewrite our Clock-SI 
implementation.

-- 
regards,
Andrey Lepikhov
Postgres Professional



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
> Thank you for this work!
> As I can see, the main development difficulties lie in other areas: CSN, resolver,
> global deadlocks, 2PC commit... I'm not a lawyer either. But if we get remarks from
> the patent holders, we can rewrite our Clock-SI implementation.

Yeah, I understand your feeling.  I personally don't like patents, and don't want to be disturbed by them.  But
the world is not friendly...  We are not lawyers, but we have to do our best to make sure PostgreSQL will be
patent-free by checking the technologies as engineers.

Among the above items, CSN is the only concerning one.  The other items are written in textbooks, well-known, and used
in other DBMSs, so they should be free from patents.  However, CSN is not (at least to me).  Have you checked whether
CSN is related to some patent?  Or is CSN or similar technology already widely used in famous software, so that we can
regard it as patent-free?

And please wait.  As below, the patent holder just says that Clock-SI is not based on the patent and was an independent
development.  He doesn't say Clock-SI does not overlap with the patent, or that implementing Clock-SI does not infringe
on the patent.  Rather, he suggests that Clock-SI has many similarities and thus those may match the claims of the
patent (unintentionally?).  I felt this is a sign of risking infringement.

"The answer to your question is: No, Clock-SI is not based on the patent - it was an entirely independent development.
The two approaches are similar in the sense that there is no global clock, the commit time of a distributed transaction
is the same in every partition where it modified data, and a transaction gets its snapshot timestamp from a local clock.
The difference is whether a distributed transaction gets its commit timestamp before or after the prepare phase in
2PC."

The timeline of events also worries me.  It seems unnatural to consider that Clock-SI and the patent are independent.

    2010/6 - 2010/9  One Clock-SI author worked for Microsoft Research as a research intern
    2010/10  Microsoft filed the patent
    2011/9 - 2011/12  The same Clock-SI author worked for Microsoft Research as a research intern
    2013  The same author moved to EPFL and published the Clock-SI paper with another author who has worked for
Microsoft Research since then.

So, could you give your opinion on whether we can use Clock-SI without overlapping with the patent claims?  I also will
try to check and see, so that I can understand your technical analysis.

And I've just noticed that I got in touch with another author of Clock-SI via SNS, and sent an inquiry to him.  I'll
report again when I have a reply.


Regards
Takayuki Tsunakawa



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hi Andrey san, all,

From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
> And please wait.  As below, the patent holder just says that Clock-SI is not
> based on the patent and an independent development.  He doesn't say
> Clock-SI does not overlap with the patent or implementing Clock-SI does not
> infringe on the patent.  Rather, he suggests that Clock-SI has many
> similarities and thus those may match the claims of the patent
> (unintentionally?)  I felt this is a sign of risking infringement.
> 
> "The answer to your question is: No, Clock-SI is not based on the patent - it
> was an entirely independent development. The two approaches are similar in
> the sense that there is no global clock, the commit time of a distributed
> transaction is the same in every partition where it modified data, and a
> transaction gets it snapshot timestamp from a local clock. The difference is
> whether a distributed transaction gets its commit timestamp before or after the
> prepare phase in 2PC."
> 
> The timeline of events also worries me.  It seems unnatural to consider that
> Clock-SI and the patent are independent.
> 
>     2010/6 - 2010/9  One Clock-SI author worked for Microsoft Research as
> a research intern
>     2010/10  Microsoft filed the patent
>     2011/9 - 2011/12  The same Clock-SI author worked for Microsoft
> Research as a research intern
>     2013  The same author moved to EPFL and published the Clock-SI paper
> with another author who has worked for Microsoft Research since then.
> 
> So, could you give your opinion whether we can use Clock-SI without
> overlapping with the patent claims?  I also will try to check and see, so that I
> can understand your technical analysis.
> 
> And I've just noticed that I got in touch with another author of Clock-SI via SNS,
> and sent an inquiry to him.  I'll report again when I have a reply.

I got a reply from the main author of the Clock-SI paper:

[Reply from the Clock-SI author Jiaqing Du]
--------------------------------------------------
Thanks for reaching out.

I actually did not know that Microsoft wrote a patent which is similar to the ideas in my paper. I worked there as an
intern. My Clock-SI paper was done at my school (EPFL) after my internships at Microsoft. The paper was very loosely
related to my internship project at Microsoft. In a sense, the internship project at Microsoft inspired me to work on
Clock-SI after I finished the internship. As you see in the paper, my coauthor, who is my internship host, is also from
Microsoft, but interestingly he is not on the patent :)

Cheers,
Jiaqing
--------------------------------------------------


Unfortunately, he also did not assert that Clock-SI does not infringe on the patent.  Rather, worrying words are mixed
in: "similar to the ideas in my paper", "loosely related", "inspired".

Also, his internship host is the co-author of the Clock-SI paper.  That person should be Sameh Elnikety, who has been
working for Microsoft Research.  I also asked him the same question, but he has been silent for about 10 days.

When I had a quick look, the patent appeared to be broader than Clock-SI, with Clock-SI being a concrete application of
the patent.  This is just my guess, but Sameh Elnikety may have known the patent and set the internship theme at
Microsoft or the research subject at EPFL based on it, whether consciously or not.

As of now, it seems that Clock-SI needs to be evaluated against the patent claims by two or more persons -- one
who knows Clock-SI well and implemented it for Postgres (Andrey-san?), and someone else who shares little
benefit with the former person and can see it objectively.


Regards
Takayuki Tsunakawa


Re: Global snapshots

From
Fujii Masao
Date:

On 2020/09/17 15:56, Amit Kapila wrote:
> On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>>> One alternative is to add only hooks into PostgreSQL core so that we can
>>>> implement the global transaction management outside. This idea was
>>>> discussed before as the title "eXtensible Transaction Manager API".
>>>
>>> Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...
>>
>> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>>
> 
> It is not very clear what exactly we can do about the point raised by
> Tsunakawa-San related to patent in this technology as I haven't seen
> that discussed during other development but maybe we can try to study
> a bit. One more thing I would like to bring here is that it seems to
> be there have been some concerns about this idea when originally
> discussed [1]. It is not very clear to me if all the concerns are
> addressed or not. If one can summarize the concerns discussed and how
> the latest patch is able to address those then it will be great.

I have one concern about Clock-SI (sorry if this concern was already
discussed in the past). As far as I read the paper about Clock-SI, ISTM that
a Tx2 that starts after Tx1's commit can fail to see the results of Tx1,
due to clock skew. Please see the following example:

1. Tx1 starts at the server A.

2. Tx1 writes some records at the server A.

3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
      the commit at the server A.
      This means that Tx1 is the local transaction, not distributed one.

4. Tx2 starts at the server B, i.e., the server B works as
      the coordinator node for Tx2.

5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
      due to clock skew) and uses 10 as SnapshotTime at the server B.

6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.

7. Tx2 doesn't need to wait due to clock skew because the imported
      SnapshotTime 10 is smaller than the local clock at the server A.

8. Tx2 fails to see the records written by Tx1 at the server A because
      Tx1's CommitTime 20 is larger than SnapshotTime 10.

So Tx1 was successfully committed before Tx2 started. But, in the above example,
the subsequent transaction Tx2 fails to see the committed results.

A single PostgreSQL instance seems to guarantee the linearizability of
transactions, but Clock-SI doesn't in the distributed environment. Is this
understanding right? Or am I missing something?

If my understanding is right, shouldn't we address that issue when using
Clock-SI? Or has the patch already addressed the issue?
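The example above can be condensed into a toy model (my own sketch; the `Server` class and its methods are invented for illustration, and Clock-SI's remote-snapshot rule is reduced to the timestamp comparisons in steps 5-8):

```python
# Toy model of the stale-read scenario: snapshot-time vs commit-time
# visibility under Clock-SI, with server B's clock lagging server A's.

class Server:
    def __init__(self, name, clock):
        self.name = name
        self.clock = clock          # local physical clock value
        self.committed = []         # list of (commit_time, payload)

    def commit(self, payload):
        """Commit locally, stamping the record with the local clock."""
        self.committed.append((self.clock, payload))

    def visible(self, snapshot_time):
        """Clock-SI visibility: a record is visible iff its commit time
        is <= the snapshot time.  (If snapshot_time were ahead of
        self.clock, Clock-SI would make the reader wait; here it is
        behind, so no wait happens -- step 7.)"""
        assert snapshot_time <= self.clock, "reader would have to wait"
        return [p for t, p in self.committed if t <= snapshot_time]

server_a = Server("A", clock=20)
server_b = Server("B", clock=10)    # skewed: behind server A

server_a.commit("row written by Tx1")   # steps 1-3: commits at time 20
snapshot_time = server_b.clock          # steps 4-5: Tx2 snapshots at 10
print(server_a.visible(snapshot_time))  # step 8: prints [] -- Tx1's row is missed
```

So the anomaly needs nothing more exotic than a commit on the fast clock followed immediately by a snapshot on the slow one.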

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Global snapshots

From
Masahiko Sawada
Date:
On Thu, 15 Oct 2020 at 01:41, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>
>
>
> On 2020/09/17 15:56, Amit Kapila wrote:
> > On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> >>
> >>>> One alternative is to add only hooks into PostgreSQL core so that we can
> >>>> implement the global transaction management outside. This idea was
> >>>> discussed before as the title "eXtensible Transaction Manager API".
> >>>
> >>> Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...
> >>
> >> But it may be worth revisiting this idea if we cannot avoid the patent issue.
> >>
> >
> > It is not very clear what exactly we can do about the point raised by
> > Tsunakawa-San related to patent in this technology as I haven't seen
> > that discussed during other development but maybe we can try to study
> > a bit. One more thing I would like to bring here is that it seems to
> > be there have been some concerns about this idea when originally
> > discussed [1]. It is not very clear to me if all the concerns are
> > addressed or not. If one can summarize the concerns discussed and how
> > the latest patch is able to address those then it will be great.
>
> I have one concern about Clock-SI (sorry if this concern was already
> discussed in the past). As far as I read the paper about Clock-SI, ISTM that
> Tx2 that starts after Tx1's commit can fail to see the results by Tx1,
> due to the clock skew. Please see the following example;
>
> 1. Tx1 starts at the server A.
>
> 2. Tx1 writes some records at the server A.
>
> 3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
>       the commit at the server A.
>       This means that Tx1 is the local transaction, not distributed one.
>
> 4. Tx2 starts at the server B, i.e., the server B works as
>       the coordinator node for Tx2.
>
> 5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
>       due to clock skew) and uses 10 as SnapshotTime at the server B.
>
> 6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.
>
> 7. Tx2 doesn't need to wait due to clock skew because the imported
>       SnapshotTime 10 is smaller than the local clock at the server A.
>
> 8. Tx2 fails to see the records written by Tx1 at the server A because
>       Tx1's CommitTime 20 is larger than SnapshotTime 10.
>
> So Tx1 was successfully committed before Tx2 starts. But, at the above example,
> the subsequent transaction Tx2 fails to see the committed results.
>
> The single PostgreSQL instance seems to guarantee that linearizability of
> the transactions, but Clock-SI doesn't in the distributed env. Is this my
> understanding right? Or am I missing something?
>
> If my understanding is right, shouldn't we address that issue when using
> Clock-SI? Or the patch has already addressed the issue?

As far as I read the paper, the above scenario can happen. I could
reproduce the above scenario with the patch. Moreover, a stale read
could happen even if Tx1 was initiated at server B (i.e., both
transactions started at the same server in sequence). In this case,
Tx1's commit timestamp would be 20, taken from server A's local clock,
whereas Tx2's snapshot timestamp would be 10, the same as in the above
case. Therefore, even though both transactions were initiated at the
same server, linearizability is not provided.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Global snapshots

From
Fujii Masao
Date:

On 2020/10/23 11:58, Masahiko Sawada wrote:
> On Thu, 15 Oct 2020 at 01:41, Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>
>>
>>
>> On 2020/09/17 15:56, Amit Kapila wrote:
>>> On Thu, Sep 10, 2020 at 4:20 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
>>>>
>>>>>> One alternative is to add only hooks into PostgreSQL core so that we can
>>>>>> implement the global transaction management outside. This idea was
>>>>>> discussed before as the title "eXtensible Transaction Manager API".
>>>>>
>>>>> Yeah, I read that discussion.  And I remember Robert Haas and Postgres Pro people said it's not good...
>>>>
>>>> But it may be worth revisiting this idea if we cannot avoid the patent issue.
>>>>
>>>
>>> It is not very clear what exactly we can do about the point raised by
>>> Tsunakawa-San related to patent in this technology as I haven't seen
>>> that discussed during other development but maybe we can try to study
>>> a bit. One more thing I would like to bring here is that it seems to
>>> be there have been some concerns about this idea when originally
>>> discussed [1]. It is not very clear to me if all the concerns are
>>> addressed or not. If one can summarize the concerns discussed and how
>>> the latest patch is able to address those then it will be great.
>>
>> I have one concern about Clock-SI (sorry if this concern was already
>> discussed in the past). As far as I read the paper about Clock-SI, ISTM that
>> Tx2 that starts after Tx1's commit can fail to see the results by Tx1,
>> due to the clock skew. Please see the following example;
>>
>> 1. Tx1 starts at the server A.
>>
>> 2. Tx1 writes some records at the server A.
>>
>> 3. Tx1 gets the local clock 20, uses 20 as CommitTime, then completes
>>        the commit at the server A.
>>        This means that Tx1 is the local transaction, not distributed one.
>>
>> 4. Tx2 starts at the server B, i.e., the server B works as
>>        the coordinator node for Tx2.
>>
>> 5. Tx2 gets the local clock 10 (i.e., it's delayed behind the server A
>>        due to clock skew) and uses 10 as SnapshotTime at the server B.
>>
>> 6. Tx2 starts the remote transaction at the server A with SnapshotTime 10.
>>
>> 7. Tx2 doesn't need to wait due to clock skew because the imported
>>        SnapshotTime 10 is smaller than the local clock at the server A.
>>
>> 8. Tx2 fails to see the records written by Tx1 at the server A because
>>        Tx1's CommitTime 20 is larger than SnapshotTime 10.
>>
>> So Tx1 was successfully committed before Tx2 starts. But, at the above example,
>> the subsequent transaction Tx2 fails to see the committed results.
>>
>> The single PostgreSQL instance seems to guarantee that linearizability of
>> the transactions, but Clock-SI doesn't in the distributed env. Is this my
>> understanding right? Or am I missing something?
>>
>> If my understanding is right, shouldn't we address that issue when using
>> Clock-SI? Or the patch has already addressed the issue?
> 
> As far as I read the paper, the above scenario can happen. I could
> reproduce the above scenario with the patch. Moreover, a stale read
> could happen even if Tx1 was initiated at server B (i.g., both
> transactions started at the same server in sequence). In this case,
> Tx1's commit timestamp would be 20 taken from server A's local clock
> whereas Tx2's snapshot timestamp would be 10 same as the above case.
> Therefore, in spite of both transactions were initiated at the same
> server the linearizability is not provided.

Yeah, so if we need to guarantee transaction linearizability even
in a distributed environment (probably yes, right?), using only Clock-SI
is not enough. We would need to implement something more
in addition to Clock-SI, or adopt a different approach than Clock-SI.
Thoughts?
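One known candidate for that "something more" (an assumption on my side, not something in the proposed patches) is a Spanner-style commit wait: after choosing a commit timestamp, delay acknowledging the commit until every clock within a known skew bound has passed it, so that any later transaction's snapshot timestamp cannot fall below it. A sketch, with a hypothetical MAX_CLOCK_SKEW bound:

```python
# Sketch of commit-wait on top of skewed clocks: acknowledge the commit
# only after commit_ts + MAX_CLOCK_SKEW has passed on the local clock, so
# any transaction that starts afterwards (on any node within the skew
# bound) gets a snapshot timestamp >= commit_ts.
import time

MAX_CLOCK_SKEW = 0.005   # assumed bound on inter-node clock skew, seconds

def commit_with_wait(commit_ts, now=time.monotonic, sleep=time.sleep):
    """Block until commit_ts is safely in the past on every node."""
    wait_until = commit_ts + MAX_CLOCK_SKEW
    remaining = wait_until - now()
    if remaining > 0:
        sleep(remaining)
    return now()         # caller may ack the commit from this point on

ack_time = commit_with_wait(time.monotonic())
# Any snapshot taken after the ack is at least MAX_CLOCK_SKEW newer than
# the commit timestamp, closing the stale-read window described above.
```

The cost is latency proportional to the skew bound on every commit, which is why such schemes depend on keeping clocks tightly synchronized.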

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Fujii-san, Sawada-san, all,

From: Fujii Masao <masao.fujii@oss.nttdata.com>
> Yeah, so if we need to guarantee the transaction linearizability even
> in distributed env (probably this is yes. Right?), using only Clock-SI
> is not enough. We would need to implement something more
> in addition to Clock-SI or adopt the different approach other than Clock-SI.
> Thought?

Could you please try interpreting MVCO and see if we have any hope in this?  This doesn't fit in my small brain.  I'll
catch up with understanding this when I have time.

MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf


MVCO is a multiversion member of Commitment Ordering algorithms described below:

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co

Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering


Related patents are as follows.  The last one is MVCO.

US5504900A - Commitment ordering for guaranteeing serializability across distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900

US5504899A - Guaranteeing global serializability by applying commitment ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899

US5701480A - Distributed multi-version commitment ordering protocols for guaranteeing serializability during
transaction processing
https://patents.google.com/patent/US5701480A/en?oq=US5701480


Regards
Takayuki Tsunakawa



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello,


Fujii-san and I discussed how to move the scale-out development forward.  We are both worried that Clock-SI is
(highly?) likely to infringe the said Microsoft patent.  So we agreed that we are going to investigate Clock-SI and the
patent, and if we have to conclude that we cannot embrace Clock-SI, we will explore other possibilities.

IMO, it seems that Clock-SI overlaps with the patent and we can't use it.  First, looking back at how to interpret a
patent document, the patent "claims" are what we should pay our greatest attention to.  According to the following
citation from the IP guide by the Software Freedom Law Center (SFLC) [1], software infringes a patent if it implements
everything in any one claim, not all claims.


--------------------------------------------------
4.2 Patent Infringement
To prove that you infringe a patent, the patent holder must show that you make, use, offer to sell, or sell the
invention as it is defined in at least one claim of the patent.

For software to infringe a patent, the software essentially must implement everything recited in one of the patent's
claims. It is crucial to recognize that infringement is based directly on the claims of the patent, and not on what is
stated or described in other parts of the patent document.
--------------------------------------------------


And, Clock-SI implements at least claims 11 and 20, cited below.  It doesn't matter whether Clock-SI uses a physical
clock or a logical one.


--------------------------------------------------
11. A method comprising:
receiving information relating to a distributed database transaction operating on data in data stores associated with
respective participating nodes associated with the distributed database transaction;

requesting commit time votes from the respective participating nodes, the commit time votes reflecting local clock
values of the respective participating nodes;

receiving the commit time votes from the respective participating nodes in response to the requesting;

computing a global commit timestamp for the distributed database transaction based at least in part on the commit time
votes, the global commit timestamp reflecting a maximum value of the commit time votes received from the respective
participating nodes; and

synchronizing commitment of the distributed database transaction at the respective participating nodes to the global
commit timestamp,

wherein at least the computing is performed by a computing device.

20. A method for managing a distributed database transaction, the method comprising:
receiving information relating to the distributed database transaction from a transaction coordinator associated with
thedistributed database transaction;
 
determining a commit time vote for the distributed database transaction based at least in part on a local clock;
communicating the commit time vote for the distributed database transaction to the transaction coordinator;
receiving a global commit timestamp from the transaction coordinator;
synchronizing commitment of the distributed database transaction to the global commit timestamp;
receiving a remote request from a requesting database node corresponding to the distributed database transaction;
creating a local transaction corresponding to the distributed database transaction;
compiling a list of database nodes involved in generating a result of the local transaction and access types utilized
by respective database nodes in the list of database nodes; and
 
returning the list of database nodes and the access types to the requesting database node in response to the remote
request,
wherein at least the compiling is performed by a computing device.
--------------------------------------------------
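For what it's worth, the coordinator-side flow that claim 11 recites (request commit time votes from participants, take their maximum as the global commit timestamp, and synchronize commitment to it) can be sketched in a few lines. This is only an illustrative sketch of the claim's wording; all names are hypothetical and are not taken from the Clock-SI paper or from any patch in this thread:

```python
def prepare_vote(local_clock):
    """Participant side (cf. claim 20): vote with the current
    local clock value as the earliest acceptable commit time."""
    return local_clock

def compute_global_commit_ts(votes):
    """Coordinator side (cf. claim 11): the global commit timestamp
    is the maximum of the commit time votes received from the
    participating nodes."""
    if not votes:
        raise ValueError("no participating nodes")
    return max(votes)

# Three nodes with skewed local clocks vote 100, 105, and 98;
# the transaction is then committed everywhere at timestamp 105.
votes = [prepare_vote(clock) for clock in (100, 105, 98)]
global_ts = compute_global_commit_ts(votes)
```

Note that taking the maximum is exactly what makes the breadth question relevant: any vote-and-take-the-max commit timestamp negotiation, physical or logical clocks alike, seems to match this shape.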


My concern is that the above claims appear to cover a somewhat broad range.  I wonder whether other patents or unpatented
technologies overlap with this kind of description.
 

Thoughts?


[1]
A Legal Issues Primer for Open Source and Free Software Projects
https://www.softwarefreedom.org/resources/2008/foss-primer.pdf

[2]
US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents
https://patents.google.com/patent/US8356007


Regards
Takayuki Tsunakawa


Re: Global snapshots

From
Fujii Masao
Date:

On 2021/01/01 12:14, tsunakawa.takay@fujitsu.com wrote:
> Hello,
> 
> 
> Fujii-san and I discussed how to move the scale-out development forward.  We are both worried that Clock-SI is
> (highly?) likely to infringe the said Microsoft patent.  So we agreed we are going to investigate Clock-SI and the
> patent, and if we have to conclude that we cannot embrace Clock-SI, we will explore other possibilities.
 

Yes.


> 
> IMO, it seems that Clock-SI overlaps with the patent and we can't use it.  First, looking back at how to interpret a
> patent document, the patent "claims" are what we should pay the greatest attention to.  According to the following citation
> from the IP guide by the Software Freedom Law Center (SFLC) [1], software infringes a patent if it implements everything of
> any claim, not all claims.
 
> 
> 
> --------------------------------------------------
> 4.2 Patent Infringement
> To prove that you infringe a patent, the patent holder must show that you make, use, offer to sell, or sell the
> invention as it is defined in at least one claim of the patent.
 
> 
> For software to infringe a patent, the software essentially must implement everything recited in one of the patent's
> claims.  It is crucial to recognize that infringement is based directly on the claims of the patent, and not on what is
> stated or described in other parts of the patent document.
 
> --------------------------------------------------
> 
> 
> And, Clock-SI implements at least claims 11 and 20 cited below.  It doesn't matter whether Clock-SI uses a physical
> clock or a logical one.
 

Thanks for sharing the result of your investigation!

Regarding at least claim 11, I reached the same conclusion. As far as
I understand, Clock-SI actually performs the method described
in claim 11 when determining the commit time and doing the commit
on each node.

I don't intend to disparage Clock-SI or any activities based on it. OTOH,
I'm now wondering if it's worth considering another approach for global
transaction support, while I'm still interested in Clock-SI technically.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
Hello, Andrey-san, all,


Based on the request at HighGo's sharding meeting, I'm re-sending the information on Commitment Ordering that could be
used for global visibility.  Their patents have already expired.
 



--------------------------------------------------
Has anyone examined the following Multiversion Commitment Ordering (MVCO)?  Although I haven't understood it yet, it
insists that no concurrency control information, including timestamps, needs to be exchanged among the cluster nodes.  I'd
appreciate it if someone could give an opinion.
 



Commitment Ordering Based Distributed Concurrency Control for Bridging Single and Multi Version Resources.
 Proceedings of the Third IEEE International Workshop on Research Issues on Data Engineering: Interoperability in
Multidatabase Systems (RIDE-IMS), Vienna, Austria, pp. 189-198, April 1993. (also DEC-TR 853, July 1992)
 
https://ieeexplore.ieee.org/document/281924?arnumber=281924



The author of the above paper, Yoav Raz, seems to have had a strong passion, at least until 2011, for convincing people
of the power of Commitment Ordering (CO) for global serializability.  However, he complains (sadly) that
almost all researchers ignore his theory, as written on his site and the Wikipedia page for Commitment Ordering below.
Does anyone know why CO is ignored?
 


--------------------------------------------------
* Or, maybe we can use the following Commitment Ordering, which doesn't require a timestamp or any other information to
be transferred among the cluster nodes.  However, it seems to have to track the order of read and write operations
among concurrent transactions to ensure the correct commit order, so I'm not sure about the performance.  The MVCO paper
seems to present the information we need, but I haven't understood it well yet (it's difficult).  Could anybody
kindly interpret this?
 



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co
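The conflict tracking mentioned above can be made concrete with a toy sketch of Raz's generic local CO algorithm: when a transaction commits, every still-undecided transaction that precedes it in the conflict (precedence) graph is aborted, so the commit order never contradicts the conflict order. This is only an illustrative sketch of my understanding of the principle; class and method names are made up, and a real implementation would track conflicts per data item at read/write time:

```python
class COScheduler:
    """Toy local Commitment Ordering enforcer (hypothetical sketch)."""

    def __init__(self):
        self.undecided = set()   # transactions not yet committed/aborted
        self.precedes = {}       # txn -> set of txns it must commit before

    def begin(self, txn):
        self.undecided.add(txn)
        self.precedes.setdefault(txn, set())

    def add_conflict(self, earlier, later):
        # 'earlier' made a conflicting access before 'later', so it
        # must commit first -- or be aborted if 'later' commits first.
        self.precedes[earlier].add(later)

    def commit(self, txn):
        # Committing txn aborts every undecided transaction that
        # precedes it, keeping commit order consistent with conflicts.
        aborted = {u for u in self.undecided
                   if u != txn and txn in self.precedes.get(u, set())}
        self.undecided -= aborted | {txn}
        return aborted

sched = COScheduler()
for t in ("T1", "T2", "T3"):
    sched.begin(t)
sched.add_conflict("T1", "T2")     # T1 conflicts with and precedes T2
aborted = sched.commit("T2")       # committing T2 first aborts T1
```

This illustrates the performance worry in the bullet above: nothing crosses the network at commit time, but every local conflict between concurrent transactions has to be recorded, and committing "out of order" costs aborts.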



--------------------------------------------------
Could you please try interpreting MVCO and see if we have any hope in this?  This doesn't fit in my small brain.  I'll
catch up with understanding this when I have time.
 



MVCO - Technical report - IEEE RIDE-IMS 93 (PDF; revised version of DEC-TR 853)
https://sites.google.com/site/yoavraz2/MVCO-WDE.pdf



MVCO is a multiversion member of Commitment Ordering algorithms described below:



Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co



Commitment ordering - Wikipedia
https://en.wikipedia.org/wiki/Commitment_ordering



Related patents are as follows.  The last one is MVCO.



US5504900A - Commitment ordering for guaranteeing serializability across distributed transactions
https://patents.google.com/patent/US5504900A/en?oq=US5504900



US5504899A - Guaranteeing global serializability by applying commitment ordering selectively to global transactions
https://patents.google.com/patent/US5504899A/en?oq=US5504899



US5701480A - Distributed multi-version commitment ordering protocols for guaranteeing serializability during
transaction processing
 
https://patents.google.com/patent/US5701480A/en?oq=US5701480


Regards
Takayuki Tsunakawa


Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
Current state of the patch set rebased on master, 5aed6a1fc2.

It is a development version. Some problems with visibility are still 
detected in two tests:
1. CSN Snapshot module - TAP test on time skew.
2. Clock SI implementation - TAP test on emulation of bank transaction.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

RE: Global snapshots

From
"tsunakawa.takay@fujitsu.com"
Date:
From: Andrey V. Lepikhov <a.lepikhov@postgrespro.ru>
> Current state of the patch set rebased on master, 5aed6a1fc2.
> 
> It is a development version. Some problems with visibility are still
> detected in two tests:
> 1. CSN Snapshot module - TAP test on time skew.
> 2. Clock SI implementation - TAP test on emulation of bank transaction.

I'm sorry to be late to respond.  Thank you for the update.

As discussed at the HighGo meeting, what do you think we should do about this patch set, now that we agreed that
Clock-SI is covered by Microsoft's patent?  I'd appreciate it if you could share some ideas for changing part of the
algorithm to circumvent the patent.
 

Otherwise, why don't we discuss alternatives, such as the Commitment Ordering?

I have a hunch that YugabyteDB's method seems promising, which I wrote about in the following wiki.  Of course, we should
make efforts to see whether it's patented before diving deeper into the design or implementation.
 

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design


Regards
Takayuki Tsunakawa



Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
Next version of the CSN implementation in snapshots, to achieve proper 
snapshot isolation in the case of a cross-instance distributed transaction.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment

Re: Global snapshots

From
"Andrey V. Lepikhov"
Date:
The patch in the previous letter has several faults. Please use the new version.
Also, this version fixes the problem with losing the CSN value in a parallel 
worker (TAP test 003_parallel_safe.pl). Thanks to a.pyhalov for detecting 
the problem and providing a bugfix.

-- 
regards,
Andrey Lepikhov
Postgres Professional

Attachment