Re: Multi-Master Logical Replication - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Multi-Master Logical Replication
Date
Msg-id CAA4eK1JBp-khmVN=gHyojFmem3zuj-iU4O0ENmsZz1yKeEbNtw@mail.gmail.com
In response to Re: Multi-Master Logical Replication  (Bruce Momjian <bruce@momjian.us>)
List pgsql-hackers
On Wed, Jun 1, 2022 at 7:33 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Jun  1, 2022 at 10:27:27AM +0530, Amit Kapila wrote:
> > On Tue, May 31, 2022 at 7:36 PM Bruce Momjian <bruce@momjian.us> wrote:
> > > Uh, thinking some more, why would anyone set things up this way ---
> > > having part of a table being primary on one server and a different part
> > > of the table be a subscriber.  Seems it would be simpler and safer to
> > > create two child tables and have one be primary on only one server.
> > > Users can access both tables using the parent.
> >
> > Yes, users can choose to do it that way, but still, to keep the nodes in
> > sync and maintain continuity of operations, it will be very difficult to manage
> > the operations without the LRG APIs. Let us consider a simple two-node
> > example where on each node there is Table T that has partitions P1 and
> > P2. As far as I can understand, one needs to have the below kind of
> > set-up to allow local operations on geographically distributed nodes.
> >
> > Node-1:
> > node1 writes to P1
> > node1 publishes P1
> > node2 subscribes to P1 of node1
> >
> > Node-2:
> > node2 writes to P2
> > node2 publishes P2
> > node1 subscribes to P2 on node2
>
> Yes, that is how you would set it up.
>
> > In this setup, we need to publish individual partitions, otherwise, we
> > will face the loop problem where the data sent by node-1 to node-2 via
> > logical replication will again come back to it causing problems like
> > constraints violations, duplicate data, etc. There could be other ways
> > to do this setup with current logical replication commands (for ex.
> > publishing via root table) but that would require ways to avoid loops
> > and could have other challenges.
>
> Right, individual partitions.
>
> > Now, in such a setup/scheme, consider a scenario (scenario-1), where
> > node-2 went off (either it crashes, went out of network, just died,
> > etc.) and comes up after some time. Now, one can either make the
> > node-2 available by fixing the problem it has or can promote standby
> > in that location (if any) to become master, both might require some
> > time. In the meantime to continue the operations (which provides a
> > seamless experience to users), users will be connected to node-1 to
> > perform the required write operations. Now, to achieve this without
> > LRG APIs, it will be quite complex for users to keep the data in sync.
> > One needs to perform various steps to get the partition P2 data that
> > went to node-1 till the time node-2 was not available. On node-1, it
> > has to publish P2 changes till the time node-2 becomes available with
> > the help of Create/Drop Publication APIs. And when node-2 comes back,
> > it has to create a subscription for the above publication pub-2 to get
> > that data, ensure both the nodes are in sync, and then allow
> > operations on node-2.
>
> Well, you are going to need to modify the app so it knows it can write
> to both partitions on failover anyway.
>

I am not sure this point is clear to me. From what I can understand,
there are two possibilities for the app in this case, and both seem
problematic.
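
For reference, the two-node setup quoted above corresponds to roughly
the following commands (just a sketch; the publication/subscription
names and connection strings are illustrative, not taken from the
earlier messages):

-- On node-1 (local writes go to partition P1):
CREATE PUBLICATION pub_p1 FOR TABLE p1;
-- On node-2 (local writes go to partition P2):
CREATE PUBLICATION pub_p2 FOR TABLE p2;

-- On node-2, pull node-1's P1 changes:
CREATE SUBSCRIPTION sub_p1 CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_p1;
-- On node-1, pull node-2's P2 changes:
CREATE SUBSCRIPTION sub_p2 CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub_p2;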

(a) The app can be taught to write to the P2 partition on node-1 while
node-2 is unavailable. If so, how will node-2 get the partition P2 data
that went to node-1 during that time? If we don't get the data to
node-2, then the operations on node-2 (once it comes back) can return
incorrect results. Also, we need to ensure that all the P2 data that
went to node-1 is replicated to all other nodes in the system, and for
that too we need to create new subscriptions pointing to node-1. It is
easier to think of doing this for physical replication, where after
failover the old master node can start following the new node and the
app just needs to be taught to write to the new master node. I can't
see how we can achieve that with the current logical replication APIs
(apart from performing the complex steps I shared earlier). One of the
purposes of these new LRG APIs is to ensure that users don't need to
follow those complex steps after failover.
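
To make those steps concrete, the manual catch-up with the current
APIs would look roughly like this (a sketch only, reusing the
illustrative names above; the real procedure also has to handle the
cut-over of writes, verifying that the nodes are in sync, and avoiding
replication loops):

-- On node-1, while node-2 is down and the app writes P2 rows to node-1:
CREATE PUBLICATION pub_p2_failover FOR TABLE p2;

-- On node-2, once it is back up, to fetch the P2 rows that went to node-1:
CREATE SUBSCRIPTION sub_p2_failover
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_p2_failover;

-- After node-2 has caught up and P2 writes go to node-2 again:
DROP SUBSCRIPTION sub_p2_failover;   -- on node-2
DROP PUBLICATION pub_p2_failover;    -- on node-1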

(b) The other possibility is that the app is responsible for ensuring
that the same data is written to both node-1 and node-2 while one of
them is unavailable. For that, the app needs to store the data
somewhere while one of the nodes is unavailable and then write it once
that node becomes available again. Also, this won't be practical when
there are more partitions (say 10 or more), as all the partitions'
data needs to be present on each node. I think it is the
responsibility of the database to keep the data in sync among nodes
when one or more of the nodes are not available.

-- 
With Regards,
Amit Kapila.


