Thread: [ADMIN] avoiding split brain with repmgr
Hi!

In a cluster set up with Postgres 9.6, streaming replication and repmgr, I'm struggling to find a good/simple solution for avoiding split brain.

The current theoretical setup consists of four nodes across two data centers. The master node is set up with 1-of-3 synchronous replication, i.e. it waits for at least one other node to COMMIT as well. repmgrd is installed on every node.

The clients will use the PostgreSQL JDBC driver with targetServerType=master, so they connect only to the master server out of a list of four hosts.

The split-brain scenario I foresee is when the master node locks up or is isolated for a while and comes back online after repmgrd on the other nodes has elected a new master.

As the original master requires one synchronous replication node, and the two remaining standbys are now streaming from the new master, it will fortunately not start writing a separate timeline, but it will still serve stale read-only queries. For writes it will accept connections, which then hang. The repmgrd instance on the original master sees no problem either, so it does nothing.

Ideally, though, this instance should be shut down, as it has no standbys attached and the status on the other nodes indicates this master has failed.

Any suggestions? I'm trying to keep the setup simple, without a central pgbouncer/pgpool. Is there any simple way to avoid a central connection point or a custom monitoring script that looks for exactly this issue?

Also, do you see any other potential pitfalls in this setup?

Thanks for thinking this through,

Aleksander

--
Aleksander Kamenik
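[For reference, a minimal sketch of the client-side connection described above, using the PostgreSQL JDBC driver's multi-host URL. Host names, database name and credentials are placeholders, and the org.postgresql driver must be on the classpath. Note that, as far as I know, the driver picks the "master" simply by checking that a host is writable (it looks at transaction_read_only), so an isolated old master that is still up and accepting connections would pass that test - which is exactly the split-brain window discussed in this thread.]

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class MasterConnect {
        public static void main(String[] args) throws Exception {
            // Multi-host URL: the driver walks the host list in order and keeps
            // the first server that reports itself as writable. All host names,
            // the database name and the credentials below are placeholders.
            String url = "jdbc:postgresql://node1:5432,node2:5432,node3:5432,node4:5432/appdb"
                       + "?targetServerType=master&connectTimeout=5";
            try (Connection conn = DriverManager.getConnection(url, "appuser", "secret")) {
                System.out.println("connected via " + conn.getMetaData().getURL());
            }
        }
    }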
I finally found this document, NOT referenced from the main README file in the repmgr repo:

https://github.com/2ndQuadrant/repmgr/blob/master/docs/repmgrd-node-fencing.md

I guess the default solution is pgbouncer.

Any simpler solutions for this tricky problem?

Regards,

Aleksander

On Mon, Aug 14, 2017 at 5:03 PM, Aleksander Kamenik <aleksander.kamenik@gmail.com> wrote:
> [original message quoted in full, trimmed]

--
Aleksander Kamenik
On 15/08/2017 06:57, Aleksander Kamenik wrote:
> I finally found this document NOT referenced from the main README file
> in the repmgr repo.
>
> https://github.com/2ndQuadrant/repmgr/blob/master/docs/repmgrd-node-fencing.md
>
> I guess the default solution is pgbouncer
>
> Any simpler solutions for this tricky problem?
>
> Regards,
>
> Aleksander

This is interesting to me, because I'm faced with a similar problem and I'm not 100% sold on that fencing technique. If I've misunderstood things please do yell at me (I welcome it :) ), but ...

The issue I have with that suggested mechanism, and I'd love to hear suggestions on how to get around it because maybe I missed something **horribly obvious**, is that repmgr doesn't seem to have a proper STONITH mechanism per se. It's all well and good repmgr being able to send a message/command to something like pgbouncer, pgpool or whatever saying "Hey, server B is the master now, so pay no attention to server A", but that depends on those messages being received. What if, for reasons, they're not?

Consider the following scenario:

You've got two data centres, DC1 and DC2. In each, you've got two or three PostgreSQL nodes, a pgbouncer node, and a few application servers. The master is on one of the nodes in DC1.

At 3AM there's a power failure in DC1. It only lasts a few minutes, but it's enough for repmgrd to decide to trigger failover to DC2. One of the nodes in DC2 becomes master and the other standby(s) in DC2 start to follow it.

As per the fencing method above, the repmgrd promotion triggers a custom script to send instructions to the pgbouncers in DC1 and DC2 to update their configurations to connect to the new master in DC2. The pgbouncer in DC2 complies; the pgbouncer in DC1 doesn't (because it's still down).

A few moments later, after the failover, power is restored / the UPS kicks in / the Ops team puts a coin in the meter. DC1 comes back up.

The master node in DC1 still believes it is the master. Its pgbouncer never got the message to update itself to follow the new master in DC2, so it is still passing connections through to the DC1 master. The other standby nodes in DC1 never got the repmgrd command to follow a new master either, as they were down, so they're still following the DC1 master.

You now have a master in DC1, with standby nodes following it and a pgbouncer passing sessions from the application servers to the DC1 master. You also have a master in DC2, with standby nodes following it and a pgbouncer passing sessions from the application servers to the DC2 master.

Because DC1 was down at the time repmgrd was sending the "Pay no attention to the DC1 master, update yourself to talk to DC2 instead" message to the pgbouncers, surely you've now got a split-brain scenario?

In the scenario I had, we couldn't use a VIP (for "reasons", according to our unix team :) ), so suggestions included JDBC connect strings with multiple servers, load balancers, etc. But they'd still see two masters at that point.

Without a proper mechanism for the "old" master to be shut down when it comes back up, to avoid a split brain, everything seems to rely upon repmgrd being able to successfully pass along the "Pay no attention to the old node" commands/messages. But what if it can't do that, because some of the servers were unable to receive the message/command?
Sure, maybe a DBA gets a fast page and is able to remote in and shut the old master down mere minutes after it comes back up (in the ideal world), but that's still a potential several minutes with a split brain, and nothing (internal to the cluster, at least) preventing it.

Or am I missing something *really* obvious? If this is a possibility, and I've not horribly misunderstood things, how can this scenario be worked around? It seems to be a potential problem with the fencing method suggested.

Regards,

M.

--
Martin Goodson

"Have you thought up some clever plan, Doctor?"
"Yes, Jamie, I believe I have."
"What're you going to do?"
"Bung a rock at it."
> I finally found this document NOT referenced from the main README file in the repmgr repo.
>
> https://github.com/2ndQuadrant/repmgr/blob/master/docs/repmgrd-node-fencing.md
>
> I guess the default solution is pgbouncer

Hello,

I'm not sure that any solution can be considered as standard, but we did implement such a solution with pgbouncer.

The script in the linked reference seems somewhat dangerous to me, as it first reconfigures pgbouncer and then promotes. This is not safe if the postgres nodes were to suffer a split brain.

In our case we used the following sequence:

- stop pgbouncer
- promote
- reconfigure and restart pgbouncer

This same sequence can be used for a manual switchover.

regards,

Marc Mamin

> [rest of quoted message trimmed]
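[A rough sketch of the ordering Marc describes above: stop the bouncer first, promote, and only then reconfigure and restart the bouncer. In practice this would normally be a small shell script wired into repmgrd's promote_command; it is written in Java here only to keep the examples in this thread in one language. The service name, the config-rewrite helper and the config path are assumptions about a typical setup, not anything prescribed by repmgr.]

    import java.io.IOException;

    public class PromoteSequence {

        // Run an external command and fail loudly if it does not exit with 0.
        static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("command failed: " + String.join(" ", cmd));
            }
        }

        public static void main(String[] args) throws Exception {
            // 1. Take pgbouncer down first, so no client can reach either the old
            //    or the new master while the promotion is in flight.
            run("sudo", "systemctl", "stop", "pgbouncer");

            // 2. Promote the local standby (typical repmgr invocation; config path assumed).
            run("repmgr", "-f", "/etc/repmgr.conf", "standby", "promote");

            // 3. Point pgbouncer at the new master and bring it back up.
            //    rewrite_pgbouncer_ini.sh is a hypothetical, site-specific helper.
            run("/usr/local/bin/rewrite_pgbouncer_ini.sh", "node2");
            run("sudo", "systemctl", "start", "pgbouncer");
        }
    }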
Hi!

Thanks for the replies.

In my case I'm avoiding automatic DC failover by setting zero priority on the DC2 nodes (node3 and node4). A DC1 (node1 and node2) failure is considered a major event that requires manual intervention anyhow. I still have repmgrd running on all nodes though.

Anyhow, the split-brain issue is still there for node1/node2.

Martin Goodson is right that a pgbouncer reconfig is not a magic bullet. If one of the pgbouncer instances is also affected (or sits in the relevant part of the network), be it installed on the client machine or separately, this solution will still produce a split-brain scenario.

I've been looking at a STONITH-like solution. In my case it's a VMware environment: when node2 is being promoted, it sends a signal to vCenter to kill node1. This could work, though there are security concerns. Should a VM be able to control another VM via vCenter? What if that call fails (is it sync or async/queued)? Do we even know if it fails in vCenter? If it's a synchronous call, should the promotion be cancelled if it returns an error? I haven't looked hard enough yet, but I hope to find a way for vCenter to monitor for a file on node2 which would trigger the shutdown of node1. The file would be set when promoting node2.

I've also thought of monitoring the cluster state from node1: when node1 comes back, it could detect that the rest of the cluster has elected a new master (by connecting to the standbys and checking the repmgr tables) and shut itself down (a rough sketch of such a check follows this message). However, this still leaves a short window where node1 will accept connections, and if it's a network issue that splits the clients as well, we'd have split brain immediately. So it's a no-go. You can only elect a new master when the old one is definitely killed.

So suppose a network split occurs that leaves some or all clients connected to the original master, along with, say, node3 from DC2, so COMMIT still works there. During the timeout before node2 kills node1, some clients will still write to node1, and this data doesn't make it to node2. During promotion it will then be determined that node3 has the latest data [1]. What will happen if its priority is zero, though?

I will need to test all this, but it looks like I'll have to allow automatic DC failover to occur and just set some low priority for nodes 3 and 4. Maybe I'll need a witness server as well then.

[1] https://github.com/2ndQuadrant/repmgr/blob/master/docs/repmgrd-failover-mechanism.md

On Tue, Aug 15, 2017 at 10:31 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> [quoted message trimmed]
--
Aleksander Kamenik
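[For what it's worth, a minimal sketch of the kind of check Aleksander describes above: from the old master, ask the peers whether one of them already answers as a primary, and if so refuse to serve anything locally. It comes with the caveat he already points out - it can only shrink the window, not close it. Host names and credentials are placeholders, and the actual "shut down locally" step is omitted; a real version would consult the repmgr metadata as well.]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SplitBrainCheck {
        public static void main(String[] args) throws Exception {
            String[] peers = {"node2", "node3", "node4"};  // the other cluster members
            for (String peer : peers) {
                String url = "jdbc:postgresql://" + peer + ":5432/postgres?connectTimeout=3";
                try (Connection c = DriverManager.getConnection(url, "repmgr", "secret");
                     Statement st = c.createStatement();
                     ResultSet rs = st.executeQuery("SELECT pg_is_in_recovery()")) {
                    if (rs.next() && !rs.getBoolean(1)) {
                        // A peer answers as a primary: we were evidently replaced
                        // while we were away, so stop serving anything locally.
                        System.err.println(peer + " is already a primary, shutting down here");
                        // e.g. run "pg_ctl stop -m fast" at this point; omitted
                        // because it is entirely site-specific.
                        System.exit(1);
                    }
                } catch (Exception e) {
                    // Peer unreachable: gives us no information either way.
                    System.err.println("could not check " + peer + ": " + e.getMessage());
                }
            }
            System.out.println("no other primary visible from this node");
        }
    }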
I recommend looking into Pacemaker if avoiding split-brain is a hard requirement. A proper solution requires:
- A mechanism to fence failed nodes, since "failed" really means "unknown". Without fencing there's a significant probability of split-brain. Pacemaker has a meatware fencing plugin which can be used on its own, or as a backup to automated fencing mechanisms.
- At least three nodes to establish quorum. Otherwise there's a risk that each half of a partition will try to fence the other, thinking the other half has failed.
- A non-trivial consensus protocol, one that's been mathematically studied and reviewed. Like encryption, this is a notoriously difficult problem and not the place for casually designed solutions.
and for cross-datacenter failover: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch15.html
https://aphyr.com/tags/jepsen is a good read on database consistency generally.