Thread: increasing HA

From
Thomas Poty
Date:
Hello,

I am looking for advice on solutions to increase high availability.

Here is a bit of context :

I have a master-slave architecture:
 - 1 master
 - 2 asynchronous slaves using replication slots
 - backups are made with Barman using a replication slot
 - WAL archiving is done to the Barman server

I think three areas could be improved (I am a total novice with these):

- using a proxy
 I found HAProxy.
 Could you advise any other solutions to explore, or share your experience?

- using automatic failover
 I found PAF.
 Could you advise any other solutions to explore, or share your experience?

- using a tool for fencing a failing node
Ideally, I imagine disabling network I/O to prevent clients from connecting and to cut exchanges between the backup server and the failing server, plus disabling automatic restart of the PostgreSQL service on that node.
Could you share your experience with this?

- Maybe another area to explore?
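A minimal sketch of the fencing idea above, assuming a Linux node where PostgreSQL listens on port 5432 and runs as the `postgresql` systemd unit (both are assumptions, adjust to your setup). The helper only builds the commands; actually executing them is left to a real fencing agent with root privileges:

```python
# Sketch of the fencing steps described above: block client traffic to
# PostgreSQL and disable automatic service restart. The port and unit
# name are hypothetical, not taken from the thread.

def fencing_commands(pg_port=5432, unit="postgresql"):
    """Return the shell commands a fencing agent would run on the failed node."""
    return [
        # drop incoming client connections to PostgreSQL
        ["iptables", "-I", "INPUT", "-p", "tcp", "--dport", str(pg_port), "-j", "DROP"],
        # stop the service and prevent it from starting again at boot
        ["systemctl", "stop", unit],
        ["systemctl", "disable", unit],
    ]

if __name__ == "__main__":
    for cmd in fencing_commands():
        # a real agent would run these with root privileges
        print(" ".join(cmd))
```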

Thank you

Thomas

RE: increasing HA

From
ROS Didier
Date:

Hi

               I have made a lot of PostgreSQL high-availability tests (more than 20 per solution), and the two following products meet the need well:

(1)    Repmgr (2ndQuadrant)

(2)    Pglookout (aiven)

 

About PAF, the product is hard to install and set up. It needs a Linux cluster and a team of system engineers to use it.

 

Best Regards

 



Didier ROS

Expertise SGBD

DS IT/IT DMA/Solutions Groupe EDF/Expertise Applicative - SGBD
Nanterre Picasso - E2 565D (aile nord-est)
32 Avenue Pablo Picasso
92000 Nanterre

 




Re: increasing HA

From
"Joshua D. Drake"
Date:
On 09/04/2018 07:58 AM, Thomas Poty wrote:
>
> - using an automatick failover
>  I found PAF
>  Could you advice any others solutions to explore or share your 
> experience?

LinuxHA or Patroni are the most common we run into.

JD

--

Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****



Re: increasing HA

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
Hi all,

On Tue, 4 Sep 2018 15:09:51 +0000
ROS Didier <didier.ros@edf.fr> wrote:

> Hi
>                I have made a lot of PostgreSQL High Availability tests (more
> than 20 by solution) and the two following products respond well to the need :
> 
> (1)    Repmgr (2ndQuadrant)
> 
> (2)    Pglookout (aiven)

Both solutions use a simple and naive implementation, which makes them easy to
use and administer. However, they leave the admin responsible for fencing,
which is a mandatory piece in almost any kind of DB cluster if you want to
cover most failure cases and avoid split-brain.

So yes, they are simple, because the complexity is left to the admin's skills.
It more or less requires you to rewrite and test part of Pacemaker's fencing
stack yourself. Good luck.

And I'm not even speaking about the watchdog here; I fail to imagine how an
admin could implement that himself.

Just consider how messy it is to deal with "logical fencing" when trying to do
it with PgBouncer.
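For context, a hedged sketch of what "logical fencing" through PgBouncer could look like: PgBouncer's admin console accepts PAUSE/RESUME commands, so a fencing script can pause the pool so no client reaches the fenced primary. The host, port, and user below are placeholders, not values from the thread:

```python
# Sketch of "logical fencing" via PgBouncer's admin console: PAUSE a
# pool so clients stop reaching the fenced primary. PAUSE/RESUME are
# real PgBouncer admin commands; connection details are hypothetical.

def fence_sql(database=None):
    """Admin command to pause one pool, or all pools when database is None."""
    return f"PAUSE {database};" if database else "PAUSE;"

def fence_argv(host="pgbouncer.example", port=6432, user="pgbouncer", database=None):
    """psql invocation against PgBouncer's special 'pgbouncer' admin database."""
    return ["psql", "-h", host, "-p", str(port), "-U", user,
            "pgbouncer", "-c", fence_sql(database)]

if __name__ == "__main__":
    # a fencing script would run this command; here we only print it
    print(" ".join(fence_argv(database="mydb")))
```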

In short: if you are ready to spend many dev/admin hours to build a safe HA
cluster for your DB and set strict requirements, those are fine.

> About PAF, the product is hard to install and set up . It need a linux
> cluster and a system engineers team to use it.

Indeed, Pacemaker has a steep learning curve and its documentation still
requires some improvement. But HA is not an easy subject. Just look at RHEL's
or SUSE's requirements before their teams accept to support your DB cluster
(spoiler: fencing).

Whatever solution you pick, you must **know and document** its limitations and
risks.


Re: increasing HA

From
Thomas Poty
Date:
Hi Jehan-Guillaume,

Thanks for your opinion.

At first glance, I may use PAF for automatic failover and HAProxy as a proxy. For fencing, I am a bit stuck; I don't know what to do or use.

How about you, do you have any preference about tools/solutions to use ?

Now, I am aware that I will have to check and document limitations/risks...

Re: increasing HA

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Wed, 5 Sep 2018 13:23:41 +0200
Thomas Poty <thomas.poty@gmail.com> wrote:

> Hi Jehan-Guillaume,

Hello,

> Thanks for your opinion.
> 
> At first glance, i may use for automatic failover PAF, 

In fact, PAF does not support slots. So it is not a good candidate if slots are
a requirement.

> a proxy HAproxy and
> for fencincg, i am a bit disappointed, i don't know what to do/use

That depends on your hardware or your virtualization technology.

> How about you, do you have any preference about tools/solutions to use ?

If you want a simple and widely adopted community solution, pick Patroni. It
deals with slots, relies on etcd or ZooKeeper, fits nicely with HAProxy, and
uses a watchdog to keep itself under monitoring. However, it lacks fencing and
its callbacks are asynchronous. You would have to take special care of your
network and master connectivity upon primary failure.
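As an illustration of how HAProxy typically fits with Patroni (an assumption about the setup, not something stated above): Patroni's REST API answers HTTP 200 on its leader endpoint only on the current primary, and HAProxy health checks use that status to route write traffic. A toy version of that routing decision:

```python
# Toy version of the HAProxy-with-Patroni routing decision: Patroni's
# REST API returns HTTP 200 on the leader health endpoint only on the
# current primary, so the proxy sends writes to the node answering 200.

def pick_primary(health):
    """health maps node name -> HTTP status of the Patroni leader check.

    Returns the single primary, or None (no primary, or a split view
    that a human or the DCS must resolve)."""
    primaries = [node for node, status in health.items() if status == 200]
    return primaries[0] if len(primaries) == 1 else None

if __name__ == "__main__":
    # pg1 is the leader; the standbys answer 503 on the leader check
    print(pick_primary({"pg1": 200, "pg2": 503, "pg3": 503}))  # pg1
```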

If you want something able to keep multiple services available (PostgreSQL,
vIP, storage, PgBouncer, Apache, whatever...) and to deal with dependencies,
locations, constraints, rules, etc., pick Pacemaker (and a larger coffee
machine). I would (obviously) recommend PAF as the resource agent for
PostgreSQL, but you would have to design your cluster without slots :/

++


Re: increasing HA

From
Thomas Poty
Date:
> In fact, PAF does not support slots. So it is not a good candidate if slot are
> a requirement.
Effectively, slots are a requirement we prefer to keep.

> > a proxy HAproxy and
> > for fencincg, i am a bit disappointed, i don't know what to do/use

> Depend on your hardware or your virtualization technology.
Our production cluster (master and slaves) runs in LXC containers. Each LXC container runs on an HPE blade server. The storage is on a 3PAR SAN array.
Any advice?

> > How about you, do you have any preference about tools/solutions to use ?

> If you want a simple and well community adopted solution, pick Patroni. It deals
> with slots, rely on etcd or zookeeper, fit nicely with haproxy, deal with
> watchdog to keep itself under monitor. However, it lacks of fencing and its
> callback are asynchronous. You would have to take special care of your
> network and master connectivity upon primary failure.

I am looking for information about this solution in their docs/IRC...
Your opinion about it matters to me, knowing that you maintain PAF :-)

> If you want something able to keep multiple services avaliable (PostgreSQL, vIP,
> storage, pgBouncer, apache, whatever...), deal with dependencies, locations,
> constraints, rules etc, pick Pacemaker (and a larger coffee machine). I would
> (obviously) recommend PAF as resource agent for PgSQL, but you would have to
> design your cluster without slots :/

many thanks


Re: increasing HA

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Wed, 5 Sep 2018 15:06:21 +0200
Thomas Poty <thomas.poty@gmail.com> wrote:

> > In fact, PAF does not support slots. So it is not a good candidate if  
> > slot are a requirement.  
> Effectively slots are a requirement we prefer to keep

OK, so either patch PAF yourself (not recommended) or choose something
else. Note that two other approaches work with Pacemaker:
  * the pgsql resource agent (see the PAF FAQ)
  * a shared-disk architecture (no PostgreSQL replication)

> > > a proxy HAproxy and
> > > for fencincg, i am a bit disappointed, i don't know what to do/use  
> > Depend on your hardware or your virtualization technology.  
> Our production cluster (master and slave) runs on LXC container. Each LXC
> container runs on a HPE Blade Server. The storage is on a SAN 3PAR array.
> Any advice ?

I suppose you could find fencing agents for:

* the blade itself, but that would fence all the containers running on it
* the access to the SAN from the failing container

I don't know if a fencing agent exists for a container itself. Note that I'm
not familiar with the container world; I lack a lot of knowledge about this
technology.

++


Re: increasing HA

From
Thomas Poty
Date:
> OK, so either patch PAF yourself (not recommended) or choose something
> else. Note that two other ways are working with Pacemaker:
>   * the pgsql resource agent (see FAQ of PAF)
>   * a shared disk architecture (no pgsql replication)
Probably I will be interested in the "patroni-etcd-haproxy" solution you suggested.

> I suppose you could find fencing agents for:

> * the blade itself, but it would fence all the container running on it
> * the access to the SAN from the failing container

> I don't know if fencing agent exists for a container itself. Note that I'm not
> familiar with the container world, I lack a lot of knowledge on this
> technology.

I am not familiar with this either. I heard about it for the first time a few days ago while reading about how to improve HA... :-)

Many thanks for your opinions and advice.


Re: increasing HA

From
Dmitri Maziuk
Date:
On Wed, 5 Sep 2018 13:23:41 +0200
Thomas Poty <thomas.poty@gmail.com> wrote:

> At first glance, i may use for automatic failover PAF, a proxy HAproxy and
> for fencincg, i am a bit disappointed, i don't know what to do/use

How many nodes do you want to run? Keep in mind that with only 2 nodes, fencing off the right one is by definition an
undecidable problem. That's one of the features of the current Linux HA stack that makes one want to go BSD/CARP.
 

And if you want to run fully distributed, you might want to look at Redis in the first place.

-- 
Dmitri Maziuk <dmaziuk@bmrb.wisc.edu>


Re: increasing HA

From
Thomas Poty
Date:
> How many nodes do you want run? Keep in mind that with only 2 nodes, fencing off the right one is by definition an undecidable problem. That's one of the features of the current linux HA stack that makes one want to go BSD/CARP.
>
> And if you want to run fully distributed, you might want to look at redis in the first place.
We have one master and 2 slaves, so 3 nodes.


Re: increasing HA

From
Dmitri Maziuk
Date:
On Wed, 5 Sep 2018 17:45:05 +0200
Thomas Poty <thomas.poty@gmail.com> wrote:

> We have one master and 2 slaves so 3 nodes

So what is the "HA" that you're trying to "increase"? Are you adding a 3rd slave? A 2nd master? A hot standby? All of
the above?

-- 
Dmitri Maziuk <dmaziuk@bmrb.wisc.edu>


Re: increasing HA

From
Thomas Poty
Date:
I want to get:
- automatic failover (promoting a slave)
- automatic routing of traffic to the master
- fencing in case of node failure

I already have 2 asynchronous slaves in hot standby mode + replication slots. I don't want to add a new node.




Re: increasing HA

From
Nicolas Karolak
Date:
I have a setup kind of like yours, one primary and two standbys, but
with streaming replication.

Here is what I use:
-> on the application (Django webapp) servers:
|---> a local HAProxy as a frontend for the database servers, doing the
fencing through a check against a custom daemon running on the database servers
-> on the PostgreSQL servers:
|---> repmgr for the automatic failover
|---> the custom daemon, which exposes the state of the node (by trying
to retrieve its state according to each node's repmgr status table and
doing a kind of vote, for example "if according to me I'm the primary
but not according to the two other nodes, I'm fenced")
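The vote described above can be sketched as a small function (the names are mine, not taken from the daemon itself): a node that believes it is primary but is outvoted by its peers fences itself.

```python
# Sketch of the self-fencing vote described above: each node compares
# its own view ("am I primary?") with what the other nodes' repmgr
# status says about it, and fences itself when outvoted.

def should_self_fence(i_think_primary, peer_views):
    """peer_views: list of booleans, one per peer, True if that peer
    also considers this node the primary."""
    if not i_think_primary:
        return False  # a standby has nothing to fence
    agree = sum(1 for v in peer_views if v)
    # outvoted: fewer than half of the peers agree that we are primary
    return agree * 2 < len(peer_views)

if __name__ == "__main__":
    # "if according to me I'm the primary but not according to the two
    # other nodes, I'm fenced"
    print(should_self_fence(True, [False, False]))  # True
```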

Regards,
Nicolas KAROLAK | SysAdmin
6-8, rue André Voguet - 94200 Ivry-sur-Seine
+33 1 77 56 77 96 | www.ubicast.eu | www.ubicast.tv



Re: increasing HA

From
Dimitri Maziuk
Date:
On 09/05/2018 12:04 PM, Thomas Poty wrote:
> I want to get :
> Automatic failover (promoting a slave)
> Automatic Routing traffic to master
> Fencing in case of node failure.

Why would you want to fence a node that has already failed? You want to
fence off the master during the failover so it does not accept any more
DB updates.

Anyway, here's how we run:
- We provide services to customers; the customer-facing service access
points are where we want to reduce downtime.
- We buy decent-quality hardware with redundant disks and power
circuits. 99.9999% of our downtime is maintenance reboots.
- We carefully partition our services so that most of them require only
read-only access. Those run off the slaves.
- The master is not even visible to the customers. Its downtime does not
affect service availability.

Obviously, this works because we can run read-only. If you need
read-write on a public-facing database, the above won't work for you.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu

