Re: Postgres HA - pacemaker RA do not support auto failback - Mailing list pgsql-general

From Jehan-Guillaume (ioguix) de Rorthais
Subject Re: Postgres HA - pacemaker RA do not support auto failback
Msg-id 20190814002511.6cd5a68a@firost
In response to Re: Fwd: Postgres HA - pacemaker RA do not support auto failback  (Shital A <brightuser2019@gmail.com>)
List pgsql-general
On Tue, 13 Aug 2019 16:38:44 +0530
Shital A <brightuser2019@gmail.com> wrote:

> On Tue, 13 Aug 2019, 11:50 Shital A, <brightuser2019@gmail.com> wrote:
> 
> > On Mon, 12 Aug 2019, 18:54 Adrien Nayrat, <adrien.nayrat@anayrat.info>
> > wrote:
> >  
> >> On 8/12/19 2:57 PM, Shital A wrote:  
> >> > Postgres version : 9.6
> >> > OS:Rhel 7.6
> >> >
> >> > We are working on HA setup for postgres cluster of two nodes in  
> >> > active-passive mode.
> >> >
> >> > Installed:
> >> > Pacemaker 1.1.19
> >> > Corosync 2.4.3
> >> >
> >> > The pacemaker agent with this installation doesn't support automatic  
> >> > failback.  
> >> > What I mean by that is explained below:
> >> > 1. Cluster is setup like A - B with A as master.
> >> > 2. Kill services on A, node B will come up as master.
> >> > 3. node A is ready to join the cluster, we have to delete the lock file  
> >> > it creates on any one of the node and execute the cleanup command to
> >> > get the node back as standby
> >> >
> >> > Step 3 is manual so HA is not achieved in real sense.
> >> >
> >> > Please help to check:
> >> > 1. Is there any version of the resource agent which supports automatic  
> >> > failback?  
> >> > To avoid generation of lock file and deleting it.
> >> >
> >> > 2. If there is no such support, what checks should be added in pgsql RA  
> >> > to achieve 
> >>
> >> Which RA did you use? AFAIK there are two RAs:
> >> - pgsql :
> >> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
> >> - PAF : http://clusterlabs.github.io/PAF/ (Which is different from pgsql
> >> RA :
> >> http://clusterlabs.github.io/PAF/FAQ.html#why-new-ra-for-postgresql)
> >>
> >> If I am right, PAF should handle the case where there is no need to perform
> >> pg_rewind (if the old master is not too far ahead in the transaction log).

PAF only makes sure a controlled failover (aka switchover) is performed safely,
so that the old master hooks back in as a standby correctly.

Should a failure occur and trigger an automatic failover, PAF has no dark magic
to automagically fail back the failed master as a standby.

> > We are using the pgsql RA installed with pacemaker.
> >
> > Is PAF recommended over pgsql?

Yes, as much as the historical RA.

> > Has anyone changed pgsql to handle the lock file and other cases that
> > might not have been handled?

I don't know. However, I do think it would be presumptuous to go down this
road today.

> Looks like the previous email bounced, received undelivered message.
> Sending again. We are using the pgsql RA installed with pacemaker and seems
> like it prevents failback by creating a lock file. We want to implement
> auto failback and need pointers on how to implement.

The lock file is there because PgSQL has no safety belt against data corruption
when an old master hooks onto a new one while its LSN is further ahead than
the timeline fork. You must take care of this yourself.
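
As a rough illustration of that check, here is a minimal Python sketch. It
assumes you have already collected two LSNs as text, however you obtain them:
the last position written by the old master, and the position where the new
master forked to its new timeline (for instance from its timeline history
file). The function names are hypothetical and not part of any RA.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to an integer.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits, then the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.strip().split("/")
    return (int(high, 16) << 32) | int(low, 16)


def old_master_has_diverged(old_master_lsn: str, fork_lsn: str) -> bool:
    """Return True when the old master wrote WAL past the fork point.

    In that case it must not simply be restarted as a standby: it needs
    pg_rewind or a full rebuild first.
    """
    return parse_lsn(old_master_lsn) > parse_lsn(fork_lsn)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(old_master_has_diverged("0/3000060", "0/3000028"))  # True
    print(old_master_has_diverged("0/3000028", "0/3000028"))  # False
```

Comparing the two halves as one 64-bit integer avoids the classic mistake of
comparing LSNs as plain strings.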

Moreover, should a failover occur, you should probably check WHY it occurred
and fix things before blindly failing back a potentially wounded instance. A
wounded instance can hurt your cluster in many different and uncomfortable
ways.

Last, as Adrien wrote, you can manually use pg_rewind to revert the old master
to the timeline fork, as long as you enabled data checksums or
'wal_log_hints=on' and kept enough WAL around.
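
A minimal Python sketch of that manual step could look like the following. It
only checks the prerequisite and builds the command line, it does not run
anything. The helper names, the data directory and the connection string are
assumptions made up for the example; the old master must be cleanly shut down
before pg_rewind is actually run.

```python
import shlex


def rewind_prerequisites_met(settings: dict) -> bool:
    """pg_rewind needs data checksums or wal_log_hints on the target.

    `settings` maps parameter names to their textual value as printed by
    SHOW, collected by whatever means you have (hypothetical input).
    """
    return (
        settings.get("data_checksums") == "on"
        or settings.get("wal_log_hints") == "on"
    )


def build_pg_rewind_command(target_pgdata: str, source_conninfo: str,
                            dry_run: bool = True) -> list:
    """Build (but do not run) a pg_rewind command line."""
    cmd = [
        "pg_rewind",
        "--target-pgdata=" + target_pgdata,
        "--source-server=" + source_conninfo,
        "--progress",
    ]
    if dry_run:
        # Defaults to a dry run so nothing is modified by accident.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    settings = {"data_checksums": "off", "wal_log_hints": "on"}
    if rewind_prerequisites_met(settings):
        cmd = build_pg_rewind_command(
            "/var/lib/pgsql/9.6/data",
            "host=nodeB port=5432 user=postgres dbname=postgres",
        )
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Start with the dry run, read its output, and only then run it for real.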

Another option is just to make sure you are able to build a new standby as
fast as possible.
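
For instance, a small Python helper could prepare the pg_basebackup call ahead
of time so a rebuild is a one-liner when you need it. This is only a sketch:
the host, user and directory below are made-up values, and your resource agent
may want to manage recovery.conf itself, in which case drop
--write-recovery-conf.

```python
import shlex


def build_pg_basebackup_command(primary_host: str, replication_user: str,
                                new_pgdata: str) -> list:
    """Build (but do not run) a pg_basebackup command for a fresh standby."""
    return [
        "pg_basebackup",
        "--host=" + primary_host,
        "--username=" + replication_user,
        "--pgdata=" + new_pgdata,      # must be empty or not exist yet
        "-X", "stream",                # stream WAL while the copy runs
        "--checkpoint=fast",           # do not wait for a spread checkpoint
        "--write-recovery-conf",       # have the tool write the standby setup
        "--progress",
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_pg_basebackup_command(
        "nodeB", "replicator", "/var/lib/pgsql/9.6/data"
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Time a rebuild on a realistic data set once, so you know how long the cluster
will run without a standby.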

But remember: if your data is important, use quorum, fencing and watchdog.

Good luck.


