Re: Patch for fail-back without fresh backup - Mailing list pgsql-hackers

From Benedikt Grundmann
Subject Re: Patch for fail-back without fresh backup
Msg-id CADbMkNPivxK2Vi=yLCzz_pCt-zpC0mR38qvVvS4ydxj+q7HLDQ@mail.gmail.com
In response to Patch for fail-back without fresh backup  (Samrat Revagade <revagade.samrat@gmail.com>)
Responses Re: Patch for fail-back without fresh backup
Re: Patch for fail-back without fresh backup
List pgsql-hackers



On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade <revagade.samrat@gmail.com> wrote:

Hello,


We have already started a discussion on pgsql-hackers about the problem of having to take a fresh backup during the failback operation; here is the link:

 

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

 

Let me again summarize the problem we are trying to address.

 

When the master fails, the last few WAL files may not reach the standby. However, the master may already have gone ahead and made changes to its local file system after flushing the WAL to local storage. So the master contains some file-system-level changes that the standby does not have. At this point, the master's data directory is ahead of the standby's data directory.

Subsequently, the standby is promoted to be the new master. Later, when the old master wants to become a standby of the new master, it cannot simply join the setup, because the two servers are now inconsistent. We need to take a fresh backup from the new master. This can happen with both synchronous and asynchronous replication.
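To make the divergence concrete, here is a toy model (hypothetical, not PostgreSQL code). It assumes a WAL record can be reduced to an `(lsn, block_id)` pair, with LSNs modelled as plain integers, and that a "data directory" is just a map from block to the LSN of the last change written to it. The master writes pages for every record it flushed locally, while the standby only has what was shipped before the crash.

```python
# Hypothetical WAL stream: (lsn, block_id) pairs.
records = [(100, "blk1"), (200, "blk2"), (300, "blk3"), (400, "blk1")]

def apply(records, upto_lsn):
    """Return the data directory (block_id -> LSN of last change) after
    applying every record with lsn <= upto_lsn."""
    data_dir = {}
    for lsn, block in records:
        if lsn <= upto_lsn:
            data_dir[block] = lsn
    return data_dir

# The master flushed all WAL locally and wrote the pages to disk,
# but it failed before records past LSN 200 reached the standby.
master_dir = apply(records, upto_lsn=400)
standby_dir = apply(records, upto_lsn=200)

# Blocks whose on-disk state exists only on the old master.
diverged = sorted(b for b, lsn in master_dir.items() if standby_dir.get(b) != lsn)
print(diverged)  # ['blk1', 'blk3']
```

Once the standby is promoted, `blk1` and `blk3` on the old master reflect changes the new master never saw, which is why the old master cannot simply reconnect as a standby.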

 

A fresh backup is also needed in the case of a clean switch-over because, in the current HEAD, the master does not wait for the standby to receive all the WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle the clean switch-over case, but the problem remains for the failback case.
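The clean switch-over requirement can be sketched as a shutdown step that does not close the replication connection until the standby has the shutdown checkpoint record. This is only an illustration of the waiting rule, not Fujii Masao's patch: the function name, the polling loop, the timeout, and the `standby_received_lsn` callback are all hypothetical, and LSNs are modelled as plain integers.

```python
import itertools
import time
from typing import Callable

def wait_for_standby_before_shutdown(
    shutdown_ckpt_lsn: int,
    standby_received_lsn: Callable[[], int],
    timeout_s: float = 5.0,
    poll_s: float = 0.01,
) -> bool:
    """Hypothetical shutdown step: keep the replication connection open until
    the standby reports WAL received up to the shutdown checkpoint record.
    Returns True if the standby caught up before the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if standby_received_lsn() >= shutdown_ckpt_lsn:
            return True  # standby has everything; safe to close the connection
        time.sleep(poll_s)
    return False  # standby never caught up; a fresh backup may still be needed

# Simulated standby whose reported position advances on each poll.
positions = itertools.chain([100, 250], itertools.repeat(500))
print(wait_for_standby_before_shutdown(500, lambda: next(positions)))  # True
```

Note that this only covers a planned shutdown; when the master crashes there is no shutdown step in which to wait, which is why the failback case needs a different mechanism.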

 

Taking a fresh backup is very time-consuming when databases are very large, say several TB, and when the servers are connected over a relatively slow link. This would break the service level agreement of a disaster recovery system, so the disaster recovery process in PostgreSQL needs to be improved. One way to achieve this is to maintain consistency between the master and the standby, which avoids the need for a fresh backup.
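A back-of-the-envelope calculation shows the scale of the problem. The sizes and link speeds below are illustrative assumptions, not measurements from this thread, and the result is a lower bound because it ignores protocol overhead, disk throughput, and contention (the `efficiency` factor can be lowered to account for those).

```python
def backup_copy_hours(size_tb: float, link_mbit_s: float, efficiency: float = 1.0) -> float:
    """Rough lower bound on the time to copy a base backup over the network.
    size_tb uses decimal terabytes (1 TB = 10**12 bytes); link_mbit_s is the
    nominal link speed in megabits per second."""
    bits = size_tb * 10**12 * 8
    seconds = bits / (link_mbit_s * 10**6 * efficiency)
    return seconds / 3600

print(round(backup_copy_hours(4, 1000), 1))  # 4 TB over 1 Gbit/s   -> 8.9
print(round(backup_copy_hours(4, 100), 1))   # 4 TB over 100 Mbit/s -> 88.9
```

Even under ideal conditions, a multi-terabyte re-copy takes hours to days, during which the old master is unavailable as a standby.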

 

So our proposal for this problem is to ensure that the master does not make any file-system-level changes without confirming that the corresponding WAL record has been replicated to the standby.
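One way to picture the proposed rule is as an extension of the usual write-ahead check performed before a dirty data page is written out: in addition to the WAL up to the page's LSN being flushed locally, the standby must have confirmed it as well. The sketch below is a hypothetical model of that check, not the patch itself; `WalState`, `may_write_page`, and `flush_dirty_pages` are illustrative names, and LSNs are plain integers.

```python
from dataclasses import dataclass

@dataclass
class WalState:
    local_flush_lsn: int    # WAL durably flushed on the master
    standby_flush_lsn: int  # WAL the standby has confirmed as flushed

def may_write_page(page_lsn: int, wal: WalState) -> bool:
    """Hypothetical check run before a dirty data page is written to disk.
    Write-ahead rule: WAL up to page_lsn must be flushed locally.
    Proposed extra rule: the standby must also have confirmed it."""
    return page_lsn <= wal.local_flush_lsn and page_lsn <= wal.standby_flush_lsn

def flush_dirty_pages(dirty: dict, wal: WalState):
    """Split dirty pages (block_id -> page_lsn) into those safe to write now
    and those that must wait for the standby to catch up."""
    written, deferred = [], []
    for block, page_lsn in sorted(dirty.items()):
        (written if may_write_page(page_lsn, wal) else deferred).append(block)
    return written, deferred

wal = WalState(local_flush_lsn=400, standby_flush_lsn=200)
print(flush_dirty_pages({"blk1": 400, "blk2": 200, "blk3": 300}, wal))
# (['blk2'], ['blk1', 'blk3'])
```

In this model the master's data directory can never get ahead of what the standby has, at the cost that page writes stall whenever the standby lags behind.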

 

An alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL): provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen.
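PostgreSQL's WAL records are redo records, so the old master cannot simply "undo" them on its own. One hypothetical way to approximate the rollback is to find every block touched by the old master's WAL after the divergence point and replace only those blocks with the new master's copies, instead of copying the whole data directory. The helpers below are illustrative names using the same toy `(lsn, block_id)` model; they are an assumption about how such a rollback could work, not a description of any existing tool.

```python
def blocks_to_refetch(old_master_wal, divergence_lsn):
    """Hypothetical helper: collect every block touched by WAL records that
    the old master wrote after the divergence point."""
    return sorted({block for lsn, block in old_master_wal if lsn > divergence_lsn})

def rewind(old_dir, new_master_dir, blocks):
    """Replace the listed blocks with the new master's copies; a block the
    new master does not have is dropped from the old master."""
    fixed = dict(old_dir)
    for block in blocks:
        if block in new_master_dir:
            fixed[block] = new_master_dir[block]
        else:
            fixed.pop(block, None)
    return fixed

old_master_wal = [(100, "blk1"), (200, "blk2"), (300, "blk3"), (400, "blk1")]
old_dir = {"blk1": 400, "blk2": 200, "blk3": 300}
new_master_dir = {"blk1": 100, "blk2": 200}  # standby state at promotion

blocks = blocks_to_refetch(old_master_wal, divergence_lsn=200)
print(blocks)                                   # ['blk1', 'blk3']
print(rewind(old_dir, new_master_dir, blocks))  # {'blk1': 100, 'blk2': 200}
```

In this toy model the new master has made no changes of its own since promotion; a real mechanism would also have to bring the old master up to date with those, for example by streaming the new master's WAL from the divergence point afterwards.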

There have been many suggestions and objections on pgsql-hackers about this problem. A brief summary follows:

 
