Thread: Patch for fail-back without fresh backup

Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

14 June 2013, 09:12:00

Hello,

We have already started a discussion on pgsql-hackers for the problem of taking fresh backup during the failback operation here is the link for that:

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to the local storage. So master contains some file system level changes that standby does not have. At this point, the data directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master. Later when the old master wants to be a standby of the new master, it can't just join the setup since there is inconsistency in between these two servers. We need to take the fresh backup from the new master. This can happen in both the synchronous as well as asynchronous replication.

Fresh backup is also needed in case of clean switch-over because in the current HEAD, the master does not wait for the standby to receive all the WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle clean switch-over case, but the problem is still remaining for failback case.

The process of taking fresh backup is very time consuming when databases are of very big sizes, say several TB's, and when the servers are connected over a relatively slower link. This would break the service level agreement of disaster recovery system. So there is need to improve the process of disaster recovery in PostgreSQL. One way to achieve this is to maintain consistency between master and standby which helps to avoid need of fresh backup.

So our proposal on this problem is that we must ensure that master should not make any file system level changes without confirming that the corresponding WAL record is replicated to the standby.

There are many suggestions and objections pgsql-hackers about this problem The brief summary is as follows:

1. The main objection was raised by Tom and others is that we should not add this feature and should go with traditional way of taking fresh backup using the rsync, because he was concerned about the additional complexity of the patch and the performance overhead during normal operations.

2. Tom and others were also worried about the inconsistencies in the crashed master and suggested that its better to start with a fresh backup. Fujii Masao and others correctly countered that suggesting that we trust WAL recovery to clear all such inconsistencies and there is no reason why we can't do the same here.

3. Someone is suggested using rsync with checksum, but many pages on the two servers may differ in their binary values because of hint bits etc.

4. The major objection for failback without fresh backup idea was it may introduce on performance overhead and complexity to the code. By looking at the patch I must say that patch is not too complex. For performance impact I tested patch with pgbench which shows that it has very small performance overhead. Please refer the test results included at end of mail.

*Proposal to solve the problem*

The proposal is based on the concept of master should not do any file system level change until corresponding WAL record is replicated to the standby.

There are many places in the code which need to be handled to support the proposed solution. Following cases explains the need of fresh backup at the time of failover, and how can we avoid this need by our approach.

1. We must not write any heap pages to the disk before the WAL records corresponding to those changes are received by the standby. Otherwise if standby failed to receive WAL corresponding to those heap pages there will be inconsistency.

2. When CHECKPOINT happens on the master, control file of master gets updated and last checkpoint record is written to it. Suppose failover happens and standby fails to receive the WAL record corresponding to CHECKPOINT, then master and standby has inconsistent copies of control file that leads to the mismatch in redo record and recovery will not start normally. To avoid this situation we must not update the control file of master before the corresponding checkpoint WAL record is received by the standby

3. Also when we truncate any of the physical files on the master and suppose the standby failed to receive corresponding WAL, then that physical file is truncated on master but still available on standby causing inconsistency. To avoid this we must not truncate physical files on the master before the WAL record corresponding to that operation is received by the standby.

4. Same case applies to CLOG pages. If CLOG page is written to the disk and corresponding WAL record is not replicated to the standby, leads to the inconsistency. So we must not write the CLOG pages (and may be other SLRU pages too) to the disk before the corresponding WAL records are received by standby.

5. The same problem applies for the commit hint bits. But it is more complicated than the other problems, because no WAL records are generated for that, hence we cannot apply the same above method, that is wait for corresponding WAL record to be replicated on standby. So we delay the processes of updating the commit hint bits, similar to what is done by asynchronous commits. In other words we need to check if the WAL corresponding to the transaction commit is received by the failback safe standby and then only allow hint bit updates.

*Patch explanation:*

The initial work on this patch is done by Pavan Deolasee. I tested it and will make further enhancements based on the community feedback.

This patch is not complete yet, but I plan to do so with the help of this community. At this point, the primary purpose is to understand the complexities and get some initial performance numbers to alleviate some of the concerns raised by the community.

There are two GUC parameters which supports this failsafe standby

1. failback_safe_standby_name [ name of the failsafe standby ] It is the name of failsafe standby. Master will not do any file system level change before corresponding WAL is replicated on the this failsafe standby

2. failback_safe_standby_mode [ off/remote_write/remote_flush] This parameter specifies the behavior of master i.e. whether it should wait for WAL to be written on standby or WAL to be flushed on standby. We should turn it off when we do not want the failsafe standby. This failsafe mode can be combined with synchronous as well as asynchronous streaming replication.

Most of the changes are done in the syncrep.c. This is a slight misnomer because that file deals with synchronous standby and a failback standby could and most like be a async standby. But keeping the changes this way has ensured that the patch is easy to read. Once we have acceptance on the approach, the patch can be modified to reorganize the code in a more logical way.

The patch adds a new state SYNC_REP_WAITING_FOR_FAILBACK_SAFETY to the sync standby states. A backend which is waiting for a failback safe standby to receive WAL records, will wait in this state. Failback safe mechanism can work in two different modes, that is wait for WAL to be written or flushed on failsafe standby. That is represented by two new modes SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH respectively.

Also the SyncRepWaitForLSN() is changed for conditional wait. So that we can delay hint bit updates on master instead of blocking the wait for the failback safe standby to receiver WAL's.

*Benchmark tests*

*PostgreSQL versions:* PostgreSQL 9.3beta1

*Usage:* For operating in failsafe mode you need to configure following two GUC parameters:

1. failback_safe_standby_name

2.failback_safe_standby_mode

*Performance impact:*

The test are performed on the servers having 32 GB RAM, checkpoint_timeout is set to 10 minutes so that checkpoint will happen more frequently. Checkpoint involves flushing all the dirty blocks to the disk and we wanted to primarily test that code path.

pgbech settings:

Transaction type: TPC-B

Scaling factor: 100

Query mode: simple

Number of clients: 100

Number of threads: 1

Duration: 1800 s

Following table shows the average TPS measured for each scenario. We conducted 3 tests for each scenario

1) Synchronous Replication - 947 tps

2) Synchronous Replication + Failsafe standby (off) - 934 tps

3) Synchronous Replication + Failsafe standby (remote_flush) - 931 tps

4) Asynchronous Replication - 1369 tps

5) Asynchronous Replication + Failsafe standby (off) - 1349 tps

6) Asynchronous Replication + Failsafe standby (remote_flush) - 1350 tps

By observing the table we can conclude following:

1. Streaming replication + failback safe:

a) On an average, synchronous replication combined with failsafe standby (remote_flush) causes 1.68 % performance overhead.

b) On an average, asynchronous streaming replication combined with failsafe standby (remote_flush) causes averagely 1.38 % performance degradation.

2. Streaming replication + failback safe (turned off):

a) Averagely synchronous replication combined with failsafe standby

(off) causes 1.37 % performance overhead.

b) Averagely asynchronous streaming replication combined with failsafe standby (off) causes averagely 1.46 % performance degradation.

So the patch is showing 1-2% performance overhead.

Please give your suggestions if there is a need to perform tests for other scenario.

*Improvements (To-do):*

1. Currently this patch supports only one failback safe standby. It can either be synchronous or an asynchronous standby. We probably need to discuss whether it needs to be changed for support of multiple failsafe standby's.

2. Current design of patch will wait forever for the failback safe standby. Streaming replication also has same limitation. We probably need to discuss whether it needs to be changed.

There are couples of more places that probably need some attention and I have marked them with XXX

Thank you,

Samrat

Attachment

failback_safe_standby.patch

Re: Patch for fail-back without fresh backup

From

Benedikt Grundmann

Date:

14 June 2013, 09:23:35

On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade <revagade.samrat@gmail.com> wrote:

Hello,

We have already started a discussion on pgsql-hackers for the problem of taking fresh backup during the failback operation here is the link for that:

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But the master may have gone ahead and made changes to its local file system after flushing WAL to the local storage. So master contains some file system level changes that standby does not have. At this point, the data directory of master is ahead of standby's data directory.
Subsequently, the standby will be promoted as new master. Later when the old master wants to be a standby of the new master, it can't just join the setup since there is inconsistency in between these two servers. We need to take the fresh backup from the new master. This can happen in both the synchronous as well as asynchronous replication.

Fresh backup is also needed in case of clean switch-over because in the current HEAD, the master does not wait for the standby to receive all the WAL up to the shutdown checkpoint record before shutting down the connection. Fujii Masao has already submitted a patch to handle clean switch-over case, but the problem is still remaining for failback case.

The process of taking fresh backup is very time consuming when databases are of very big sizes, say several TB's, and when the servers are connected over a relatively slower link. This would break the service level agreement of disaster recovery system. So there is need to improve the process of disaster recovery in PostgreSQL. One way to achieve this is to maintain consistency between master and standby which helps to avoid need of fresh backup.

So our proposal on this problem is that we must ensure that master should not make any file system level changes without confirming that the corresponding WAL record is replicated to the standby.

A alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL). Provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen.

There are many suggestions and objections pgsql-hackers about this problem The brief summary is as follows:

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

14 June 2013, 09:48:19

That will not happen if there is inconsistency in between both the servers.

Please refer to the discussions on the link provided in the first post:

http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Regards,

Samrat Revgade

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 10:43:08

On 14.06.2013 12:11, Samrat Revagade wrote:
> We have already started a discussion on pgsql-hackers for the problem of
> taking fresh backup during the failback operation here is the link for that:
>
> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com
>
> Let me again summarize the problem we are trying to address.
>
> When the master fails, last few WAL files may not reach the standby. But
> the master may have gone ahead and made changes to its local file system
> after flushing WAL to the local storage.  So master contains some file
> system level changes that standby does not have.  At this point, the data
> directory of master is ahead of standby's data directory.
>
> Subsequently, the standby will be promoted as new master.  Later when the
> old master wants to be a standby of the new master, it can't just join the
> setup since there is inconsistency in between these two servers. We need to
> take the fresh backup from the new master.  This can happen in both the
> synchronous as well as asynchronous replication.

Did you see the thread on the little tool I wrote called pg_rewind?

http://www.postgresql.org/message-id/519DF910.4020609@vmware.com

It solves that problem, for both clean and unexpected shutdown. It needs 
some more work and a lot more testing, but requires no changes to the 
backend. Robert Haas pointed out in that thread that it has a problem 
with hint bits that are not WAL-logged, but it will still work if you 
also enable the new checksums feature, which forces hint bit updates to 
be WAL-logged. Perhaps we could add a GUC to enable hint bits to be 
WAL-logged, regardless of checksums, to make pg_rewind work.

I think that's a more flexible approach to solve this problem. It 
doesn't require an online feedback loop from the standby to master, for 
starters.

- Heikki

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

14 June 2013, 11:07:08

On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Robert Haas pointed out in that thread that it has a problem with hint bits that are not WAL-logged,

I liked that tool a lot until Robert pointed out the above problem. I thought this is a show stopper because I can't really see any way to circumvent it unless we enable checksums or explicitly WAL log hint bits.

but it will still work if you also enable the new checksums feature, which forces hint bit updates to be WAL-logged.

Are we expecting a lot of people to run their clusters with checksums on ? Sorry, I haven't followed the checksum discussions and don't know how much overhead it causes. But if the general expectation is that checksums will be turned on most often, I agree pg_rewind is probably good enough.

Perhaps we could add a GUC to enable hint bits to be WAL-logged, regardless of checksums, to make pg_rewind work.

Wouldn't that be too costly ? I mean, in the worst case every hint bit on a page may get updated separately. If each such update is WAL logged, we are looking for a lot more unnecessary WAL traffic.

I think that's a more flexible approach to solve this problem. It doesn't require an online feedback loop from the standby to master, for starters.

I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work with 9.3 and below because of the hint bits issue, otherwise it would have been even more cool.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

14 June 2013, 11:09:28

On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann <bgrundmann@janestreet.com> wrote:

A alternative proposal (which will probably just reveal my lack of understanding about what is or isn't possible with WAL). Provide a way to restart the master so that it rolls back the WAL changes that the slave hasn't seen.

WAL records in PostgreSQL can only be used for physical redo. They can not be used for undo. So what you're suggesting is not possible though I am sure a few other databases do that.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 11:20:18

On 14.06.2013 14:06, Pavan Deolasee wrote:
> On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>>   Robert Haas pointed out in that thread that it has a problem with hint
>> bits that are not WAL-logged,
>
> I liked that tool a lot until Robert pointed out the above problem. I
> thought this is a show stopper because I can't really see any way to
> circumvent it unless we enable checksums or explicitly WAL log hint bits.
>
>> but it will still work if you also enable the new checksums feature, which
>> forces hint bit updates to be WAL-logged.
>
> Are we expecting a lot of people to run their clusters with checksums on ?
> Sorry, I haven't followed the checksum discussions and don't know how much
> overhead it causes. But if the general expectation is that checksums will
> be turned on most often, I agree pg_rewind is probably good enough.

Well, time will tell I guess. The biggest overhead with the checksums is 
exactly the WAL-logging of hint bits.

>> Perhaps we could add a GUC to enable hint bits to be WAL-logged,
>> regardless of checksums, to make pg_rewind work.
>
> Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
> page may get updated separately. If each such update is WAL logged, we are
> looking for a lot more unnecessary WAL traffic.

Yep, same as with checksums. I was not very enthusiastic about the 
checksums patch because of that, but a lot of people are willing to pay 
that price. Maybe we can figure out a way to reduce that cost in 9.4. 
It'd benefit the checksums greatly.

For pg_rewind, we wouldn't actually need a full-page image for hint bit 
updates, just a small record saying "hey, I touched this page". And 
you'd only need to write that the first time a page is touched after a 
checkpoint.

>> I think that's a more flexible approach to solve this problem. It doesn't
>> require an online feedback loop from the standby to master, for starters.
>
> I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
> with 9.3 and below because of the hint bits issue, otherwise it would have
> been even more cool.

The proposed patch is clearly not 9.3 material either. If anything, 
there's a much better change that we could still sneak in a GUC to allow 
hint bits to be WAL-logged without checksums in 9.3. All the code is 
there, it'd just be a new guc to control it separetely from checksums.

- Heikki

Re: Patch for fail-back without fresh backup

From

Greg Stark

Date:

14 June 2013, 11:30:44

On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> For pg_rewind, we wouldn't actually need a full-page image for hint bit
> updates, just a small record saying "hey, I touched this page". And you'd
> only need to write that the first time a page is touched after a checkpoint.

I would expect that to be about the same cost though. The latency for
the fsync on the wal record before being able to flush the buffer is
the biggest cost.

> The proposed patch is clearly not 9.3 material either. If anything, there's
> a much better change that we could still sneak in a GUC to allow hint bits
> to be WAL-logged without checksums in 9.3. All the code is there, it'd just
> be a new guc to control it separetely from checksums.

On the other hand if you're going to wal log the hint bits why not
enable checksums?

Do we allow turning off checksums after a database is initdb'd? IIRC
we can't turn it on later but I don't see why we couldn't turn them
off.

-- 
greg

Re: Patch for fail-back without fresh backup

From

Tom Lane

Date:

14 June 2013, 13:08:25

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Well, time will tell I guess. The biggest overhead with the checksums is 
> exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.  Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.
        regards, tom lane

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

14 June 2013, 13:15:44

On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> Hello,

> We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:
 
>
http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
JgWrFu513s+Q@mail.gmail.com
 
> Let me again summarize the problem we are trying to address.
 
> When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after > flushing WAL to the local storage.  So master contains some file
system level changes that standby does not have.  At this point, the data
directory of > master is ahead of standby's data directory.
> Subsequently, the standby will be promoted as new master.  Later when the
old master wants to be a standby of the new master, it can't just join the
> setup since there is inconsistency in between these two servers. We need
to take the fresh backup from the new master.  This can happen in both the
> synchronous as well as asynchronous replication.
 
> Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL
> up to the shutdown checkpoint record before shutting down the connection.
Fujii Masao has already submitted a patch to handle clean switch-over case,
> but the problem is still remaining for failback case.
 
> The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
> over a relatively slower link.  This would break the service level
agreement of disaster recovery system.  So there is need to improve the
process of
> disaster recovery in PostgreSQL.  One way to achieve this is to maintain
consistency between master and standby which helps to avoid need of fresh
> backup.
 
> So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
> corresponding WAL record is replicated to the standby.
  How will you take care of extra WAL on old master during recovery. If it
plays the WAL which has not reached new-master, it can be a problem.

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

14 June 2013, 13:15:49

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > Well, time will tell I guess. The biggest overhead with the checksums is 
> > exactly the WAL-logging of hint bits.
> 
> Refresh my memory as to why we need to WAL-log hints for checksumming?
> I just had my nose in the part of the checksum patch that tediously
> copies entire pages out of shared buffers to avoid possible instability
> of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay to early...). So I am very happy that you are looking at it.

Jeff Davis and I were talking about whether the usage of
PGXAC->delayChkpt makes the whole thing sufficiently safe at pgcon - we
couldn't find any real danger but...

> Given that we're
> paying that cost, I don't see why we'd need to do any extra WAL-logging
> (above and beyond the log-when-freeze cost that we have to pay already).
> But I've not absorbed any caffeine yet today, so maybe I'm just missing
> it.

The usual torn page spiel I think. If we crash while only one half of
the page made it to disk we would get spurious checksum failures from
there on.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 13:17:36

On 14.06.2013 16:08, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> Well, time will tell I guess. The biggest overhead with the checksums is
>> exactly the WAL-logging of hint bits.
>
> Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. <crash>

If the page is torn, the checksum won't match. Without checksums, a torn 
page is not a problem with hint bits, as a single bit can't be torn and 
the page is otherwise intact. But with checksums, it causes a checksum 
failure.

- Heikki

Re: Patch for fail-back without fresh backup

From

Tom Lane

Date:

14 June 2013, 13:21:56

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 14.06.2013 16:08, Tom Lane wrote:
>> Refresh my memory as to why we need to WAL-log hints for checksumming?

> Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?
        regards, tom lane

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 13:36:47

On 14.06.2013 16:21, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> On 14.06.2013 16:08, Tom Lane wrote:
>>> Refresh my memory as to why we need to WAL-log hints for checksumming?
>
>> Torn pages:
>
> So it's not that we actually need to log the individual hint bit
> changes, it's that we need to WAL-log a full page image on the first
> update after a checkpoint, so as to recover from torn-page cases.
> Which one are we doing?

Correct. We're doing the latter, see XLogSaveBufferForHint().

- Heikki

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

14 June 2013, 13:39:17

On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 14.06.2013 16:08, Tom Lane wrote:
> >> Refresh my memory as to why we need to WAL-log hints for checksumming?
> 
> > Torn pages:
> 
> So it's not that we actually need to log the individual hint bit
> changes, it's that we need to WAL-log a full page image on the first
> update after a checkpoint, so as to recover from torn-page cases.
> Which one are we doing?

MarkBufferDirtyHint() loggs an FPI (just not via a BKP block) via
XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by
comparing GetRedoRecPtr() with the page's lsn.
Otherwise we don't do anything besides marking the buffer dirty.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 13:58:46

On 14.06.2013 16:15, Andres Freund wrote:
> On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
>> I just had my nose in the part of the checksum patch that tediously
>> copies entire pages out of shared buffers to avoid possible instability
>> of the hint bits while we checksum and write the page.
>
> I am really rather uncomfortable with that piece of code, and I hacked
> it up after Jeff Janes had reported a bug there (The one aborting WAL
> replay to early...). So I am very happy that you are looking at it.

Hmm. In XLogSaveBufferForHint():

>  * Note that this only works for buffers that fit the standard page model,
>  * i.e. those for which buffer_std == true

The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). 
Isn't that completely broken for the FSM? If I'm reading it correctly, 
what will happen is that replay will completely zero out all FSM pages 
that have been touched. All the FSM data is between pd_lower and 
pd_upper, which on standard pages is the "hole".

- Heikki

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

14 June 2013, 14:01:17

On 2013-06-14 09:21:52 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 14.06.2013 16:08, Tom Lane wrote:
> >> Refresh my memory as to why we need to WAL-log hints for checksumming?
> 
> > Torn pages:
> 
> So it's not that we actually need to log the individual hint bit
> changes, it's that we need to WAL-log a full page image on the first
> update after a checkpoint, so as to recover from torn-page cases.
> Which one are we doing?

From quickly looking at the code again I think the MarkBufferDirtyHint()
code makes at least one assumption that isn't correct in the fact of
checksums.

It tests for the need to dirty the page with:if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=    (BM_DIRTY |
BM_JUST_DIRTIED))

*before* taking a lock. A comment explains why that is safe:
        * Since we make this test unlocked, there's a chance we * might fail to notice that the flags have just been
cleared,and failed * to reset them, due to memory-ordering issues.

That's fine for the classical usecase without checksums but what about
the following scenario:

1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the the page, clearing the dirty  flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the  page isn't dirty anymore and thus doesn't check
whethersomething  needs to get logged.

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?

Since

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

14 June 2013, 14:05:43

On 14.06.2013 17:01, Andres Freund wrote:
> At this point we have a page that has been modified without an FPI. But
> it's not marked dirty, so it won't be written out without further
> cause. Which might be fine since there's no cause to write out the page
> and there probably won't be anyone doing that without logging an FPI
> independently.
> Can anybody see a scenario where this is actually dangerous?

The code also relies on that being safe during recovery:

>              * If we're in recovery we cannot dirty a page because of a hint.
>              * We can set the hint, just not dirty the page as a result so the
>              * hint is lost when we evict the page or shutdown.
>              *
>              * See src/backend/storage/page/README for longer discussion.
>              */
>             if (RecoveryInProgress())
>                 return;

I can't immediately see a problem with that.

- Heikki

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

14 June 2013, 14:10:21

On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote:
> On 14.06.2013 16:15, Andres Freund wrote:
> >On 2013-06-14 09:08:15 -0400, Tom Lane wrote:
> >>I just had my nose in the part of the checksum patch that tediously
> >>copies entire pages out of shared buffers to avoid possible instability
> >>of the hint bits while we checksum and write the page.
> >
> >I am really rather uncomfortable with that piece of code, and I hacked
> >it up after Jeff Janes had reported a bug there (The one aborting WAL
> >replay to early...). So I am very happy that you are looking at it.
> 
> Hmm. In XLogSaveBufferForHint():
> 
> > * Note that this only works for buffers that fit the standard page model,
> > * i.e. those for which buffer_std == true
> 
> The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't
> that completely broken for the FSM? If I'm reading it correctly, what will
> happen is that replay will completely zero out all FSM pages that have been
> touched. All the FSM data is between pd_lower and pd_upper, which on
> standard pages is the "hole".

Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track committing that, but rereading the thread it doesn't look
that way.

Jeff, care to update that patch?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Greg Stark

Date:

14 June 2013, 14:40:17

On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So it's not that we actually need to log the individual hint bit
> changes, it's that we need to WAL-log a full page image on the first
> update after a checkpoint, so as to recover from torn-page cases.
> Which one are we doing?

Wal logging a full page image after a checkpoint wouldn't actually be
enough since subsequent hint bits will dirty the page and not wal log
anything creating a new torn page risk. FPI are only useful if all the
subsequent updates are wal logged.

-- 
greg

Re: Patch for fail-back without fresh backup

From

Tom Lane

Date:

14 June 2013, 14:41:40

Greg Stark <stark@mit.edu> writes:
> On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it's not that we actually need to log the individual hint bit
>> changes, it's that we need to WAL-log a full page image on the first
>> update after a checkpoint, so as to recover from torn-page cases.
>> Which one are we doing?

> Wal logging a full page image after a checkpoint wouldn't actually be
> enough since subsequent hint bits will dirty the page and not wal log
> anything creating a new torn page risk. FPI are only useful if all the
> subsequent updates are wal logged.

No, there's no new torn page risk, because any crash recovery would
replay starting from the checkpoint.  You might lose the
subsequently-set hint bits, but that's okay.
        regards, tom lane

Re: Patch for fail-back without fresh backup

From

Jeff Davis

Date:

14 June 2013, 16:21:09

On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
> Jeff Davis has a patch pending
> (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
> flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
> were on track committing that, but rereading the thread it doesn't look
> that way.
>
> Jeff, care to update that patch?

Rebased and attached. Changed so all callers use buffer_std=true except
those in freespace.c and fsmpage.c.

Simon, did you (or anyone else) have an objection to this patch? If not,
I'll go ahead and commit it tomorrow morning.

Regards,
    Jeff Davis

Attachment

buffer-std-20130614.patch

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

14 June 2013, 16:27:08

On 2013-06-14 09:21:12 -0700, Jeff Davis wrote:
> On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
> > Jeff Davis has a patch pending
> > (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
> > flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
> > were on track committing that, but rereading the thread it doesn't look
> > that way.
> > 
> > Jeff, care to update that patch?
> 
> Rebased and attached. Changed so all callers use buffer_std=true except
> those in freespace.c and fsmpage.c.
> 
> Simon, did you (or anyone else) have an objection to this patch? If not,
> I'll go ahead and commit it tomorrow morning.

I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
that mentions that they are safe in a non std buffer due to
XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
change of the copy algorithm there.

Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
or something like that. I find the former name confusing...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

15 June 2013, 07:49:49

On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
>> Hello,
>
>> We have already started a discussion on pgsql-hackers for the problem of
> taking fresh backup during the failback operation here is the link for that:
>
>>
> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
> JgWrFu513s+Q@mail.gmail.com
>
>> Let me again summarize the problem we are trying to address.
>
>> When the master fails, last few WAL files may not reach the standby. But
> the master may have gone ahead and made changes to its local file system
> after > flushing WAL to the local storage.  So master contains some file
> system level changes that standby does not have.  At this point, the data
> directory of > master is ahead of standby's data directory.
>> Subsequently, the standby will be promoted as new master.  Later when the
> old master wants to be a standby of the new master, it can't just join the
>> setup since there is inconsistency in between these two servers. We need
> to take the fresh backup from the new master.  This can happen in both the
>> synchronous as well as asynchronous replication.
>
>> Fresh backup is also needed in case of clean switch-over because in the
> current HEAD, the master does not wait for the standby to receive all the
> WAL
>> up to the shutdown checkpoint record before shutting down the connection.
> Fujii Masao has already submitted a patch to handle clean switch-over case,
>> but the problem is still remaining for failback case.
>
>> The process of taking fresh backup is very time consuming when databases
> are of very big sizes, say several TB's, and when the servers are connected
>> over a relatively slower link.  This would break the service level
> agreement of disaster recovery system.  So there is need to improve the
> process of
>> disaster recovery in PostgreSQL.  One way to achieve this is to maintain
> consistency between master and standby which helps to avoid need of fresh
>> backup.
>
>> So our proposal on this problem is that we must ensure that master should
> not make any file system level changes without confirming that the
>> corresponding WAL record is replicated to the standby.
>
>   How will you take care of extra WAL on old master during recovery. If it
> plays the WAL which has not reached new-master, it can be a problem.

you means that there is possible that old master's data ahead of new
master's data.
so there is inconsistent data between those server when fail back. right?
if so , there is not possible inconsistent. because if you use GUC option
as his propose (i.g., failback_safe_standby_mode = remote_flush),
when old master is working fine, all file system level changes aren't
done  before WAL replicated.

--
Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit kapila

Date:

15 June 2013, 13:34:55

On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
>> Hello,
>
>>> We have already started a discussion on pgsql-hackers for the problem of
>>> taking fresh backup during the failback operation here is the link for that:
>>>
>>>
>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
>>> JgWrFu513s+Q@mail.gmail.com
>>>
>>> Let me again summarize the problem we are trying to address.
>>
>>
>>   How will you take care of extra WAL on old master during recovery. If it
>> plays the WAL which has not reached new-master, it can be a problem.

> you means that there is possible that old master's data ahead of new
> master's data.
 I mean to say is that WAL of old master can be ahead of new master. I understood that data files of old master can't
beahead, but I think WAL can be ahead. 

> so there is inconsistent data between those server when fail back. right?
> if so , there is not possible inconsistent. because if you use GUC option
> as his propose (i.g., failback_safe_standby_mode = remote_flush),
> when old master is working fine, all file system level changes aren't
> done  before WAL replicated.

Would the propose patch will take care that old master's WAL is also not ahead in some way?
If yes, I think i am missing some point.

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

15 June 2013, 14:59:06

On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
>
> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
>>> Hello,
>>
>>>> We have already started a discussion on pgsql-hackers for the problem of
>>>> taking fresh backup during the failback operation here is the link for that:
>>>>
>>>>
>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
>>>> JgWrFu513s+Q@mail.gmail.com
>>>>
>>>> Let me again summarize the problem we are trying to address.
>>>
>>>
>>>   How will you take care of extra WAL on old master during recovery. If it
>>> plays the WAL which has not reached new-master, it can be a problem.
>
>> you means that there is possible that old master's data ahead of new
>> master's data.
>
>   I mean to say is that WAL of old master can be ahead of new master. I understood that
>   data files of old master can't be ahead, but I think WAL can be ahead.
>
>> so there is inconsistent data between those server when fail back. right?
>> if so , there is not possible inconsistent. because if you use GUC option
>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
>> when old master is working fine, all file system level changes aren't
>> done  before WAL replicated.
>
> Would the propose patch will take care that old master's WAL is also not ahead in some way?
> If yes, I think i am missing some point.

yes it will happen that old master's WAL ahead of new master's WAL as you said.
but I think that we can solve them by delete all WAL file when old
master starts as new standby.thought?

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Jeff Davis

Date:

15 June 2013, 18:36:56

On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote:
> I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
> that mentions that they are safe in a non std buffer due to
> XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
> change of the copy algorithm there.

Done.

> Btw, if you touch that code, I'd vote for renaming XLOG_HINT to XLOG_FPI
> or something like that. I find the former name confusing...

Also done.

Patch attached. Also, since we branched, I think this should be
back-patched to 9.3 as well.

Regards,
    Jeff Davis

Attachment

buffer-std-20130615.patch

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

15 June 2013, 21:30:53

On 2013-06-15 11:36:54 -0700, Jeff Davis wrote:
> On Fri, 2013-06-14 at 18:27 +0200, Andres Freund wrote:
> > I'd like to see a comment around the memcpys in XLogSaveBufferForHint()
> > that mentions that they are safe in a non std buffer due to
> > XLogCheckBuffer setting an appropriate hole/offset. Or make an explicit
> > change of the copy algorithm there.
> 
> Done.
> Also done.

Thanks! Looks good to me.

> Patch attached. Also, since we branched, I think this should be
> back-patched to 9.3 as well.

Absolutely.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Amit kapila

Date:

16 June 2013, 05:00:52

On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
>
> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
>>> Hello,
>>
>>>>> We have already started a discussion on pgsql-hackers for the problem of
>>>>> taking fresh backup during the failback operation here is the link for that:
>>>>>
>>>>>
>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
>>>>> JgWrFu513s+Q@mail.gmail.com
>>>>>
>>>>> Let me again summarize the problem we are trying to address.
>>>>>
>>>>>
>>>>   How will you take care of extra WAL on old master during recovery. If it
>>>> plays the WAL which has not reached new-master, it can be a problem.
>
>>> you means that there is possible that old master's data ahead of new
>>> master's data.
>
>>   I mean to say is that WAL of old master can be ahead of new master. I understood that
>>   data files of old master can't be ahead, but I think WAL can be ahead.
>
>>> so there is inconsistent data between those server when fail back. right?
>>> if so , there is not possible inconsistent. because if you use GUC option
>>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
>>> when old master is working fine, all file system level changes aren't
>>> done  before WAL replicated.
>
>> Would the propose patch will take care that old master's WAL is also not ahead in some way?
>> If yes, I think i am missing some point.

> yes it will happen that old master's WAL ahead of new master's WAL as you said.
> but I think that we can solve them by delete all WAL file when old
> master starts as new standby.

I think ideally, it should reset WAL location at the point where new master has forrked off.
In such a scenario it would be difficult for user who wants to get a dump of some data in
old master which hasn't gone to new master. I am not sure if such a need is there for real users, but if it
is there, then providing this solution will have some drawbacks.

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Simon Riggs

Date:

16 June 2013, 09:55:33

On 14 June 2013 17:21, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-06-14 at 16:10 +0200, Andres Freund wrote:
>> Jeff Davis has a patch pending
>> (1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
>> flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
>> were on track committing that, but rereading the thread it doesn't look
>> that way.
>>
>> Jeff, care to update that patch?
>
> Rebased and attached. Changed so all callers use buffer_std=true except
> those in freespace.c and fsmpage.c.
>
> Simon, did you (or anyone else) have an objection to this patch? If not,
> I'll go ahead and commit it tomorrow morning.

I didn't have a specific objection to the patch, I just wanted to
minimise change relating to this so we didn't introduce further bugs.

I've no objection to you committing that.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Simon Riggs

Date:

16 June 2013, 11:41:10

On 14 June 2013 10:11, Samrat Revagade <revagade.samrat@gmail.com> wrote:

> We have already started a discussion on pgsql-hackers for the problem of
> taking fresh backup during the failback operation here is the link for that:
>
> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

> So our proposal on this problem is that we must ensure that master should
> not make any file system level changes without confirming that the
> corresponding WAL record is replicated to the standby.

> 1. The main objection was raised by Tom and others is that we should not add
> this feature and should go with traditional way of taking fresh backup using
> the rsync, because he was concerned about the additional complexity of the
> patch and the performance overhead during normal operations.
>
> 2. Tom and others were also worried about the inconsistencies in the crashed
> master and suggested that its better to start with a fresh backup. Fujii
> Masao and others correctly countered that suggesting that we trust WAL
> recovery to clear all such inconsistencies and there is no reason why we
> can't do the same here.

> So the patch is showing 1-2% performance overhead.

Let's have a look at this...

The objections you summarise that Tom has made are ones that I agree
with. I also don't think that Fujii "correctly countered" those
objections.

My perspective is that if the master crashed, assuming that you know
everything about that and suddenly jumping back on seem like a recipe
for disaster. Attempting that is currently blocked by the technical
obstacles you've identified, but that doesn't mean they are the only
ones - we don't yet understand what all the problems lurking might be.
Personally, I won't be following you onto that minefield anytime soon.

So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).

But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).

The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.

I'm worried to see that adding this feature and yet turning it off
causes a measureable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.

I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits - measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.

I'm tempted by the thought that we should put the WaitForLSN inside
XLogFlush, rather than scatter additional calls everywhere and then
have us inevitably miss one.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

16 June 2013, 16:25:44

On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).

But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).

I agree with you about the fact that,

Now a days the need of fresh backup in crash recovery seems to be a major problem.

we might need to change the name of patch if there other problems too with crash recovery.

The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.

The different set of parameters are needed to differentiate between fail-safe standby and synchronous standby, the fail-safe standby and standby in synchronous replication can be two different servers.

I'm worried to see that adding this feature and yet turning it off
causes a measureable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.

I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits - measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.

I will test the other scenarios and post the results.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Regards,

Samrat Revgade

Re: Patch for fail-back without fresh backup

From

Simon Riggs

Date:

16 June 2013, 17:38:17

On 16 June 2013 17:25, Samrat Revagade <revagade.samrat@gmail.com> wrote:
>
>
> On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>
>>
>> So I strongly object to calling this patch anything to do with
>> "failback safe". You simply don't have enough data to make such a bold
>> claim. (Which is why we call it synchronous replication and not "zero
>> data loss", for example).
>>
>> But that's not the whole story. I can see some utility in a patch that
>> makes all WAL transfer synchronous, rather than just commits. Some
>> name like synchronous_transfer might be appropriate. e.g.
>> synchronous_transfer = all | commit (default).
>>
>
> I agree with you about the fact that,
> Now a days the need of fresh backup in crash recovery  seems to be a major
> problem.
> we might need to change the name of patch if there other problems too with
> crash recovery.

(Sorry don't understand)

>> The idea of another slew of parameters that are very similar to
>> synchronous replication but yet somehow different seems weird. I can't
>> see a reason why we'd want a second lot of parameters. Why not just
>> use the existing ones for sync rep? (I'm surprised the Parameter
>> Police haven't visited you in the night...) Sure, we might want to
>> expand the design for how we specify multi-node sync rep, but that is
>> a different patch.
>>
>
> The different set of parameters are needed to differentiate between
> fail-safe standby and synchronous standby, the fail-safe standby and standby
> in synchronous replication can be two different servers.

Why would they be different? What possible reason would you have for
that config? There is no *need* for those parameters, the proposal
could work perfectly well without them.

Let's make this patch fulfill the stated objectives, not add in
optional extras, especially ones that don't appear well thought
through. If you wish to enhance the design for the specification of
multi-node sync rep, make that a separate patch, later.

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

17 June 2013, 06:32:19

On Sun, Jun 16, 2013 at 11:08 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 16 June 2013 17:25, Samrat Revagade <revagade.samrat@gmail.com> wrote:
>
>
> On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>
>>
>> So I strongly object to calling this patch anything to do with
>> "failback safe". You simply don't have enough data to make such a bold
>> claim. (Which is why we call it synchronous replication and not "zero
>> data loss", for example).
>>
>> But that's not the whole story. I can see some utility in a patch that
>> makes all WAL transfer synchronous, rather than just commits. Some
>> name like synchronous_transfer might be appropriate. e.g.
>> synchronous_transfer = all | commit (default).
>>
>
> I agree with you about the fact that,
> Now a days the need of fresh backup in crash recovery seems to be a major
> problem.
> we might need to change the name of patch if there other problems too with
> crash recovery.

(Sorry don't understand)

Sorry for the confusion. I will change name of a patch.

>> The idea of another slew of parameters that are very similar to
>> synchronous replication but yet somehow different seems weird. I can't
>> see a reason why we'd want a second lot of parameters. Why not just
>> use the existing ones for sync rep? (I'm surprised the Parameter
>> Police haven't visited you in the night...) Sure, we might want to
>> expand the design for how we specify multi-node sync rep, but that is
>> a different patch.
>>
>
> The different set of parameters are needed to differentiate between
> fail-safe standby and synchronous standby, the fail-safe standby and standby
> in synchronous replication can be two different servers.

Why would they be different? What possible reason would you have for
that config? There is no *need* for those parameters, the proposal
could work perfectly well without them.

Let's make this patch fulfill the stated objectives, not add in
optional extras, especially ones that don't appear well thought
through. If you wish to enhance the design for the specification of
multi-node sync rep, make that a separate patch, later.

I agree with you.I will remove the extra parameters if they are not required in next version of the patch.

Regards,

Samrat Revgade

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

17 June 2013, 08:03:28

On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

My perspective is that if the master crashed, assuming that you know
everything about that and suddenly jumping back on seem like a recipe
for disaster. Attempting that is currently blocked by the technical
obstacles you've identified, but that doesn't mean they are the only
ones - we don't yet understand what all the problems lurking might be.
Personally, I won't be following you onto that minefield anytime soon.

Would it be fair to say that a user will be willing to trust her crashed master in all scenarios where she would have done so in a single instance setup ? IOW without the replication setup, AFAIU users have traditionally trusted the WAL recovery to recover from failed instances. This would include some common failures such as power outages and hardware failures, but may not include others such as on disk corruption.

So I strongly object to calling this patch anything to do with
"failback safe". You simply don't have enough data to make such a bold
claim. (Which is why we call it synchronous replication and not "zero
data loss", for example).

I agree. We should probably find a better name for this. Any suggestions ?

But that's not the whole story. I can see some utility in a patch that
makes all WAL transfer synchronous, rather than just commits. Some
name like synchronous_transfer might be appropriate. e.g.
synchronous_transfer = all | commit (default).

Its an interesting idea, but I think there is some difference here. For example, the proposed feature allows a backend to wait at other points but not commit. Since commits are more foreground in nature and this feature does not require us to wait during common foreground activities, we want a configuration where master can wait for synchronous transfers at other than commits. May we can solve that by having more granular control to the said parameter ?

The idea of another slew of parameters that are very similar to
synchronous replication but yet somehow different seems weird. I can't
see a reason why we'd want a second lot of parameters. Why not just
use the existing ones for sync rep? (I'm surprised the Parameter
Police haven't visited you in the night...) Sure, we might want to
expand the design for how we specify multi-node sync rep, but that is
a different patch.

How would we then distinguish between synchronous and the new kind of standby ? I am told, one of the very popular setups for DR is to have one local sync standby and one async (may be cascaded by the local sync). Since this new feature is more useful for DR because taking a fresh backup on a slower link is even more challenging, IMHO we should support such setups.

I'm worried to see that adding this feature and yet turning it off
causes a measureable drop in performance. I don't think we want that
at all. That clearly needs more work and thought.

I agree. We need to repeat those tests. I don't trust that turning the feature is causing 1-2% drop. In one of the tests, I see turning the feature on is showing better number compared to when its turn off. That's clearly noise or need concrete argument to convince that way.

I also think your performance results are somewhat bogus. Fast
transaction workloads were already mostly commit waits -

But not in case of async standby, right ?

measurements
of what happens to large loads, index builds etc would likely reveal
something quite different.

I agree. I also feel we need tests where the FlushBuffer gets called more often by the normal backends to see how much added wait in that code path causes performance drops. Another important thing to test would be to see how it works on a slower/high latency links.

I'm tempted by the thought that we should put the WaitForLSN inside
XLogFlush, rather than scatter additional calls everywhere and then
have us inevitably miss one.

That indeed seems cleaner.

Thanks,

Pavan

Re: Patch for fail-back without fresh backup

From

Simon Riggs

Date:

17 June 2013, 11:49:03

On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

> I agree. We should probably find a better name for this. Any suggestions ?

err, I already made one...

>> But that's not the whole story. I can see some utility in a patch that
>> makes all WAL transfer synchronous, rather than just commits. Some
>> name like synchronous_transfer might be appropriate. e.g.
>> synchronous_transfer = all | commit (default).

> Since commits are more foreground in nature and this feature
> does not require us to wait during common foreground activities, we want a
> configuration where master can wait for synchronous transfers at other than
> commits. May we can solve that by having more granular control to the said
> parameter ?
>
>>
>> The idea of another slew of parameters that are very similar to
>> synchronous replication but yet somehow different seems weird. I can't
>> see a reason why we'd want a second lot of parameters. Why not just
>> use the existing ones for sync rep? (I'm surprised the Parameter
>> Police haven't visited you in the night...) Sure, we might want to
>> expand the design for how we specify multi-node sync rep, but that is
>> a different patch.
>
>
> How would we then distinguish between synchronous and the new kind of
> standby ?

That's not the point. The point is "Why would we have a new kind of
standby?" and therefore why do we need new parameters?

> I am told, one of the very popular setups for DR is to have one
> local sync standby and one async (may be cascaded by the local sync). Since
> this new feature is more useful for DR because taking a fresh backup on a
> slower link is even more challenging, IMHO we should support such setups.

...which still doesn't make sense to me. Lets look at that in detail.

Take 3 servers, A, B, C with A and B being linked by sync rep, and C
being safety standby at a distance.

Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to failover to
B. Once B is the target, you want to failback to A as the master. C
needs to follow the new master, whichever it is.

If you set up sync rep between A and B and this new mode between A and
C. When B becomes the master, you need to failback from B from A, but
you can't because the new mode applied between A and C only, so you
have to failback from C to A. So having the new mode not match with
sync rep means you are forcing people to failback using the slow link
in the common case.

You might observe that having the two modes match causes problems if A
and B fail, so you are forced to go to C as master and then eventually
failback to A or B across a slow link. That case is less common and
could be solved by extending sync transfer to more/multi nodes.

It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.

Phew

--Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

17 June 2013, 18:48:30

On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila <amit.kapila@huawei.com> wrote:
>>
>> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
>> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
>>>> Hello,
>>>
>>>>>> We have already started a discussion on pgsql-hackers for the problem of
>>>>>> taking fresh backup during the failback operation here is the link for that:
>>>>>>
>>>>>>
>>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
>>>>>> JgWrFu513s+Q@mail.gmail.com
>>>>>>
>>>>>> Let me again summarize the problem we are trying to address.
>>>>>>
>>>>>>
>>>>>   How will you take care of extra WAL on old master during recovery. If it
>>>>> plays the WAL which has not reached new-master, it can be a problem.
>>
>>>> you means that there is possible that old master's data ahead of new
>>>> master's data.
>>
>>>   I mean to say is that WAL of old master can be ahead of new master. I understood that
>>>   data files of old master can't be ahead, but I think WAL can be ahead.
>>
>>>> so there is inconsistent data between those server when fail back. right?
>>>> if so , there is not possible inconsistent. because if you use GUC option
>>>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
>>>> when old master is working fine, all file system level changes aren't
>>>> done  before WAL replicated.
>>
>>> Would the propose patch will take care that old master's WAL is also not ahead in some way?
>>> If yes, I think i am missing some point.
>
>> yes it will happen that old master's WAL ahead of new master's WAL as you said.
>> but I think that we can solve them by delete all WAL file when old
>> master starts as new standby.
>
> I think ideally, it should reset WAL location at the point where new master has forrked off.
> In such a scenario it would be difficult for user who wants to get a dump of some data in
> old master which hasn't gone to new master. I am not sure if such a need is there for real users, but if it
> is there, then providing this solution will have some drawbacks.
I think that we can dumping data before all WAL files deleting.  All
WAL files deleting is done when old master starts as new standby.


Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

18 June 2013, 04:52:59

On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com>
> wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila
> <amit.kapila@huawei.com> wrote:
> >>
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila
> <amit.kapila@huawei.com> wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> >>>> Hello,
> >>>
> >>>>>> We have already started a discussion on pgsql-hackers for the
> problem of
> >>>>>> taking fresh backup during the failback operation here is the
> link for that:
> >>>>>>
> >>>>>>
> >>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-
> 6OzRaew5pWhk7yQtb
> >>>>>> JgWrFu513s+Q@mail.gmail.com
> >>>>>>
> >>>>>> Let me again summarize the problem we are trying to address.
> >>>>>>
> >>>>>>
> >>>>>   How will you take care of extra WAL on old master during
> recovery. If it
> >>>>> plays the WAL which has not reached new-master, it can be a
> problem.
> >>
> >>>> you means that there is possible that old master's data ahead of
> new
> >>>> master's data.
> >>
> >>>   I mean to say is that WAL of old master can be ahead of new
> master. I understood that
> >>>   data files of old master can't be ahead, but I think WAL can be
> ahead.
> >>
> >>>> so there is inconsistent data between those server when fail back.
> right?
> >>>> if so , there is not possible inconsistent. because if you use GUC
> option
> >>>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
> >>>> when old master is working fine, all file system level changes
> aren't
> >>>> done  before WAL replicated.
> >>
> >>> Would the propose patch will take care that old master's WAL is
> also not ahead in some way?
> >>> If yes, I think i am missing some point.
> >
> >> yes it will happen that old master's WAL ahead of new master's WAL
> as you said.
> >> but I think that we can solve them by delete all WAL file when old
> >> master starts as new standby.
> >
> > I think ideally, it should reset WAL location at the point where new
> master has forrked off.
> > In such a scenario it would be difficult for user who wants to get a
> dump of some data in
> > old master which hasn't gone to new master. I am not sure if such a
> need is there for real users, but if it
> > is there, then providing this solution will have some drawbacks.

> I think that we can dumping data before all WAL files deleting.  All
> WAL files deleting is done when old master starts as new standby.
 Can we dump data without starting server?

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

18 June 2013, 14:11:20

On Tuesday, June 18, 2013, Amit Kapila wrote:

On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com>
> wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila
> <amit.kapila@huawei.com> wrote:
> >>
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila
> <amit.kapila@huawei.com> wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> >>>> Hello,
> >>>
> >>>>>> We have already started a discussion on pgsql-hackers for the
> problem of
> >>>>>> taking fresh backup during the failback operation here is the
> link for that:
> >>>>>>
> >>>>>>
> >>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-
> 6OzRaew5pWhk7yQtb
> >>>>>> JgWrFu513s+Q@mail.gmail.com
> >>>>>>
> >>>>>> Let me again summarize the problem we are trying to address.
> >>>>>>
> >>>>>>
> >>>>> How will you take care of extra WAL on old master during
> recovery. If it
> >>>>> plays the WAL which has not reached new-master, it can be a
> problem.
> >>
> >>>> you means that there is possible that old master's data ahead of
> new
> >>>> master's data.
> >>
> >>> I mean to say is that WAL of old master can be ahead of new
> master. I understood that
> >>> data files of old master can't be ahead, but I think WAL can be
> ahead.
> >>
> >>>> so there is inconsistent data between those server when fail back.
> right?
> >>>> if so , there is not possible inconsistent. because if you use GUC
> option
> >>>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
> >>>> when old master is working fine, all file system level changes
> aren't
> >>>> done before WAL replicated.
> >>
> >>> Would the propose patch will take care that old master's WAL is
> also not ahead in some way?
> >>> If yes, I think i am missing some point.
> >
> >> yes it will happen that old master's WAL ahead of new master's WAL
> as you said.
> >> but I think that we can solve them by delete all WAL file when old
> >> master starts as new standby.
> >
> > I think ideally, it should reset WAL location at the point where new
> master has forrked off.
> > In such a scenario it would be difficult for user who wants to get a
> dump of some data in
> > old master which hasn't gone to new master. I am not sure if such a
> need is there for real users, but if it
> > is there, then providing this solution will have some drawbacks.

> I think that we can dumping data before all WAL files deleting. All
> WAL files deleting is done when old master starts as new standby.

Can we dump data without starting server?

Sorry I made a mistake. We can't it.

this proposing patch need to be able to also handle such scenario in future.

Regards,

---

Sawada Masahiko

--
Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

19 June 2013, 17:14:44

On Tuesday, June 18, 2013, Amit Kapila wrote:

On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com>
> wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila
> <amit.kapila@huawei.com> wrote:
> >>
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila
> <amit.kapila@huawei.com> wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> >>>> Hello,
> >>>
> >>>>>> We have already started a discussion on pgsql-hackers for the
> problem of
> >>>>>> taking fresh backup during the failback operation here is the
> link for that:
> >>>>>>
> >>>>>>
> >>>>>> http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-
> 6OzRaew5pWhk7yQtb
> >>>>>> JgWrFu513s+Q@mail.gmail.com
> >>>>>>
> >>>>>> Let me again summarize the problem we are trying to address.
> >>>>>>
> >>>>>>
> >>>>> How will you take care of extra WAL on old master during
> recovery. If it
> >>>>> plays the WAL which has not reached new-master, it can be a
> problem.
> >>
> >>>> you means that there is possible that old master's data ahead of
> new
> >>>> master's data.
> >>
> >>> I mean to say is that WAL of old master can be ahead of new
> master. I understood that
> >>> data files of old master can't be ahead, but I think WAL can be
> ahead.
> >>
> >>>> so there is inconsistent data between those server when fail back.
> right?
> >>>> if so , there is not possible inconsistent. because if you use GUC
> option
> >>>> as his propose (i.g., failback_safe_standby_mode = remote_flush),
> >>>> when old master is working fine, all file system level changes
> aren't
> >>>> done before WAL replicated.
> >>
> >>> Would the propose patch will take care that old master's WAL is
> also not ahead in some way?
> >>> If yes, I think i am missing some point.
> >
> >> yes it will happen that old master's WAL ahead of new master's WAL
> as you said.
> >> but I think that we can solve them by delete all WAL file when old
> >> master starts as new standby.
> >
> > I think ideally, it should reset WAL location at the point where new
> master has forrked off.
> > In such a scenario it would be difficult for user who wants to get a
> dump of some data in
> > old master which hasn't gone to new master. I am not sure if such a
> need is there for real users, but if it
> > is there, then providing this solution will have some drawbacks.

> I think that we can dumping data before all WAL files deleting. All
> WAL files deleting is done when old master starts as new standby.

Can we dump data without starting server?

Sorry I made a mistake. We can't it.

this proposing patch need to be able to also handle such scenario in future.

Regards,

---

Sawada Masahiko

--
Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

20 June 2013, 06:33:01

On Wednesday, June 19, 2013 10:45 PM Sawada Masahiko wrote:
On Tuesday, June 18, 2013, Amit Kapila wrote:
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
> On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila <amit.kapila@huawei.com>
> wrote:
> > On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
> > On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila
> <amit.kapila@huawei.com> wrote:
> >>
> >> On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
> >> On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila
> <amit.kapila@huawei.com> wrote:
> >>> On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
> >>>> Hello,
> >>>

>>> I think that we can dumping data before all WAL files deleting.  All
>>> WAL files deleting is done when old master starts as new standby.

>>  Can we dump data without starting server?
>Sorry I made a mistake. We can't it.
> this proposing patch need to be able to also handle such scenario in future.

I am not sure the purposed patch can handle it so easily, but I think if others also felt it important, then a method
shouldbe a  
provided to user for extracting his last committed data.


With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

24 June 2013, 13:47:07

On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>> I agree. We should probably find a better name for this. Any suggestions ?
>
> err, I already made one...
>
>>> But that's not the whole story. I can see some utility in a patch that
>>> makes all WAL transfer synchronous, rather than just commits. Some
>>> name like synchronous_transfer might be appropriate. e.g.
>>> synchronous_transfer = all | commit (default).
>
>> Since commits are more foreground in nature and this feature
>> does not require us to wait during common foreground activities, we want a
>> configuration where master can wait for synchronous transfers at other than
>> commits. May we can solve that by having more granular control to the said
>> parameter ?
>>
>>>
>>> The idea of another slew of parameters that are very similar to
>>> synchronous replication but yet somehow different seems weird. I can't
>>> see a reason why we'd want a second lot of parameters. Why not just
>>> use the existing ones for sync rep? (I'm surprised the Parameter
>>> Police haven't visited you in the night...) Sure, we might want to
>>> expand the design for how we specify multi-node sync rep, but that is
>>> a different patch.
>>
>>
>> How would we then distinguish between synchronous and the new kind of
>> standby ?
>
> That's not the point. The point is "Why would we have a new kind of
> standby?" and therefore why do we need new parameters?
>
>> I am told, one of the very popular setups for DR is to have one
>> local sync standby and one async (may be cascaded by the local sync). Since
>> this new feature is more useful for DR because taking a fresh backup on a
>> slower link is even more challenging, IMHO we should support such setups.
>
> ...which still doesn't make sense to me. Lets look at that in detail.
>
> Take 3 servers, A, B, C with A and B being linked by sync rep, and C
> being safety standby at a distance.
>
> Either A or B is master, except in disaster. So if A is master, then B
> would be the failover target. If A fails, then you want to failover to
> B. Once B is the target, you want to failback to A as the master. C
> needs to follow the new master, whichever it is.
>
> If you set up sync rep between A and B and this new mode between A and
> C. When B becomes the master, you need to failback from B from A, but
> you can't because the new mode applied between A and C only, so you
> have to failback from C to A. So having the new mode not match with
> sync rep means you are forcing people to failback using the slow link
> in the common case.
>
> You might observe that having the two modes match causes problems if A
> and B fail, so you are forced to go to C as master and then eventually
> failback to A or B across a slow link. That case is less common and
> could be solved by extending sync transfer to more/multi nodes.
>
> It definitely doesn't make sense to have sync rep on anything other
> than a subset of sync transfer. So while it may be sensible in the
> future to make sync transfer a superset of sync rep nodes, it makes
> sense to make them the same config for now.
>
when 2 servers being synchronous replication, those servers are in
same location in many cases. ( e.g., same server room)
so taking a full backup and sending it to old master is not issue.
this proposal works for situation which those servers are put in
remote location and when main site is powered down due to such as
power failure or natural disaster occurs.
as you said, we can control file (e.g., CLOG, pg_control, etc)
replicating by adding synchronous_transfer option.
but if to add only this parameter, we can handle only following 2 cases.

1. synchronous standby and make same as failback safe standby
2. asynchronous standby and make same as failback safe standby

in above case, adding new parameter might be meaningless. but I think
that we should handle case not only case 1,2  but also following case
3, 4 for DR.

3. synchronous standby and make different asynchronous failback safe standby
4. asynchronous standby and make different asynchronous failback safe standby

To handles following case 3 and 4, we should set parameter to each
standby. so we need to adding new parameter.
if we can structure replication in such situation, replication would
be more useful for user in slow link.

parameter improvement idea is which we extend ini file for to set
parameter each standby. For example :

--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------

there are discussions about such ini file in past. if so, we can set
each parameter to each standby.

please give me feedback.

Regards,
-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Langote

Date:

25 June 2013, 02:40:45

Hi,

> parameter improvement idea is which we extend ini file for to set
> parameter each standby. For example :
>
> --------------------
> [Server]
> standby_name = 'slave1'
> synchronous_transfer = commit
> wal_sender_timeout = 30
> [Server]
> standby_name = 'slave2'
> synchronous_transfer = all
> wal_sender_timeout = 50
> -------------------
>

Just ask to clarify:

Is 'slave2' a failback standby?
What does 'synchronous_transfer = all' mean? Does that mean wait
during both commit and checkpoint?



--
Amit Langote

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

25 June 2013, 03:19:37

On Mon, Jun 24, 2013 at 7:17 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

--------------------
[Server]
standby_name = 'slave1'
synchronous_transfer = commit
wal_sender_timeout = 30
[Server]
standby_name = 'slave2'
synchronous_transfer = all
wal_sender_timeout = 50
-------------------

What different values/modes you are thinking for synchronous_transfer ? IMHO only "commit" and "all" may not be enough. As I suggested upthread, we may need an additional mode, say "data", which will ensure synchronous WAL transfer before making any file system changes. We need this separate mode because the failback safe (or whatever we call it) standby need not wait on the commits and it's important to avoid that wait since it comes in a direct path of client transactions.

If we are doing it, I wonder if an additional mode "none" also makes sense so that users can also control asynchronous standbys via the same mechanism.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Amit Langote

Date:

25 June 2013, 04:53:17

Hi,

>
>> So our proposal on this problem is that we must ensure that master should
> not make any file system level changes without confirming that the
>> corresponding WAL record is replicated to the standby.
>
>   How will you take care of extra WAL on old master during recovery. If it
> plays the WAL which has not reached new-master, it can be a problem.
>

I am trying to understand how there would be extra WAL on old master
that it would replay and cause inconsistency. Consider how I am
picturing it and correct me if I am wrong.

1) Master crashes. So a failback standby becomes new master forking the WAL.
2) Old master is restarted as a standby (now with this patch, without
a new base backup).
3) It would try to replay all the WAL it has available and later
connect to the new master also following the timeline switch (the
switch might happen using archived WAL and timeline history file OR
the new switch-over-streaming-replication-connection as of 9.3,
right?)

* in (3), when the new standby/old master is replaying WAL, from where
is it picking the WAL? Does it first replay all the WAL in pg_xlog
before archive? Should we make it check for a timeline history file in
archive before it starts replaying any WAL?

* And, would the new master, before forking the WAL, replay all the
WAL that is necessary to come to state (of data directory) that the
old master was just before it crashed?

Am I missing something here?

--
Amit Langote

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

25 June 2013, 13:30:15

On Tue, Jun 25, 2013 at 12:19 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
>
> On Mon, Jun 24, 2013 at 7:17 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>>
>>
>> --------------------
>> [Server]
>> standby_name = 'slave1'
>> synchronous_transfer = commit
>> wal_sender_timeout = 30
>> [Server]
>> standby_name = 'slave2'
>> synchronous_transfer = all
>> wal_sender_timeout = 50
>> -------------------
>
>
> What different values/modes you are thinking for synchronous_transfer ? IMHO
> only "commit" and "all" may not be enough. As I suggested upthread, we may
> need an additional mode, say "data", which will ensure synchronous WAL
> transfer before making any file system changes. We need this separate mode
> because the failback safe (or whatever we call it) standby need not wait on
> the commits and it's important to avoid that wait since it comes in a direct
> path of client transactions.
>
> If we are doing it, I wonder if an additional mode "none" also makes sense
> so that users can also control asynchronous standbys via the same mechanism.

I made mistake how to use name of parameter between
synchronous_transfer and failback_safe_standby_mode.
it means that we control file system changes using
failback_safe_standby_mode. if failback_safe_standby_mode is set
'remote_flush', master server wait for flushing all data page in
standby server (e.g., CLOG, pg_control).
right?
for example:
------
[server]
standby_name = 'slave1'
failback_safe_standby_mode = remote_flush
wal_sender_timeout = 50
------

in this case, we should also set synchronous_commit and
synchronous_level to each standby server. that is, do we need to set
following 3 parameters for supporting case 3,4 as I said?
-synchronous_commit = on/off/local/remote_write
-failback_safe_standby_mode = off/remote_write/remote_flush
-synchronous_level = sync/async (this parameter means that standby
server is connected using which mode (sync/async) .)

please give me your feedback.

Regards,
-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

26 June 2013, 04:40:40

On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
> Hi,
>
> >
> >> So our proposal on this problem is that we must ensure that master
> should
> > not make any file system level changes without confirming that the
> >> corresponding WAL record is replicated to the standby.
> >
> >   How will you take care of extra WAL on old master during recovery.
> If it
> > plays the WAL which has not reached new-master, it can be a problem.
> >
>
> I am trying to understand how there would be extra WAL on old master
> that it would replay and cause inconsistency. Consider how I am
> picturing it and correct me if I am wrong.
>
> 1) Master crashes. So a failback standby becomes new master forking the
> WAL.
> 2) Old master is restarted as a standby (now with this patch, without
> a new base backup).
> 3) It would try to replay all the WAL it has available and later
> connect to the new master also following the timeline switch (the
> switch might happen using archived WAL and timeline history file OR
> the new switch-over-streaming-replication-connection as of 9.3,
> right?)
>
> * in (3), when the new standby/old master is replaying WAL, from where
> is it picking the WAL?  Yes, this is the point which can lead to inconsistency, new standby/old master  will replay
WALafter the last successful checkpoint, for which he get info from  control file. It is picking WAL from the location
whereit was logged when it was active (pg_xlog). 

> Does it first replay all the WAL in pg_xlog
> before archive? Should we make it check for a timeline history file in
> archive before it starts replaying any WAL?

I have really not thought what is best solution for problem.

> * And, would the new master, before forking the WAL, replay all the
> WAL that is necessary to come to state (of data directory) that the
> old master was just before it crashed?

I don't think new master has any correlation with old master's data directory,
Rather it will replay the WAL it has received/flushed before start acting as master.

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

27 June 2013, 14:03:08

On Mon, Jun 24, 2013 at 10:47 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> 1. synchronous standby and make same as failback safe standby
> 2. asynchronous standby and make same as failback safe standby
>
> in above case, adding new parameter might be meaningless. but I think
> that we should handle case not only case 1,2  but also following case
> 3, 4 for DR.
>
to support case 1 and 2, I'm thinking that following another 2 ideas.

------------------------
we add synchronous_transfer( commit/ data_flush/ all) . This GUC will
only affect the standbys mentioned in the list of
synchronous_standby_names.

1. If synchronous_transfer is set to commit, current synchronous
replication behavior is achieved
2. If synchronous_transfer is set to data_flush, the standbys named in
synchronous_standby_names will act as ASYNC failback safe standbys
3. If synchronous_transfer is set to all, the standbys named in
synchronous_standby_names will act as SYNC failback safe standbys

in this approach, 3 is confusing because we are actually setting up a
ASYNC standby by using the GUCs meant for sync standby setup.

-------------------------
we extend synchronous_commit so that it also accepts like 'all'. (
this approach dosen't provide 'synchronous_transfer' parameter)
'all' value means that master wait for not only replicated WAL but
also replicated data page (e.g., CLOG, pg_control). and master changes
the process by whether standby is connected as sync or async.

1. If synchronous_commit is set to 'all' and synchronous_standby_name
is set to standby name, the standbys named in
synchronous_standby_names will act as SYNC failback safe standby.
2. If synchronous_commit is set to 'all' and synchronous_standby_name
is NOT set to standby name, the standbys which is connecting to masterwill act as ASYNC failback safe standby.

one problem with not naming ASYNC standby explicitly is that the
master has no clue which standby to wait on.
If it chooses to wait on all async standbys for failback-safety that
can be quite detrimental, especially because async standbys can become
easily unreachable if they are on a slow link or at remote location.

please give me feedback.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Robert Haas

Date:

27 June 2013, 18:22:48

On Mon, Jun 17, 2013 at 7:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I am told, one of the very popular setups for DR is to have one
>> local sync standby and one async (may be cascaded by the local sync). Since
>> this new feature is more useful for DR because taking a fresh backup on a
>> slower link is even more challenging, IMHO we should support such setups.
>
> ...which still doesn't make sense to me. Lets look at that in detail.
>
> Take 3 servers, A, B, C with A and B being linked by sync rep, and C
> being safety standby at a distance.
>
> Either A or B is master, except in disaster. So if A is master, then B
> would be the failover target. If A fails, then you want to failover to
> B. Once B is the target, you want to failback to A as the master. C
> needs to follow the new master, whichever it is.
>
> If you set up sync rep between A and B and this new mode between A and
> C. When B becomes the master, you need to failback from B from A, but
> you can't because the new mode applied between A and C only, so you
> have to failback from C to A. So having the new mode not match with
> sync rep means you are forcing people to failback using the slow link
> in the common case.

It's true that in this scenario that doesn't really make sense, but I
still think they are separate properties.  You could certainly want
synchronous replication without this new property, if you like the
data-loss guarantees that sync rep provides but don't care about
failback.  You could also want this new property without synchronous
replication, if you don't need the data-loss guarantees that sync rep
provides but you do care about fast failback.  I admit it seems
unlikely that you would use both features but not target them at the
same machines, although maybe: perhaps you have a sync standby and an
async standby and want this new property with respect to both of them.

In my admittedly limited experience, the use case for a lot of this
technology is in the cloud.  The general strategy seems to be: at the
first sign of trouble, kill the offending instance and fail over.
This can result in failing over pretty frequently, and needing it to
be fast.  There may be no real hardware problem; indeed, the failover
may be precipitated by network conditions or overload of the physical
host backing the virtual machine or any number of other nonphysical
problems.  I can see this being useful in that environment, even for
async standbys.  People can apparently tolerate a brief interruption
while their primary gets killed off and connections are re-established
with the new master, but they need the failover to be fast.  The
problem with the status quo is that, even if the first failover is
fast, the second one isn't, because it has to wait behind rebuilding
the original master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

28 June 2013, 05:11:04

On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
>> Hi,
>>
>> >
>> >> So our proposal on this problem is that we must ensure that master
>> should
>> > not make any file system level changes without confirming that the
>> >> corresponding WAL record is replicated to the standby.
>> >
>> >   How will you take care of extra WAL on old master during recovery.
>> If it
>> > plays the WAL which has not reached new-master, it can be a problem.
>> >
>>
>> I am trying to understand how there would be extra WAL on old master
>> that it would replay and cause inconsistency. Consider how I am
>> picturing it and correct me if I am wrong.
>>
>> 1) Master crashes. So a failback standby becomes new master forking the
>> WAL.
>> 2) Old master is restarted as a standby (now with this patch, without
>> a new base backup).
>> 3) It would try to replay all the WAL it has available and later
>> connect to the new master also following the timeline switch (the
>> switch might happen using archived WAL and timeline history file OR
>> the new switch-over-streaming-replication-connection as of 9.3,
>> right?)
>>
>> * in (3), when the new standby/old master is replaying WAL, from where
>> is it picking the WAL?
>    Yes, this is the point which can lead to inconsistency, new standby/old master
>    will replay WAL after the last successful checkpoint, for which he get info from
>    control file. It is picking WAL from the location where it was logged when it was active (pg_xlog).
>
>> Does it first replay all the WAL in pg_xlog
>> before archive? Should we make it check for a timeline history file in
>> archive before it starts replaying any WAL?
>
> I have really not thought what is best solution for problem.
>
>> * And, would the new master, before forking the WAL, replay all the
>> WAL that is necessary to come to state (of data directory) that the
>> old master was just before it crashed?
>
> I don't think new master has any correlation with old master's data directory,
> Rather it will replay the WAL it has received/flushed before start acting as master.
when old master fail over, WAL which ahead of new master might be
broken data. so that when user want to dump from old master, there is
possible to fail dump.
it is just idea, we extend parameter which is used in recovery.conf
like 'follow_master_force'. this parameter accepts 'on' and 'off', is
effective only when standby_mode is set to on.

if both parameters 'follow_master_force' and 'standby_mode' is set to 'on',
1. when standby server starts and starts to recovery, standby server
skip to apply WAL which is in  pg_xlog, and request WAL from latest
checkpoint LSN to master server.
2. master server receives LSN which is standby server latest
checkpoint, and compare between LSN of standby and LSN of master
latest checkpoint. if those LSN match, master will send WAL from
latest checkpoint LSN. if not, master will inform standby that failed.
3. standby will fork WAL, and apply WAL which is sent from master continuity.

in this approach, user who want to dump from old master will set 'off'
to follow_master_force and standby_mode, and gets the dump of old
master after master started. OTOH, user who want to starts replication
force will set 'on' to both parameter.

please give me feedback.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

02 July 2013, 05:45:49

On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
> On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com>
> wrote:
> > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
> >> Hi,
> >>
> >> >
> >> >> So our proposal on this problem is that we must ensure that
> master
> >> should
> >> > not make any file system level changes without confirming that the
> >> >> corresponding WAL record is replicated to the standby.
> >> >
> >> >   How will you take care of extra WAL on old master during
> recovery.
> >> If it
> >> > plays the WAL which has not reached new-master, it can be a
> problem.
> >> >
> >>
> >> I am trying to understand how there would be extra WAL on old master
> >> that it would replay and cause inconsistency. Consider how I am
> >> picturing it and correct me if I am wrong.
> >>
> >> 1) Master crashes. So a failback standby becomes new master forking
> the
> >> WAL.
> >> 2) Old master is restarted as a standby (now with this patch,
> without
> >> a new base backup).
> >> 3) It would try to replay all the WAL it has available and later
> >> connect to the new master also following the timeline switch (the
> >> switch might happen using archived WAL and timeline history file OR
> >> the new switch-over-streaming-replication-connection as of 9.3,
> >> right?)
> >>
> >> * in (3), when the new standby/old master is replaying WAL, from
> where
> >> is it picking the WAL?
> >    Yes, this is the point which can lead to inconsistency, new
> standby/old master
> >    will replay WAL after the last successful checkpoint, for which he
> get info from
> >    control file. It is picking WAL from the location where it was
> logged when it was active (pg_xlog).
> >
> >> Does it first replay all the WAL in pg_xlog
> >> before archive? Should we make it check for a timeline history file
> in
> >> archive before it starts replaying any WAL?
> >
> > I have really not thought what is best solution for problem.
> >
> >> * And, would the new master, before forking the WAL, replay all the
> >> WAL that is necessary to come to state (of data directory) that the
> >> old master was just before it crashed?
> >
> > I don't think new master has any correlation with old master's data
> directory,
> > Rather it will replay the WAL it has received/flushed before start
> acting as master.
> when old master fail over, WAL which ahead of new master might be
> broken data. so that when user want to dump from old master, there is
> possible to fail dump.
> it is just idea, we extend parameter which is used in recovery.conf
> like 'follow_master_force'. this parameter accepts 'on' and 'off', is
> effective only when standby_mode is set to on.
>
> if both parameters 'follow_master_force' and 'standby_mode' is set to
> 'on',
> 1. when standby server starts and starts to recovery, standby server
> skip to apply WAL which is in  pg_xlog, and request WAL from latest
> checkpoint LSN to master server.
> 2. master server receives LSN which is standby server latest
> checkpoint, and compare between LSN of standby and LSN of master
> latest checkpoint. if those LSN match, master will send WAL from
> latest checkpoint LSN. if not, master will inform standby that failed.
> 3. standby will fork WAL, and apply WAL which is sent from master
> continuity.

Please consider if this solution has the same problem as mentioned by Robert Hass in below mail:
http://www.postgresql.org/message-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com


> in this approach, user who want to dump from old master will set 'off'
> to follow_master_force and standby_mode, and gets the dump of old
> master after master started. OTOH, user who want to starts replication
> force will set 'on' to both parameter.

I think before going into solution of this problem, it should be confirmed by others whether such a problem
needs to be resolved as part of this patch.

I have seen that Simon Riggs is a reviewer of this Patch and he hasn't mentioned his views about this problem.
So I think it's not worth inventing a solution.

Rather I think if all other things are resolved for this patch, then may be in end we can check with Committer,
if he thinks that this problem needs to be solved as a separate patch.

With Regards,
Amit Kapila.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

02 July 2013, 13:14:11

On Tue, Jul 2, 2013 at 2:45 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
> On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
>> On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com>
>> wrote:
>> > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
>> >> Hi,
>> >>
>> >> >
>> >> >> So our proposal on this problem is that we must ensure that
>> master
>> >> should
>> >> > not make any file system level changes without confirming that the
>> >> >> corresponding WAL record is replicated to the standby.
>> >> >
>> >> >   How will you take care of extra WAL on old master during
>> recovery.
>> >> If it
>> >> > plays the WAL which has not reached new-master, it can be a
>> problem.
>> >> >
>> >>
>> >> I am trying to understand how there would be extra WAL on old master
>> >> that it would replay and cause inconsistency. Consider how I am
>> >> picturing it and correct me if I am wrong.
>> >>
>> >> 1) Master crashes. So a failback standby becomes new master forking
>> the
>> >> WAL.
>> >> 2) Old master is restarted as a standby (now with this patch,
>> without
>> >> a new base backup).
>> >> 3) It would try to replay all the WAL it has available and later
>> >> connect to the new master also following the timeline switch (the
>> >> switch might happen using archived WAL and timeline history file OR
>> >> the new switch-over-streaming-replication-connection as of 9.3,
>> >> right?)
>> >>
>> >> * in (3), when the new standby/old master is replaying WAL, from
>> where
>> >> is it picking the WAL?
>> >    Yes, this is the point which can lead to inconsistency, new
>> standby/old master
>> >    will replay WAL after the last successful checkpoint, for which he
>> get info from
>> >    control file. It is picking WAL from the location where it was
>> logged when it was active (pg_xlog).
>> >
>> >> Does it first replay all the WAL in pg_xlog
>> >> before archive? Should we make it check for a timeline history file
>> in
>> >> archive before it starts replaying any WAL?
>> >
>> > I have really not thought what is best solution for problem.
>> >
>> >> * And, would the new master, before forking the WAL, replay all the
>> >> WAL that is necessary to come to state (of data directory) that the
>> >> old master was just before it crashed?
>> >
>> > I don't think new master has any correlation with old master's data
>> directory,
>> > Rather it will replay the WAL it has received/flushed before start
>> acting as master.
>> when old master fail over, WAL which ahead of new master might be
>> broken data. so that when user want to dump from old master, there is
>> possible to fail dump.
>> it is just idea, we extend parameter which is used in recovery.conf
>> like 'follow_master_force'. this parameter accepts 'on' and 'off', is
>> effective only when standby_mode is set to on.
>>
>> if both parameters 'follow_master_force' and 'standby_mode' is set to
>> 'on',
>> 1. when standby server starts and starts to recovery, standby server
>> skip to apply WAL which is in  pg_xlog, and request WAL from latest
>> checkpoint LSN to master server.
>> 2. master server receives LSN which is standby server latest
>> checkpoint, and compare between LSN of standby and LSN of master
>> latest checkpoint. if those LSN match, master will send WAL from
>> latest checkpoint LSN. if not, master will inform standby that failed.
>> 3. standby will fork WAL, and apply WAL which is sent from master
>> continuity.
>
> Please consider if this solution has the same problem as mentioned by Robert Hass in below mail:
> http://www.postgresql.org/message-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com
>
>
>> in this approach, user who want to dump from old master will set 'off'
>> to follow_master_force and standby_mode, and gets the dump of old
>> master after master started. OTOH, user who want to starts replication
>> force will set 'on' to both parameter.
>
> I think before going into solution of this problem, it should be confirmed by others whether such a problem
> needs to be resolved as part of this patch.
>
> I have seen that Simon Riggs is a reviewer of this Patch and he hasn't mentioned his views about this problem.
> So I think it's not worth inventing a solution.
> Rather I think if all other things are resolved for this patch, then may be in end we can check with Committer,
> if he thinks that this problem needs to be solved as a separate patch.
thank you for feedback.
yes, we can consider separately those problem. and we need to judge
that whether it is worth to invent a solution.
I think that solving the fundamental of this problem is complex. it
might be needs to big change to architecture of replication.
so I'm thinking that I'd like to deal of something when we do
recovery. if so, I think that if we deal at recovery time, impact to
performance is ignored.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Amit Kapila

Date:

02 July 2013, 13:19:49

On Tuesday, July 02, 2013 11:16 AM Amit Kapila wrote:
> On Friday, June 28, 2013 10:41 AM Sawada Masahiko wrote:
> > On Wed, Jun 26, 2013 at 1:40 PM, Amit Kapila <amit.kapila@huawei.com>
> > wrote:
> > > On Tuesday, June 25, 2013 10:23 AM Amit Langote wrote:
> > >> Hi,
> > >>
> > >> >
> > >> >> So our proposal on this problem is that we must ensure that
> > master
> > >> should
> > >> > not make any file system level changes without confirming that
> the
> > >> >> corresponding WAL record is replicated to the standby.
> > >> >
> > >> >   How will you take care of extra WAL on old master during
> > recovery.
> > >> If it
> > >> > plays the WAL which has not reached new-master, it can be a
> > problem.
> > >> >
> > >>
> > >> I am trying to understand how there would be extra WAL on old
> master
> > >> that it would replay and cause inconsistency. Consider how I am
> > >> picturing it and correct me if I am wrong.
> > >>
> > >> 1) Master crashes. So a failback standby becomes new master
> forking
> > the
> > >> WAL.
> > >> 2) Old master is restarted as a standby (now with this patch,
> > without
> > >> a new base backup).
> > >> 3) It would try to replay all the WAL it has available and later
> > >> connect to the new master also following the timeline switch (the
> > >> switch might happen using archived WAL and timeline history file
> OR
> > >> the new switch-over-streaming-replication-connection as of 9.3,
> > >> right?)
> > >>
> > >> * in (3), when the new standby/old master is replaying WAL, from
> > where
> > >> is it picking the WAL?
> > >    Yes, this is the point which can lead to inconsistency, new
> > standby/old master
> > >    will replay WAL after the last successful checkpoint, for which
> he
> > get info from
> > >    control file. It is picking WAL from the location where it was
> > logged when it was active (pg_xlog).
> > >
> > >> Does it first replay all the WAL in pg_xlog
> > >> before archive? Should we make it check for a timeline history
> file
> > in
> > >> archive before it starts replaying any WAL?
> > >
> > > I have really not thought what is best solution for problem.
> > >
> > >> * And, would the new master, before forking the WAL, replay all
> the
> > >> WAL that is necessary to come to state (of data directory) that
> the
> > >> old master was just before it crashed?
> > >
> > > I don't think new master has any correlation with old master's data
> > directory,
> > > Rather it will replay the WAL it has received/flushed before start
> > acting as master.
> > when old master fail over, WAL which ahead of new master might be
> > broken data. so that when user want to dump from old master, there is
> > possible to fail dump.
> > it is just idea, we extend parameter which is used in recovery.conf
> > like 'follow_master_force'. this parameter accepts 'on' and 'off', is
> > effective only when standby_mode is set to on.
> >
> > if both parameters 'follow_master_force' and 'standby_mode' is set to
> > 'on',
> > 1. when standby server starts and starts to recovery, standby server
> > skip to apply WAL which is in  pg_xlog, and request WAL from latest
> > checkpoint LSN to master server.
> > 2. master server receives LSN which is standby server latest
> > checkpoint, and compare between LSN of standby and LSN of master
> > latest checkpoint. if those LSN match, master will send WAL from
> > latest checkpoint LSN. if not, master will inform standby that
> failed.
> > 3. standby will fork WAL, and apply WAL which is sent from master
> > continuity.
> 
> Please consider if this solution has the same problem as mentioned by
> Robert Hass in below mail:

Sorry typo error, it's Robert Haas mail:

> http://www.postgresql.org/message-
> id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com
> 
> 
> > in this approach, user who want to dump from old master will set
> 'off'
> > to follow_master_force and standby_mode, and gets the dump of old
> > master after master started. OTOH, user who want to starts
> replication
> > force will set 'on' to both parameter.
> 
> I think before going into solution of this problem, it should be
> confirmed by others whether such a problem
> needs to be resolved as part of this patch.
> 
> I have seen that Simon Riggs is a reviewer of this Patch and he hasn't
> mentioned his views about this problem.
> So I think it's not worth inventing a solution.
> 
> Rather I think if all other things are resolved for this patch, then
> may be in end we can check with Committer,
> if he thinks that this problem needs to be solved as a separate patch.

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

07 July 2013, 07:19:08

On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
>> I agree. We should probably find a better name for this. Any suggestions ?
>
> err, I already made one...
>
>>> But that's not the whole story. I can see some utility in a patch that
>>> makes all WAL transfer synchronous, rather than just commits. Some
>>> name like synchronous_transfer might be appropriate. e.g.
>>> synchronous_transfer = all | commit (default).
>
>> Since commits are more foreground in nature and this feature
>> does not require us to wait during common foreground activities, we want a
>> configuration where master can wait for synchronous transfers at other than
>> commits. May we can solve that by having more granular control to the said
>> parameter ?
>>
>>>
>>> The idea of another slew of parameters that are very similar to
>>> synchronous replication but yet somehow different seems weird. I can't
>>> see a reason why we'd want a second lot of parameters. Why not just
>>> use the existing ones for sync rep? (I'm surprised the Parameter
>>> Police haven't visited you in the night...) Sure, we might want to
>>> expand the design for how we specify multi-node sync rep, but that is
>>> a different patch.
>>
>>
>> How would we then distinguish between synchronous and the new kind of
>> standby ?
>
> That's not the point. The point is "Why would we have a new kind of
> standby?" and therefore why do we need new parameters?
>
>> I am told, one of the very popular setups for DR is to have one
>> local sync standby and one async (may be cascaded by the local sync). Since
>> this new feature is more useful for DR because taking a fresh backup on a
>> slower link is even more challenging, IMHO we should support such setups.
>
> ...which still doesn't make sense to me. Lets look at that in detail.
>
> Take 3 servers, A, B, C with A and B being linked by sync rep, and C
> being safety standby at a distance.
>
> Either A or B is master, except in disaster. So if A is master, then B
> would be the failover target. If A fails, then you want to failover to
> B. Once B is the target, you want to failback to A as the master. C
> needs to follow the new master, whichever it is.
>
> If you set up sync rep between A and B and this new mode between A and
> C. When B becomes the master, you need to failback from B from A, but
> you can't because the new mode applied between A and C only, so you
> have to failback from C to A. So having the new mode not match with
> sync rep means you are forcing people to failback using the slow link
> in the common case.
>
> You might observe that having the two modes match causes problems if A
> and B fail, so you are forced to go to C as master and then eventually
> failback to A or B across a slow link. That case is less common and
> could be solved by extending sync transfer to more/multi nodes.
>
> It definitely doesn't make sense to have sync rep on anything other
> than a subset of sync transfer. So while it may be sensible in the
> future to make sync transfer a superset of sync rep nodes, it makes
> sense to make them the same config for now.
I have updated the patch.

we support following 2 cases.
1. SYNC server and also make same failback safe standby server
2. ASYNC server and also make same failback safe standby server

1.  changed name of parameter give up 'failback_safe_standby_names' parameter from the first patch. and changed name of
parameterfrom 'failback_safe_mode ' to
 
'synchronous_transfer'. this parameter accepts 'all', 'data_flush' and 'commit'.
 -'commit'   'commit' means that master waits for corresponding WAL to flushed
to disk of standby server on commits.   but master doesn't waits for replicated data pages.
 -'data_flush'   'data_flush' means that master waits for replicated data page
(e.g, CLOG, pg_control) before flush to disk of master server.   but if user set to 'data_flush' to this parameter,
'synchronous_commit' values is ignored even if user set
'synchronous_commit'.
 -'all'   'all' means that master waits for replicated WAL and data page.

2. put SyncRepWaitForLSN() function into XLogFlush() function we have put SyncRepWaitForLSN() function into XLogFlush()
function,
and change argument of XLogFlush().

they are setup case and need to set parameters.

- SYNC server and also make same failback safe standgy server (case 1) synchronous_transfer = all synchronous_commit =
remote_write/onsynchronous_standby_names = <ServerName>
 

- ASYNC server and also make same failback safe standgy server (case 2) synchronous_transfer = data_flush
(synchronous_commitvalues is ignored)
 

- default SYNC replication synchronous_transfer = commit synchronous_commit = on synchronous_standby_names =
<ServerName>

- default ASYNC replication synchronous_transfer = commit

ToDo
1. currently this patch supports synchronous transfer. so we can't set
different synchronous transfer mode to each server.   we need to improve the patch for support following cases.  - SYNC
standbyand make separate ASYNC failback safe standby  - ASYNC standby and make separate ASYNC failback safe standby
 

2. we have not measure performance yet. we need to measure perfomance.

please give me your feedback.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

07 July 2013, 07:27:44

On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>>
>>> I agree. We should probably find a better name for this. Any suggestions ?
>>
>> err, I already made one...
>>
>>>> But that's not the whole story. I can see some utility in a patch that
>>>> makes all WAL transfer synchronous, rather than just commits. Some
>>>> name like synchronous_transfer might be appropriate. e.g.
>>>> synchronous_transfer = all | commit (default).
>>
>>> Since commits are more foreground in nature and this feature
>>> does not require us to wait during common foreground activities, we want a
>>> configuration where master can wait for synchronous transfers at other than
>>> commits. May we can solve that by having more granular control to the said
>>> parameter ?
>>>
>>>>
>>>> The idea of another slew of parameters that are very similar to
>>>> synchronous replication but yet somehow different seems weird. I can't
>>>> see a reason why we'd want a second lot of parameters. Why not just
>>>> use the existing ones for sync rep? (I'm surprised the Parameter
>>>> Police haven't visited you in the night...) Sure, we might want to
>>>> expand the design for how we specify multi-node sync rep, but that is
>>>> a different patch.
>>>
>>>
>>> How would we then distinguish between synchronous and the new kind of
>>> standby ?
>>
>> That's not the point. The point is "Why would we have a new kind of
>> standby?" and therefore why do we need new parameters?
>>
>>> I am told, one of the very popular setups for DR is to have one
>>> local sync standby and one async (may be cascaded by the local sync). Since
>>> this new feature is more useful for DR because taking a fresh backup on a
>>> slower link is even more challenging, IMHO we should support such setups.
>>
>> ...which still doesn't make sense to me. Lets look at that in detail.
>>
>> Take 3 servers, A, B, C with A and B being linked by sync rep, and C
>> being safety standby at a distance.
>>
>> Either A or B is master, except in disaster. So if A is master, then B
>> would be the failover target. If A fails, then you want to failover to
>> B. Once B is the target, you want to failback to A as the master. C
>> needs to follow the new master, whichever it is.
>>
>> If you set up sync rep between A and B and this new mode between A and
>> C. When B becomes the master, you need to failback from B from A, but
>> you can't because the new mode applied between A and C only, so you
>> have to failback from C to A. So having the new mode not match with
>> sync rep means you are forcing people to failback using the slow link
>> in the common case.
>>
>> You might observe that having the two modes match causes problems if A
>> and B fail, so you are forced to go to C as master and then eventually
>> failback to A or B across a slow link. That case is less common and
>> could be solved by extending sync transfer to more/multi nodes.
>>
>> It definitely doesn't make sense to have sync rep on anything other
>> than a subset of sync transfer. So while it may be sensible in the
>> future to make sync transfer a superset of sync rep nodes, it makes
>> sense to make them the same config for now.
> I have updated the patch.
>
> we support following 2 cases.
> 1. SYNC server and also make same failback safe standby server
> 2. ASYNC server and also make same failback safe standby server
>
> 1.  changed name of parameter
>   give up 'failback_safe_standby_names' parameter from the first patch.
>   and changed name of parameter from 'failback_safe_mode ' to
> 'synchronous_transfer'.
>   this parameter accepts 'all', 'data_flush' and 'commit'.
>
>   -'commit'
>     'commit' means that master waits for corresponding WAL to flushed
> to disk of standby server on commits.
>     but master doesn't waits for replicated data pages.
>
>   -'data_flush'
>     'data_flush' means that master waits for replicated data page
> (e.g, CLOG, pg_control) before flush to disk of master server.
>     but if user set to 'data_flush' to this parameter,
> 'synchronous_commit' values is ignored even if user set
> 'synchronous_commit'.
>
>   -'all'
>     'all' means that master waits for replicated WAL and data page.
>
> 2. put SyncRepWaitForLSN() function into XLogFlush() function
>   we have put SyncRepWaitForLSN() function into XLogFlush() function,
> and change argument of XLogFlush().
>
> they are setup case and need to set parameters.
>
> - SYNC server and also make same failback safe standgy server (case 1)
>   synchronous_transfer = all
>   synchronous_commit = remote_write/on
>   synchronous_standby_names = <ServerName>
>
> - ASYNC server and also make same failback safe standgy server (case 2)
>   synchronous_transfer = data_flush
>   (synchronous_commit values is ignored)
>
> - default SYNC replication
>   synchronous_transfer = commit
>   synchronous_commit = on
>   synchronous_standby_names = <ServerName>
>
> - default ASYNC replication
>   synchronous_transfer = commit
>
> ToDo
> 1. currently this patch supports synchronous transfer. so we can't set
> different synchronous transfer mode to each server.
>     we need to improve the patch for support following cases.
>    - SYNC standby and make separate ASYNC failback safe standby
>    - ASYNC standby and make separate ASYNC failback safe standby
>
> 2. we have not measure performance yet. we need to measure perfomance.
>
> please give me your feedback.
>
> Regards,
>
> -------
> Sawada Masahiko

I'm sorry. I forgot attached the patch.
Please see the attached file.

Regards,

-------
Sawada Masahiko

Attachment

failback_safe_standby_v2.patch

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

09 July 2013, 14:46:08

On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Sun, Jul 7, 2013 at 4:19 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Mon, Jun 17, 2013 at 8:48 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 17 June 2013 09:03, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>>>
>>>> I agree. We should probably find a better name for this. Any suggestions ?
>>>
>>> err, I already made one...
>>>
>>>>> But that's not the whole story. I can see some utility in a patch that
>>>>> makes all WAL transfer synchronous, rather than just commits. Some
>>>>> name like synchronous_transfer might be appropriate. e.g.
>>>>> synchronous_transfer = all | commit (default).
>>>
>>>> Since commits are more foreground in nature and this feature
>>>> does not require us to wait during common foreground activities, we want a
>>>> configuration where master can wait for synchronous transfers at other than
>>>> commits. May we can solve that by having more granular control to the said
>>>> parameter ?
>>>>
>>>>>
>>>>> The idea of another slew of parameters that are very similar to
>>>>> synchronous replication but yet somehow different seems weird. I can't
>>>>> see a reason why we'd want a second lot of parameters. Why not just
>>>>> use the existing ones for sync rep? (I'm surprised the Parameter
>>>>> Police haven't visited you in the night...) Sure, we might want to
>>>>> expand the design for how we specify multi-node sync rep, but that is
>>>>> a different patch.
>>>>
>>>>
>>>> How would we then distinguish between synchronous and the new kind of
>>>> standby ?
>>>
>>> That's not the point. The point is "Why would we have a new kind of
>>> standby?" and therefore why do we need new parameters?
>>>
>>>> I am told, one of the very popular setups for DR is to have one
>>>> local sync standby and one async (may be cascaded by the local sync). Since
>>>> this new feature is more useful for DR because taking a fresh backup on a
>>>> slower link is even more challenging, IMHO we should support such setups.
>>>
>>> ...which still doesn't make sense to me. Lets look at that in detail.
>>>
>>> Take 3 servers, A, B, C with A and B being linked by sync rep, and C
>>> being safety standby at a distance.
>>>
>>> Either A or B is master, except in disaster. So if A is master, then B
>>> would be the failover target. If A fails, then you want to failover to
>>> B. Once B is the target, you want to failback to A as the master. C
>>> needs to follow the new master, whichever it is.
>>>
>>> If you set up sync rep between A and B and this new mode between A and
>>> C. When B becomes the master, you need to failback from B from A, but
>>> you can't because the new mode applied between A and C only, so you
>>> have to failback from C to A. So having the new mode not match with
>>> sync rep means you are forcing people to failback using the slow link
>>> in the common case.
>>>
>>> You might observe that having the two modes match causes problems if A
>>> and B fail, so you are forced to go to C as master and then eventually
>>> failback to A or B across a slow link. That case is less common and
>>> could be solved by extending sync transfer to more/multi nodes.
>>>
>>> It definitely doesn't make sense to have sync rep on anything other
>>> than a subset of sync transfer. So while it may be sensible in the
>>> future to make sync transfer a superset of sync rep nodes, it makes
>>> sense to make them the same config for now.
>> I have updated the patch.
>>
>> we support following 2 cases.
>> 1. SYNC server and also make same failback safe standby server
>> 2. ASYNC server and also make same failback safe standby server
>>
>> 1.  changed name of parameter
>>   give up 'failback_safe_standby_names' parameter from the first patch.
>>   and changed name of parameter from 'failback_safe_mode ' to
>> 'synchronous_transfer'.
>>   this parameter accepts 'all', 'data_flush' and 'commit'.
>>
>>   -'commit'
>>     'commit' means that master waits for corresponding WAL to flushed
>> to disk of standby server on commits.
>>     but master doesn't waits for replicated data pages.
>>
>>   -'data_flush'
>>     'data_flush' means that master waits for replicated data page
>> (e.g, CLOG, pg_control) before flush to disk of master server.
>>     but if user set to 'data_flush' to this parameter,
>> 'synchronous_commit' values is ignored even if user set
>> 'synchronous_commit'.
>>
>>   -'all'
>>     'all' means that master waits for replicated WAL and data page.
>>
>> 2. put SyncRepWaitForLSN() function into XLogFlush() function
>>   we have put SyncRepWaitForLSN() function into XLogFlush() function,
>> and change argument of XLogFlush().
>>
>> they are setup case and need to set parameters.
>>
>> - SYNC server and also make same failback safe standgy server (case 1)
>>   synchronous_transfer = all
>>   synchronous_commit = remote_write/on
>>   synchronous_standby_names = <ServerName>
>>
>> - ASYNC server and also make same failback safe standgy server (case 2)
>>   synchronous_transfer = data_flush
>>   (synchronous_commit values is ignored)
>>
>> - default SYNC replication
>>   synchronous_transfer = commit
>>   synchronous_commit = on
>>   synchronous_standby_names = <ServerName>
>>
>> - default ASYNC replication
>>   synchronous_transfer = commit
>>
>> ToDo
>> 1. currently this patch supports synchronous transfer. so we can't set
>> different synchronous transfer mode to each server.
>>     we need to improve the patch for support following cases.
>>    - SYNC standby and make separate ASYNC failback safe standby
>>    - ASYNC standby and make separate ASYNC failback safe standby
>>
>> 2. we have not measure performance yet. we need to measure perfomance.
>>
>> please give me your feedback.
>>
>> Regards,
>>
>> -------
>> Sawada Masahiko
>
> I'm sorry. I forgot attached the patch.
> Please see the attached file.
>
> Regards,
>
> -------
> Sawada Masahiko
I found a bug which occurred when we do vacuum, and have fixed it.
yesterday (8th July) "Improve scalability of WAL insertions" patch is
committed to HEAD. so v2 patch does not apply to HEAD now.
I also have fixed it to be applicable to HEAD.

please find the attached patch.

Regards,

-------
Sawada Masahiko

Attachment

failback_safe_standby_v3.patch

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

11 July 2013, 14:42:17

On Tue, Jul 9, 2013 at 11:45 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Sun, Jul 7, 2013 at 4:27 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> I found a bug which occurred when we do vacuum, and have fixed it.
> yesterday (8th July) "Improve scalability of WAL insertions" patch is
> committed to HEAD. so v2 patch does not apply to HEAD now.
> I also have fixed it to be applicable to HEAD.
>
> please find the attached patch.
>
> Regards,
>
> -------
> Sawada Masahiko

I have fixed that master server doesn't waits for the WAL to be
flushed to disk of standby when master server execute FlushBuffer(),
and have attached v4 patch.
please find the attached patch.

Regards,

-------
Sawada Masahiko

Attachment

failback_safe_standby_v4.patch

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

15 July 2013, 13:32:43

> ToDo
> 1. currently this patch supports synchronous transfer. so we can't set
> different synchronous transfer mode to each server.
>     we need to improve the patch for support following cases.
>    - SYNC standby and make separate ASYNC failback safe standby
>    - ASYNC standby and make separate ASYNC failback safe standby
>
> 2. we have not measure performance yet. we need to measure perfomance.

Here are the tests results, showing the performance overhead of the
patch ( failback_safe_standby_v4.patch ):

Tests are carried out in two different scenarios:
1.  Tests with Fast transaction workloads.
2.  Test with large loads.

Test Type-1: Tests with pgbench  (*Tests with fast transaction workloads*)

Notes:
1. These test are for testing the performance overhead caused by the
patch for fast transaction workloads.
2. Tests are performed with the pgbench benchmark and performance
measurement factor is the TPS value.
3. Values represents the TPS for 4 runs,and the last value represents
the average of all the runs.

Settings for tests:
transaction type: TPC-B (sort of)
scaling factor:     300
query mode: simple
number of clients:150
number of threads:1
duration:1800 s

Analysis of results:
1) Synchronous Replication :(753.06, 748.81,     748.38, 747.21, Avg- 747.2)    
2) Synchronous Replication + Failsafe standby (commit) :
(729.13,724.33 , 713.59 , 710.79, Avg- 719.46)
3) Synchronous Replication + Failsafe standby (all) : (692.08, 688.08,
711.23,711.62, Avg- 700.75)

4) Asynchronous Replication :(1008.42,     993.39, 986.80 ,1028.46 ,
Avg-1004.26 )
5) Asynchronous Replication + Failsafe standby (commit) : (974.49,
978.60 ,969.11, 957.18 , Avg- 969.84)
6) Asynchronous Replication + Failsafe standby (data_flush)    :
(1011.79, 992.05, 1030.20,940.50 , Avg- 993.63)

In above test results the performance numbers are very close to each
other, also because of noise they show variation. hence following is
approximate conclusion about the  overhead of patch.

1. Streaming replication + synchronous_transfer (all , data_flush)::
a)  On an average, synchronous replication combined with
synchronous_transfer (all ) causes 6.21 %  performance overhead.
b)  On an average, asynchronous streaming replication combined
synchronous_transfer (data_flush ) causes averagely 1.05 % performance
overhead.

2. Streaming replication + synchronous_transfer (commit):
a) On an average, synchronous replication combined with
synchronous_transfer (commit )  causes 3.71 % performance overhead.
b) On an average,  asynchronous streaming replication combined with
synchronous_transfer (commit)  causes averagely 3.42 % performance
overhead.


Test Type-2: Tests with pgbench -i (*Tests with large loads:*)

Notes:
1. These test are for testing the performance overhead caused by the
patch for large loads and index builds.
2. Tests are performed with the pgbench -i (initialization of test
data  i.e the time taken for creating tables of pgbench, inserting
tuples and building primary keys.)
3. The performance measurement factor is the wall clock time for
pgbench -i (measured with time command).
4. Values represents the Wall clock time for 4 runs and the last value
represents the average of all the runs.

pgbench settings:
Scale factor:  300 ( Database size - 4.3873 GB)

Test results:
1) Synchronous Replication : (126.98,  133.83,  127.77,  129.70,
Avg-129.57)  (second)
2) Synchronous Replication + synchronous_transfer (commit) : (132.87,
125.85,  133.91,  134.61,  Avg-131.81) (second)
3) Synchronous Replication + synchronous_transfer (all) : (133.59,
132.82, 134.20, 135.22,  Avg-133.95) (second)

4) Asynchronous Replication : ( 126.75 ,     136.95, 130.42,
127.77,    130.47) (second)
5) Asynchronous Replication + synchronous_transfer (commit) : (128.13,
133.06, 127.62, 130.70, Avg-129.87) (second)
6) Asynchronous Replication + synchronous_transfer (data_flush) :
(134.55    , 139.90, 144.47, 143.85, Avg-140.69) (second)

In above test results the performance numbers are very close to each
other, also because of noise they show variation. hence following is
approximate conclusion about the  overhead of patch.

1. Streaming replication + synchronous_transfer (all , data_flush)::
a)  On an average, synchronous replication combined with
synchronous_transfer (all ) causes 3.38 %  performance overhead.
b)  On an average, asynchronous streaming replication combined
synchronous_transfer (data_flush ) causes averagely 7.83 % performance
overhead.

2. Streaming replication + synchronous_transfer (commit):
a) On an average, synchronous replication combined with
synchronous_transfer (commit )  causes 1.72 % performance overhead.
b) On an average,  asynchronous streaming replication combined with
synchronous_transfer (commit)  causes averagely (-0.45) % performance
overhead.

The test results for both the cases (large loads and fast
transactions) shows variation because of noise, But we can observe
that approximately patch causes 3-4% performance overhead.


Regards,
Samrat Revgade

Re: Patch for fail-back without fresh backup

From

Peter Eisentraut

Date:

24 August 2013, 14:38:53

On Thu, 2013-07-11 at 23:42 +0900, Sawada Masahiko wrote:
> please find the attached patch.

Please fix these compiler warnings:

xlog.c:3117:2: warning: implicit declaration of function ‘SyncRepWaitForLSN’ [-Wimplicit-function-declaration]
syncrep.c:414:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable]

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

26 August 2013, 02:46:02

On Sat, Aug 24, 2013 at 11:38 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Thu, 2013-07-11 at 23:42 +0900, Sawada Masahiko wrote:
>> please find the attached patch.
>
> Please fix these compiler warnings:
>
> xlog.c:3117:2: warning: implicit declaration of function ‘SyncRepWaitForLSN’ [-Wimplicit-function-declaration]
> syncrep.c:414:6: warning: variable ‘numdataflush’ set but not used [-Wunused-but-set-variable]
Thank you for your information!

We are improving the patch for Commit Fest 2 now.
We will fix above compiler warnings as soon as possible and submit the patch

--
Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

12 September 2013, 07:01:00

We are improving the patch for Commit Fest 2 now.
We will fix above compiler warnings as soon as possible and submit the patch

Attached *synchronous_transfer_v5.patch* implements review comments from commit fest-1 and reduces the performance overhead of synchronous_transfer.

*synchronous_transfer_documentation_v1.patch* adds fail back safe standby mechanism in the PostgreSQL documentation.

Sawada-san worked very hard to get this thing done, most of the work is done by him. I really appreciate his efforts :)

**Brief description of suggestions from commit fest -1 are as follows:

1) Fail back-safe standby is not an appropriate name [done - changed it to synchronous_transfer].

2) Remove extra set of postgresql.conf parameters [ done - Now there is only one additional postgresql.conf parameter *synchronous_transfer* which controls the synchronous nature of WAL transfer].

3) Performance overhead measurement [ done- with fast transaction workloads and large loads index builds ].

4) Put the SyncRepWaitForLSN inside XLogFlush [ Not the correct way - as SyncRepWaitForLSN will go in critical section and any error ( it will do network I/O and may also sleep) inside critical section leads to server PANIC and restart ].

5) Old master's WAL ahead of new master's WAL - [ we overcome with this by deleting all WAL files of old master details can be found here : https://wiki.postgresql.org/wiki/Synchronous_Transfer]

**Changes to postgresql.conf to configure fail back safe standby:

1) Synchronous fail-back safe standby

synchronous_standby_names = <server name>

synchronous_transfer = all

2) Asynchronous fail-back safe standby

synchronous_standby_names = <server name>

synchronous_transfer = data_flush

3) Pure synchronous standby

synchronous_standby_names = <server name>

synchronous_transfer = commit

4) Pure asynchronous standby

synchronous_transfer = commit

**Restriction:

If multiple standby servers connect to the master, then the standby with synchronous replication becomes failback safe standby.

for example: if there are 2 standby servers which connects to master server (one is SYNC, another one is ASYNC) and synchronous_transfer is set 'all'.

Then SYNC standby becomes failback safe standby and master server will wait only for SYNC standby server.

**Performance overhead of synchronous_transfer patch:

Tests are performed with pgbench benchmark with following configuration options:

Transaction type: TPC-B (sort of)

Scaling factor: 300

Query mode: simple

Number of clients: 150

Number of threads: 1

Duration: 1800 s

Real time scenarios mostly based on fast transaction workloads for which synchronous_transfer have negligible overhead.

** 1. Test for fast transaction workloads [measured w.r.t default replication in PostgreSQL, pgbench benchmark - TPS value]:

a. On an average performance overhead caused by synchronous standby: 0.0102 %.

b. On an average performance overhead caused by synchronous failback safe standby: 0.2943 %.

c. On an average performance overhead caused by Asynchronous standby: 0.04321 %.

d. On an average performance overhead caused by asynchronous failback safe standby: 0.5141 %

**2. Test for large loads and index builds [measured w.r.t default replication in PostgreSQL, pgbench benchmark (-i option) - time in seconds]:

a. On an average performance overhead caused by synchronous standby: 3.51 %.

b. On an average performance overhead caused by synchronous failback safe standby: 14.88%.

c. On an average performance overhead caused by Asynchronous standby: 0.4887%.

d. On an average performance overhead caused by asynchronous failback safe standby: 10.19%

**TO-DO:

More discussion is needed regarding usefulness/need and priority on following. any feedback is appreciated:

1. Support for multiple fail back safe standbys.

2. Current design of patch will wait forever for the failback safe standby like Streaming replication.

3. Support for cascaded failback standby

---

Regards,

Samrat Revagade

Attachment

Re: Patch for fail-back without fresh backup

From

Peter Eisentraut

Date:

12 September 2013, 16:11:46

On 9/12/13 3:00 AM, Samrat Revagade wrote:
> 
>     We are improving the patch for Commit Fest 2 now.
>     We will fix above compiler warnings as soon as possible and submit
>     the patch
> 
> 
> Attached *synchronous_transfer_v5.patch* implements review comments from
> commit fest-1 and reduces the performance overhead of synchronous_transfer.

There is still this compiler warning:

syncrep.c: In function ‘SyncRepReleaseWaiters’:
syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
[-Wunused-but-set-variable]

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

13 September 2013, 01:17:34

On Fri, Sep 13, 2013 at 1:11 AM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 9/12/13 3:00 AM, Samrat Revagade wrote:
>>
>>     We are improving the patch for Commit Fest 2 now.
>>     We will fix above compiler warnings as soon as possible and submit
>>     the patch
>>
>>
>> Attached *synchronous_transfer_v5.patch* implements review comments from
>> commit fest-1 and reduces the performance overhead of synchronous_transfer.
>
> There is still this compiler warning:
>
> syncrep.c: In function ‘SyncRepReleaseWaiters’:
> syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
> [-Wunused-but-set-variable]
>

Sorry I forgot fix it.

I have attached the patch which I modified.

Regards,

-------
Sawada Masahiko

Attachment

synchronous_transfer_v6.patch

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

17 September 2013, 06:46:03

> syncrep.c: In function ‘SyncRepReleaseWaiters’:
> syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
> [-Wunused-but-set-variable]
>

Sorry I forgot fix it.

I have attached the patch which I modified.

Attached patch combines documentation patch and source-code patch.

Regards,

Samrat Revgade

Attachment

synchronous_transfer_v7.patch

Re: Patch for fail-back without fresh backup

From

Fujii Masao

Date:

17 September 2013, 12:53:08

On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>
>> > syncrep.c: In function ‘SyncRepReleaseWaiters’:
>> > syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
>> > [-Wunused-but-set-variable]
>> >
>>
>> Sorry I forgot fix it.
>>
>> I have attached the patch which I modified.
>>
>
> Attached patch combines documentation patch and source-code patch.

I set up synchronous replication with synchronous_transfer = all, and then I ran
pgbench -i and executed CHECKPOINT in the master. After that, when I executed
CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
synchronous_transfer feature.

How does synchronous_transfer work with cascade replication? If it's set to all
in the "sender-side" standby, it can resolve the data page inconsistency between
two standbys?

Regards,

--
Fujii Masao

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

18 September 2013, 01:35:58

On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Sep 17, 2013 at 3:45 PM, Samrat Revagade
> <revagade.samrat@gmail.com> wrote:
>>
>>> > syncrep.c: In function ‘SyncRepReleaseWaiters’:
>>> > syncrep.c:421:6: warning: variable ‘numdataflush’ set but not used
>>> > [-Wunused-but-set-variable]
>>> >
>>>
>>> Sorry I forgot fix it.
>>>
>>> I have attached the patch which I modified.
>>>
>>
>> Attached patch combines documentation patch and source-code patch.
>
> I set up synchronous replication with synchronous_transfer = all, and then I ran
> pgbench -i and executed CHECKPOINT in the master. After that, when I executed
> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
> synchronous_transfer feature.

Did you set synchronous_standby_names in the standby server?
If so, the master server waits for the standby server which is set to
synchronous_standby_names.
Please let me know detail of this case.

>
> How does synchronous_transfer work with cascade replication? If it's set to all
> in the "sender-side" standby, it can resolve the data page inconsistency between
> two standbys?
>

Currently patch supports the case which two servers are set up SYNC replication.
IWO, failback safe standby is the same as SYNC replication standby.
User can set synchronous_transfer in only master side.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Fujii Masao

Date:

18 September 2013, 02:45:35

On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> I set up synchronous replication with synchronous_transfer = all, and then I ran
>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed
>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
>> synchronous_transfer feature.
>
> Did you set synchronous_standby_names in the standby server?

Yes.

> If so, the master server waits for the standby server which is set to
> synchronous_standby_names.
> Please let me know detail of this case.

Both master and standby have the same postgresql.conf settings as follows:
   max_wal_senders = 4   wal_level = hot_standby   wal_keep_segments = 32   synchronous_standby_names = '*'
synchronous_transfer= all
 

>> How does synchronous_transfer work with cascade replication? If it's set to all
>> in the "sender-side" standby, it can resolve the data page inconsistency between
>> two standbys?
>>
>
> Currently patch supports the case which two servers are set up SYNC replication.
> IWO, failback safe standby is the same as SYNC replication standby.
> User can set synchronous_transfer in only master side.

So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.

Regards,

-- 
Fujii Masao

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

18 September 2013, 04:06:03

On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> I set up synchronous replication with synchronous_transfer = all, and then I ran
>>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed
>>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
>>> synchronous_transfer feature.
>>
>> Did you set synchronous_standby_names in the standby server?
>
> Yes.
>
>> If so, the master server waits for the standby server which is set to
>> synchronous_standby_names.
>> Please let me know detail of this case.
>
> Both master and standby have the same postgresql.conf settings as follows:
>
>     max_wal_senders = 4
>     wal_level = hot_standby
>     wal_keep_segments = 32
>     synchronous_standby_names = '*'
>     synchronous_transfer = all
>
>>> How does synchronous_transfer work with cascade replication? If it's set to all
>>> in the "sender-side" standby, it can resolve the data page inconsistency between
>>> two standbys?
>>>
>>
>> Currently patch supports the case which two servers are set up SYNC replication.
>> IWO, failback safe standby is the same as SYNC replication standby.
>> User can set synchronous_transfer in only master side.
>
> So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
>

yes I think so.
I was not considering that user set synchronous_standby_names in the
standby server.
it will ocurr
I will fix it considering this case.



Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

18 September 2013, 04:19:12

On Wed, Sep 18, 2013 at 1:05 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>> I set up synchronous replication with synchronous_transfer = all, and then I ran
>>>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed
>>>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
>>>> synchronous_transfer feature.
>>>
>>> Did you set synchronous_standby_names in the standby server?
>>
>> Yes.
>>
>>> If so, the master server waits for the standby server which is set to
>>> synchronous_standby_names.
>>> Please let me know detail of this case.
>>
>> Both master and standby have the same postgresql.conf settings as follows:
>>
>>     max_wal_senders = 4
>>     wal_level = hot_standby
>>     wal_keep_segments = 32
>>     synchronous_standby_names = '*'
>>     synchronous_transfer = all
>>
>>>> How does synchronous_transfer work with cascade replication? If it's set to all
>>>> in the "sender-side" standby, it can resolve the data page inconsistency between
>>>> two standbys?
>>>>
>>>
>>> Currently patch supports the case which two servers are set up SYNC replication.
>>> IWO, failback safe standby is the same as SYNC replication standby.
>>> User can set synchronous_transfer in only master side.
>>
>> So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
>>

Sorry I sent mail by mistake.

yes I think so.

It waits for corresponding WAL replicated.
Behaviour of synchronous_transfer is similar to
synchronous_standby_names and synchronous replication little.
That is, if  those parameter is set but the standby server doesn't
connect to the master server,
the master server waits for corresponding WAL replicated to standby
server infinitely.

I was not considering that user set synchronous_standby_names in the
standby server.
I will fix it considering this case.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

19 September 2013, 02:48:43

On Wed, Sep 18, 2013 at 11:45 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Wed, Sep 18, 2013 at 10:35 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Tue, Sep 17, 2013 at 9:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> I set up synchronous replication with synchronous_transfer = all, and then I ran
>>> pgbench -i and executed CHECKPOINT in the master. After that, when I executed
>>> CHECKPOINT in the standby, it got stuck infinitely. I guess this was cased by
>>> synchronous_transfer feature.
>>
>> Did you set synchronous_standby_names in the standby server?
>
> Yes.
>
>> If so, the master server waits for the standby server which is set to
>> synchronous_standby_names.
>> Please let me know detail of this case.
>
> Both master and standby have the same postgresql.conf settings as follows:
>
>     max_wal_senders = 4
>     wal_level = hot_standby
>     wal_keep_segments = 32
>     synchronous_standby_names = '*'
>     synchronous_transfer = all
>
>>> How does synchronous_transfer work with cascade replication? If it's set to all
>>> in the "sender-side" standby, it can resolve the data page inconsistency between
>>> two standbys?
>>>
>>
>> Currently patch supports the case which two servers are set up SYNC replication.
>> IWO, failback safe standby is the same as SYNC replication standby.
>> User can set synchronous_transfer in only master side.
>
> So, it's very strange that CHECKPOINT on the standby gets stuck infinitely.
>

I attached the patch which I have modified.

I have modified that if both synchronous replication and synchronous
transfer are requested,
but the server still in recovery(i.g., the server is in standby mode),
the server doesn't wait for
corresponding WAL replicated.
Specifically, I added condition RecoveryInProgress().

If both functions(synchronous replication and transfer) are set and
user sets up synchronous replication between two servers,
user can executes CHECKPOINT on standby side. It will not wait for
corresponding WAL replicated.
But, If both parameter are set and user doesn't set up synchronous
replication(i.g., the master server works alone),
the master server waits infinitely when user executes CHECKPOINT. This
behaviour is similar to synchronous replication.

Regards,

-------
Sawada Masahiko

Attachment

synchronous_transfer_v8.patch

Re: Patch for fail-back without fresh backup

From

Fujii Masao

Date:

19 September 2013, 03:25:17

On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> I attached the patch which I have modified.

Thanks for updating the patch!

Here are the review comments:

I got the compiler warning:
   syncrep.c:112: warning: unused variable 'i'

How does synchronous_transfer work with synchronous_commit?

+ * accept all the likely variants of "off".

This comment should be removed because synchronous_transfer
doesn't accept the value "off".

+    {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},

ISTM the third value "true" should be "false".

+    {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},

Why is this needed?

+        elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <=
sentPtr(%X/%X) AND sendTImeLine",
+             (uint32) (sendTimeLineValidUpto >> 32), (uint32)
sendTimeLineValidUpto,
+             (uint32) (sentPtr >> 32), (uint32) sentPtr);

Why is this needed?

+#define SYNC_REP_WAIT_FLUSH                    1
+#define SYNC_REP_WAIT_DATA_FLUSH    2

Why do we need to separate the wait-queue for wait-data-flush
from that for wait-flush? ISTM that wait-data-flush also can
wait for the replication on the wait-queue for wait-flush, and
which would simplify the patch.

Regards,

-- 
Fujii Masao

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

19 September 2013, 10:07:25

On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> I attached the patch which I have modified.
>
> Thanks for updating the patch!
>
> Here are the review comments:
>

Thank you for reviewing!

> I got the compiler warning:
>
>     syncrep.c:112: warning: unused variable 'i'
>
> How does synchronous_transfer work with synchronous_commit?

The currently patch synchronous_transfer doesn't work when
synchronous_commit is set 'off' or 'local'.
if user changes synchronous_commit value on transaction, checkpointer
process can't see it.
Due to that, even if synchronous_commit is changed to 'off' from 'on',
synchronous_transfer doesn't work.
I'm planning to modify the patch so that synchronous_transfer is not
affected by synchronous_commit.

>
> + * accept all the likely variants of "off".
>
> This comment should be removed because synchronous_transfer
> doesn't accept the value "off".
>
> +    {"commit", SYNCHRONOUS_TRANSFER_COMMIT, true},
>
> ISTM the third value "true" should be "false".
>
> +    {"0", SYNCHRONOUS_TRANSFER_COMMIT, true},
>
> Why is this needed?
>
> +        elog(WARNING, "XLogSend sendTimeLineValidUpto(%X/%X) <=
> sentPtr(%X/%X) AND sendTImeLine",
> +             (uint32) (sendTimeLineValidUpto >> 32), (uint32)
> sendTimeLineValidUpto,
> +             (uint32) (sentPtr >> 32), (uint32) sentPtr);
>
> Why is this needed?
>

They are unnecessary. I had forgot to remove unnecessary codes.

> +#define SYNC_REP_WAIT_FLUSH                    1
> +#define SYNC_REP_WAIT_DATA_FLUSH    2
>
> Why do we need to separate the wait-queue for wait-data-flush
> from that for wait-flush? ISTM that wait-data-flush also can
> wait for the replication on the wait-queue for wait-flush, and
> which would simplify the patch.
>

Yes, it seems not necessary to add queue newly.
I will delete SYNC_REP_WAIT_DATA_FLUSH and related that.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Fujii Masao

Date:

19 September 2013, 10:32:18

On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>> I attached the patch which I have modified.
>>
>> Thanks for updating the patch!
>>
>> Here are the review comments:
>>
>
> Thank you for reviewing!
>
>> I got the compiler warning:
>>
>>     syncrep.c:112: warning: unused variable 'i'
>>
>> How does synchronous_transfer work with synchronous_commit?
>
> The currently patch synchronous_transfer doesn't work when
> synchronous_commit is set 'off' or 'local'.
> if user changes synchronous_commit value on transaction, checkpointer
> process can't see it.
> Due to that, even if synchronous_commit is changed to 'off' from 'on',
> synchronous_transfer doesn't work.
> I'm planning to modify the patch so that synchronous_transfer is not
> affected by synchronous_commit.

Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are

(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby

(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to OS in the standby.

(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.

Regards,

-- 
Fujii Masao

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

19 September 2013, 13:41:55

On Thu, Sep 19, 2013 at 7:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Sep 19, 2013 at 7:07 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Thu, Sep 19, 2013 at 12:25 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Thu, Sep 19, 2013 at 11:48 AM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>>>> I attached the patch which I have modified.
>>>
>>> Thanks for updating the patch!
>>>
>>> Here are the review comments:
>>>
>>
>> Thank you for reviewing!
>>
>>> I got the compiler warning:
>>>
>>>     syncrep.c:112: warning: unused variable 'i'
>>>
>>> How does synchronous_transfer work with synchronous_commit?
>>
>> The currently patch synchronous_transfer doesn't work when
>> synchronous_commit is set 'off' or 'local'.
>> if user changes synchronous_commit value on transaction, checkpointer
>> process can't see it.
>> Due to that, even if synchronous_commit is changed to 'off' from 'on',
>> synchronous_transfer doesn't work.
>> I'm planning to modify the patch so that synchronous_transfer is not
>> affected by synchronous_commit.
>
> Hmm... when synchronous_transfer is set to data_flush,
> IMO the intuitive behaviors are
>
> (1) synchronous_commit = on
> A data flush should wait for the corresponding WAL to be
> flushed in the standby
>
> (2) synchronous_commit = remote_write
> A data flush should wait for the corresponding WAL to be
> written to OS in the standby.
>
> (3) synchronous_commit = local
> (4) synchronous_commit = off
> A data flush should wait for the corresponding WAL to be
> written locally in the master.
>

It is good idea.
So synchronous_commit value need to be visible from other process.
To share synchronous_commit with other process, I will try to put
synchronous_commit
value into shared buffer.
Is there already the guc parameter which is shared with other process?
I tried to find such parameter, but there was not it.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sameer Thakur

Date:

20 September 2013, 10:10:49

>Attached patch combines documentation patch and source-code patch.

I have had a stab at reviewing the documentation. Have a look.

--- a/doc/src/sgml/config.sgml

+++ b/doc/src/sgml/config.sgml

@@ -1749,6 +1749,50 @@ include 'filename'

</listitem>

</varlistentry>

+ <varlistentry id="guc-synchronous-transfer" xreflabel="synchronous_transfer">

+ <term><varname>synchronous_transfer</varname> (<type>enum</type>)</term>

+ <indexterm>

+ <primary><varname>synchronous_transfer</> configuration parameter</primary>

+ </indexterm>

+ <listitem>

+ <para>

+ This parameter controls the synchronous nature of WAL transfer and

+ maintains file system level consistency between master server and

+ standby server. It specifies whether master server will wait for file

+ system level change (for example : modifying data page) before

+ the corresponding WAL records are replicated to the standby server.

+ </para>

+ <para>

+ Valid values are <literal>commit</>, <literal>data_flush</> and

+ <literal>all</>. The default value is <literal>commit</>, meaning

+ that master will only wait for transaction commits, this is equivalent

+ to turning off <literal>synchronous_transfer</> parameter and standby

+ server will behave as a <quote>synchronous standby </> in

+ Streaming Replication. For value <literal>data_flush</>, master will

+ wait only for data page modifications but not for transaction

+ commits, hence the standby server will act as <quote>asynchronous

+ failback safe standby</>. For value <literal> all</>, master will wait

+ for data page modifications as well as for transaction commits and

+ resultant standby server will act as <quote>synchronous failback safe

+ standby</>.The wait is on background activities and hence will not create performance overhead.

+ To configure synchronous failback safe standby

+ <xref linkend="guc-synchronous-standby-names"> should be set.

+ </para>

+ </listitem>

+ </varlistentry>

@@ -2258,14 +2302,25 @@ include 'filename'</indexterm>

<para>

- Specifies a comma-separated list of standby names that can support

- <firstterm>synchronous replication</>, as described in

- <xref linkend="synchronous-replication">.

- At any one time there will be at most one active synchronous standby;

- transactions waiting for commit will be allowed to proceed after

- this standby server confirms receipt of their data.

- The synchronous standby will be the first standby named in this list

- that is both currently connected and streaming data in real-time

+ Specifies a comma-separated list of standby names. If this parameter

+ is set then standby will behave as synchronous standby in replication,

+ as described in <xref linkend="synchronous-replication"> or synchronous

+ failback safe standby, as described in <xref linkend="failback-safe">.

+ At any time there will be at most one active standby; when standby is

+ synchronous standby in replication, transactions waiting for commit

+ will be allowed to proceed after this standby server confirms receipt

+ of their data. But when standby is synchronous failback safe standby

+ data page modifications as well as transaction commits will be allowed

+ to proceed only after this standby server confirms receipt of their data.

+ If this parameter is set to empty value and

+ <xref linkend="guc-synchronous-transfer"> is set to <literal>data_flush</>

+ then standby is called as asynchronous failback safe standby and only

+ data page modifications will wait before corresponding WAL record is

+ replicated to standby.

+ </para>

+ <para>

+ Synchronous standby in replication will be the first standby named in

+ this list that is both currently connected and streaming data in real-time

(as shown by a state of <literal>streaming</literal> in the

<literal>pg_stat_replication</></link> view).

--- a/doc/src/sgml/high-availability.sgml

+++ b/doc/src/sgml/high-availability.sgml

+ <sect2 id="failback-safe">

+ <title>Setting up failback safe standby</title>

+ <indexterm zone="high-availability">

+ <primary>Setting up failback safe standby</primary>

+ </indexterm>

+ <para>

+ PostgreSQL streaming replication offers durability, but if the master crashes and

+a particular WAL record is unable to reach to standby server, then that

+WAL record is present on master server but not on standby server.

+In such a case master is ahead of standby server in term of WAL records and data in database.

+This leads to file-system level inconsistency between master and standby server.

+For example a heap page update on the master might not have been reflected on standby when master crashes.

+ </para>

+ <para>

+Due to this inconsistency, fresh backup of new master onto new standby is needed to re-prepare HA cluster.

+Taking fresh backup can be a very time consuming process when database is of large size. In such a case, disaster recovery

+can take very long time, if streaming replication is used to setup the high availability cluster.

+ </para>

+ <para>

+If HA cluster is configured with failback safe standby then this fresh back up can be avoided.

+The <xref linkend="guc-synchronous-transfer"> parameter has control over all WAL transfers and

+will not make any file system level change until master gets a confirmation from standby server.

+This avoids the need of a fresh backup by maintaining consistency.

+ </para>

+ <sect3 id="Failback-safe-config">

+ <title>Basic Configuration</title>

+ <para>

+ Failback safe standby can be asynchronous or synchronous in nature.

+ This will depend upon whether master will wait for transaction commit

+ or not. By default failback safe mechanism is turned off.

+ </para>

+ <para>

+ The first step to configure HA with failback safe standby is to setup

+ streaming replication. Configuring synchronous failback safe standby

+ requires setting up <xref linkend="guc-synchronous-transfer"> to

+ <literal>all</> and <xref linkend="guc-synchronous-standby-names">

+ must be set to a non-empty value. This configuration will cause each

+ commit and data page modification to wait for confirmation that standby

+ has written corresponding WAL record to durable storage. Configuring

+ asynchronous failback safe standby requires only setting up

+ <xref linkend="guc-synchronous-transfer"> to <literal> data_flush</>.

+ This configuration will cause only data page modifications to wait

+ for confirmation that standby has written corresponding WAL record

+ to durable storage.

+ </para>

+ </sect3>

+ </sect2>

</sect1>

</para>

<para>

- So, switching from primary to standby server can be fast but requires

- some time to re-prepare the failover cluster. Regular switching from

- primary to standby is useful, since it allows regular downtime on

- each system for maintenance. This also serves as a test of the

- failover mechanism to ensure that it will really work when you need it.

- Written administration procedures are advised.

+ At the time of failover there is a possibility of file-system level

+ inconsistency between the old primary and the old standby server and hence

+ a fresh backup from new master onto old master is needed for configuring

+ the old primary server as a new standby server. Without taking fresh

+ backup even if the new standby starts, streaming replication does not

+ start successfully. The activity of taking backup can be fast for smaller

+ databases but for a large database this activity requires more time to re-prepare the

+ failover cluster in streaming replication configuration of HA cluster.

+ This could break the service level agreement for crash

+ recovery. The need of fresh backup and problem of long

+ recovery time can be solved by using if HA cluster is configured with

+ failback safe standby see <xref linkend="failback-safe">.

+ Failback safe standby allows synchronous WAL transfer at required places

+ while maintaining the file-system level consistency between master and standby

+ server, without having backup to be taken on the old master.

+ </para>

+ <para>

+ Regular switching from primary to standby is useful, since it allows

+ regular downtime on each system for maintenance. This also serves as

+ a test of the failover mechanism to ensure that it will really work

+ when you need it. Written administration procedures are advised.

</para>

<para>

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml

index 2af1738..da3820f 100644

--- a/doc/src/sgml/perform.sgml

+++ b/doc/src/sgml/perform.sgml

</para>

</listitem>

+ <listitem>

+ <para>

+ Set <xref linkend="guc-synchronous-transfer"> to commit; there is no

+ need to guard against database inconsistency between master and standby during failover.

+ </para>

+ </listitem>

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

20 September 2013, 13:33:54

On Fri, Sep 20, 2013 at 3:40 PM, Sameer Thakur <samthakur74@gmail.com> wrote:

>Attached patch combines documentation patch and source-code patch.

I have had a stab at reviewing the documentation. Have a look.

Thanks.

Attached patch implements suggestions in documentation.

But comments from Fujii-san still needs to be implemented .

We will implement them soon.

Attachment

synchronous_transfer_v9.patch

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

22 September 2013, 14:57:03

On Fri, Sep 20, 2013 at 10:33 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
>
>
>
> On Fri, Sep 20, 2013 at 3:40 PM, Sameer Thakur <samthakur74@gmail.com>
> wrote:
>>
>>
>>
>>
>>>>
>>>
>>> >Attached patch combines documentation patch and source-code patch.
>>
>>
>> I have had a stab at reviewing the documentation. Have a look.
>>
>
> Thanks.
> Attached patch implements suggestions in documentation.
> But comments from Fujii-san still needs to be implemented .
> We will implement them soon.
>

I have attached the patch which modify based on Fujii-san suggested.

If synchronous_transfer is set 'data_flush', behaviour of
synchronous_transfer with synchronous_commit is

(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby

(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to OS in the standby.

(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.

Even if user changes synchronous_commit value in transaction,
other process (e.g. checkpointer process) can't confirm it.
Currently patch, each processes uses locally synchronous_commit.

Regards,

-------
Sawada Masahiko

Attachment

synchronous_transfer_v10.patch

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

26 September 2013, 11:55:24

On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Hmm... when synchronous_transfer is set to data_flush,
IMO the intuitive behaviors are

(1) synchronous_commit = on
A data flush should wait for the corresponding WAL to be
flushed in the standby

(2) synchronous_commit = remote_write
A data flush should wait for the corresponding WAL to be
written to OS in the standby.

(3) synchronous_commit = local
(4) synchronous_commit = off
A data flush should wait for the corresponding WAL to be
written locally in the master.

I thought synchronous_commit and synchronous_transfer are kind of orthogonal to each other. synchronous_commit only controls whether and how to wait for the standby only when a transaction commits. synchronous_transfer OTOH tells how to interpret the standby listed in synchronous_standbys parameter. If set to "commit" then they are synchronous standbys (like today). If set to "data_flush", they are asynchronous failback safe standby and if set to "all" then they are synchronous failback safe standbys. Well, its confusing :-(

So IMHO in the current state of things, the synchronous_transfer GUC can not be changed at a session/transaction level since all backends, including background workers must honor the settings to guarantee failback safety. synchronous_commit still works the same way, but is ignored if synchronous_transfer is set to "data_flush" because that effectively tells us that the standbys listed under synchronous_standbys are really *async* standbys with failback safety.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

27 September 2013, 07:59:19

On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
> On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>>
>>
>> Hmm... when synchronous_transfer is set to data_flush,
>> IMO the intuitive behaviors are
>>
>> (1) synchronous_commit = on
>> A data flush should wait for the corresponding WAL to be
>> flushed in the standby
>>
>> (2) synchronous_commit = remote_write
>> A data flush should wait for the corresponding WAL to be
>> written to OS in the standby.
>>
>> (3) synchronous_commit = local
>> (4) synchronous_commit = off
>> A data flush should wait for the corresponding WAL to be
>> written locally in the master.
>
>
> I thought synchronous_commit and synchronous_transfer are kind of orthogonal
> to each other. synchronous_commit only controls whether and how to wait for
> the standby only when a transaction commits. synchronous_transfer OTOH tells
> how to interpret the standby listed in synchronous_standbys parameter. If
> set to "commit" then they are synchronous standbys (like today). If set to
> "data_flush", they are asynchronous failback safe standby and if set to
> "all" then they are synchronous failback safe standbys. Well, its confusing
> :-(
>
> So IMHO in the current state of things, the synchronous_transfer GUC can not
> be changed at a session/transaction level since all backends, including
> background workers must honor the settings to guarantee failback safety.
> synchronous_commit still works the same way, but is ignored if
> synchronous_transfer is set to "data_flush" because that effectively tells
> us that the standbys listed under synchronous_standbys are really *async*
> standbys with failback safety.
>

Thank you for comment. I think it is good simple idea.
In your opinion, if synchronous_transfer is set 'all' and
synchronous_commit is set 'on',
the master wait for data flush eve if user sets synchronous_commit to
'local' or 'off'.
For example, when user want to do transaction early, user can't do this.
we leave the such situation as constraint?

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

27 September 2013, 08:19:22

On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

>

Thank you for comment. I think it is good simple idea.
In your opinion, if synchronous_transfer is set 'all' and
synchronous_commit is set 'on',
the master wait for data flush eve if user sets synchronous_commit to
'local' or 'off'.
For example, when user want to do transaction early, user can't do this.
we leave the such situation as constraint?

No, user can still override the transaction commit point wait. So if

synchronous_transfer is set to "all":

- If synchronous_commit is ON - wait at all points

- If synchronous_commit is OFF - wait only at buffer flush (and other related to failback safety) points

synchronous_transfer is set to "data_flush":

- If synchronous_commit is either ON o OFF - do not wait at commit points, but wait at all other points

synchronous_transfer is set to "commit":

- If synchronous_commit is ON - wait at commit point

- If synchronous_commit is OFF - do not wait at any point

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

27 September 2013, 09:45:06

On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>>
>> >
>>
>> Thank you for comment. I think it is good simple idea.
>> In your opinion, if synchronous_transfer is set 'all' and
>> synchronous_commit is set 'on',
>> the master wait for data flush eve if user sets synchronous_commit to
>> 'local' or 'off'.
>> For example, when user want to do transaction early, user can't do this.
>> we leave the such situation as constraint?
>>
>
> No, user can still override the transaction commit point wait. So if
>
> synchronous_transfer is set to "all":
>  - If synchronous_commit is ON - wait at all points
>  - If synchronous_commit is OFF - wait only at buffer flush (and other
> related to failback safety) points
>
> synchronous_transfer is set to "data_flush":
>  - If synchronous_commit is either ON o OFF - do not wait at commit points,
> but wait at all other points
>
> synchronous_transfer is set to "commit":
>  - If synchronous_commit is ON - wait at commit point
>  - If synchronous_commit is OFF - do not wait at any point
>

Thank you for explain. Understood.
if synchronous_transfer is set 'all' and user changes
synchronous_commit to 'off'( or 'local') at a transaction,
the master server wait at buffer flush, but doesn't wait at commit
points. Right?

In currently patch, synchronous_transfer works in cooperation with
synchronous_commit.
But if user changes synchronous_commit at a transaction, they are not
in cooperation.
So, your idea might be better than currently behaviour of synchronous_transfer.

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

04 October 2013, 04:46:44

On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com>
>> wrote:
>>>
>>>
>>> >
>>>
>>> Thank you for comment. I think it is good simple idea.
>>> In your opinion, if synchronous_transfer is set 'all' and
>>> synchronous_commit is set 'on',
>>> the master wait for data flush eve if user sets synchronous_commit to
>>> 'local' or 'off'.
>>> For example, when user want to do transaction early, user can't do this.
>>> we leave the such situation as constraint?
>>>
>>
>> No, user can still override the transaction commit point wait. So if
>>
>> synchronous_transfer is set to "all":
>>  - If synchronous_commit is ON - wait at all points
>>  - If synchronous_commit is OFF - wait only at buffer flush (and other
>> related to failback safety) points
>>
>> synchronous_transfer is set to "data_flush":
>>  - If synchronous_commit is either ON o OFF - do not wait at commit points,
>> but wait at all other points
>>
>> synchronous_transfer is set to "commit":
>>  - If synchronous_commit is ON - wait at commit point
>>  - If synchronous_commit is OFF - do not wait at any point
>>
>
> Thank you for explain. Understood.
> if synchronous_transfer is set 'all' and user changes
> synchronous_commit to 'off'( or 'local') at a transaction,
> the master server wait at buffer flush, but doesn't wait at commit
> points. Right?
>
> In currently patch, synchronous_transfer works in cooperation with
> synchronous_commit.
> But if user changes synchronous_commit at a transaction, they are not
> in cooperation.
> So, your idea might be better than currently behaviour of synchronous_transfer.

I attached the v11 patch which have fixed following contents.
- synchronous_transfer controls to wait at only data flush level,
  synchronous_commit controls to wait at commit level. ( Based on
Pavan suggestion)
- If there are no sync replication standby name, both
synchronous_commit and synchronous_transfer
  don't work.
- Fixed that we didn't  support failback-safe standby.
  Previous patch can not support failback-safe standby. Because the
patch doesn't wait at FlushBuffer
  which is called by autovacuum.

So, if user want to do transaction early temporarily, user need to
change the synchronous_transfer value and reload
postgresql.conf.

Regards,

-------
Sawada Masahiko

Attachment

synchronous_transfer_v11.patch

Re: Patch for fail-back without fresh backup

From

Fujii Masao

Date:

04 October 2013, 07:32:25

On Fri, Oct 4, 2013 at 1:46 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 27, 2013 at 6:44 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:
>> On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
>> <pavan.deolasee@gmail.com> wrote:
>>> On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko <sawada.mshk@gmail.com>
>>> wrote:
>>>>
>>>>
>>>> >
>>>>
>>>> Thank you for comment. I think it is good simple idea.
>>>> In your opinion, if synchronous_transfer is set 'all' and
>>>> synchronous_commit is set 'on',
>>>> the master wait for data flush eve if user sets synchronous_commit to
>>>> 'local' or 'off'.
>>>> For example, when user want to do transaction early, user can't do this.
>>>> we leave the such situation as constraint?
>>>>
>>>
>>> No, user can still override the transaction commit point wait. So if
>>>
>>> synchronous_transfer is set to "all":
>>>  - If synchronous_commit is ON - wait at all points
>>>  - If synchronous_commit is OFF - wait only at buffer flush (and other
>>> related to failback safety) points
>>>
>>> synchronous_transfer is set to "data_flush":
>>>  - If synchronous_commit is either ON o OFF - do not wait at commit points,
>>> but wait at all other points
>>>
>>> synchronous_transfer is set to "commit":
>>>  - If synchronous_commit is ON - wait at commit point
>>>  - If synchronous_commit is OFF - do not wait at any point
>>>
>>
>> Thank you for explain. Understood.
>> if synchronous_transfer is set 'all' and user changes
>> synchronous_commit to 'off'( or 'local') at a transaction,
>> the master server wait at buffer flush, but doesn't wait at commit
>> points. Right?
>>
>> In currently patch, synchronous_transfer works in cooperation with
>> synchronous_commit.
>> But if user changes synchronous_commit at a transaction, they are not
>> in cooperation.
>> So, your idea might be better than currently behaviour of synchronous_transfer.
>
> I attached the v11 patch which have fixed following contents.

You added several checks into SyncRepWaitForLSN() so that it can handle both
synchronous_transfer=data_flush and =commit. This change made the source code
of the function very complicated, I'm afraid. To simplify the source code,
what about just adding new wait-for-lsn function for data_flush instead of
changing SyncRepWaitForLSN()? Obviously that new function and
SyncRepWaitForLSN()
have the common part. I think that it should be extracted as separate function.

+     * Note that if sync transfer is requested, we can't regular
maintenance until
+     * standbys to connect.     */
-    if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+    if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH &&
!SyncTransRequested())

Per discussion with Pavan, ISTM we don't need to avoid setting
synchronous_commit
to local even if synchronous_transfer is data_flush. But you did that here. Why?

When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked.
Is this safe?

+#synchronous_transfer = commit    # data page synchronization level
+                # commit, data_flush or all

This comment seems confusing. I think that this parameter specifies when to
wait for replication.

+typedef enum
+{
+    SYNCHRONOUS_TRANSFER_COMMIT,        /* no wait for flush data page */
+    SYNCHRONOUS_TRANSFER_DATA_FLUSH,    /* wait for data page flush only
+                                         * no wait for WAL */
+    SYNCHRONOUS_TRANSFER_ALL            /* wait for data page flush and WAL*/
+}    SynchronousTransferLevel;

These comments also seem confusing. For example, I think that the meaning of
SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on
transaction commit".

@@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)         */        XLogFlush(lsn);

+        /*
+         * If synchronous transfer is requested, wait for failback safe standby
+         * to receive WAL up to lsn.
+         */
+        if (SyncTransRequested())
+            SyncRepWaitForLSN(lsn, true, true);

If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't need
to be called here.

Regards,

-- 
Fujii Masao

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

08 October 2013, 09:03:34

On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> You added several checks into SyncRepWaitForLSN() so that it can handle both
> synchronous_transfer=data_flush and =commit. This change made the source code
> of the function very complicated, I'm afraid. To simplify the source code,
> what about just adding new wait-for-lsn function for data_flush instead of
> changing SyncRepWaitForLSN()? Obviously that new function and
> SyncRepWaitForLSN()
> have the common part. I think that it should be extracted as separate function.

Thank you for reviewing and comment!

yes I agree with you.
I attached the v12 patch which have modified based on above suggestions.
- Added new function SyncRepTransferWaitForLSN() and SyncRepWait()
  SyncRepTransferWaitForLSN() is called on date page flush. OTOH,
SyncRepWatiForLSN() is called on transaction commit.
  And both functions call the SyncRepWait() after check whether sync
commit/transfer is requested.
  Practically server will waits at SyncRepWait().

>
> +     * Note that if sync transfer is requested, we can't regular
> maintenance until
> +     * standbys to connect.
>       */
> -    if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
> +    if (synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH &&
> !SyncTransRequested())
>
> Per discussion with Pavan, ISTM we don't need to avoid setting
> synchronous_commit
> to local even if synchronous_transfer is data_flush. But you did that here. Why?

I made a mistake. I have removed it.

>
> When synchronous_transfer = data_flush, anti-wraparound vacuum can be blocked.
> Is this safe?
>

In the new version patch, when synchronous_transfer = data_flush/all
AND synchronous_standby_names is set,
vacuum is blocked.
This behaviour of synchronous_transfer similar to synchronous_commit.
Should this allow to do anti-wraparound vacuum even if
synchronous_transfer = data_flush/all?
If so, it also allow to flush the data page while doing vacuum?

> +#synchronous_transfer = commit    # data page synchronization level
> +                # commit, data_flush or all
>
> This comment seems confusing. I think that this parameter specifies when to
> wait for replication.
>
> +typedef enum
> +{
> +    SYNCHRONOUS_TRANSFER_COMMIT,        /* no wait for flush data page */
> +    SYNCHRONOUS_TRANSFER_DATA_FLUSH,    /* wait for data page flush only
> +                                         * no wait for WAL */
> +    SYNCHRONOUS_TRANSFER_ALL            /* wait for data page flush and WAL*/
> +}    SynchronousTransferLevel;
>
> These comments also seem confusing. For example, I think that the meaning of
> SYNCHRONOUS_TRANSFER_COMMIT is something like "wait for replication on
> transaction commit".

Those comments are modified in new patch.

>
> @@ -521,6 +531,13 @@ smgr_redo(XLogRecPtr lsn, XLogRecord *record)
>           */
>          XLogFlush(lsn);
>
> +        /*
> +         * If synchronous transfer is requested, wait for failback safe standby
> +         * to receive WAL up to lsn.
> +         */
> +        if (SyncTransRequested())
> +            SyncRepWaitForLSN(lsn, true, true);
>
> If smgr_redo() is called only during recovery, SyncRepWaitForLSN() doesn't need
> to be called here.

Thank you for info.
I have removed it at smgr_redo().

Regards,

-------
Sawada Masahiko

Attachment

synchronous_transfer_v12.patch

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

08 October 2013, 09:37:26

On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
I attached the v12 patch which have modified based on above suggestions.

There are still some parts of this design/patch which I am concerned about.

1. The design clubs synchronous standby and failback safe standby rather very tightly. IIRC this is based on the feedback you received early, so my apologies for raising it again so late.

a. GUC synchrnous_standby_names is used to name synchronous as well as failback safe standbys. I don't know if that will confuse users.

b. synchronous_commit's value will also control whether a sync/async failback safe standby wait for remote write or flush. Is that reasonable ? Or should there be a different way to configure the failback safe standby's WAL safety ?

2. With the current design/implementation, user can't configure a synchronous and an async failback safe standby at the same time. I think we discussed this earlier and there was an agreement on the limitation. Just wanted to get that confirmed again.

3. SyncRepReleaseWaiters() does not know whether its waking up backends waiting for sync rep or failback safe rep. Is that ok ? For example, I found that the elog() message announcing next takeover emitted by the function may look bad. Since changing synchronous_transfer requires server restart, we can teach SyncRepReleaseWaiters() to look at that parameter to figure out whether the standby is sync and/or failback safe standby.

4. The documentation still need more work to clearly explain the use case.

5. Have we done any sort of stress testing of the patch ? If there is a bug, the data corruption at the master can go unnoticed. So IMHO we need many crash recovery tests to ensure that the patch is functionally correct.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

08 October 2013, 09:46:16

On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
> 
> > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > >
> > I attached the v12 patch which have modified based on above suggestions.
> >
> 
> There are still some parts of this design/patch which I am concerned about.
> 
> 1. The design clubs synchronous standby and failback safe standby rather
> very tightly. IIRC this is based on the feedback you received early, so my
> apologies for raising it again so late.

It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

08 October 2013, 10:01:16

On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.

IIRC you and Tom were particularly skeptical about the approach. But do you see a technical flaw or a show stopper with the approach ? Heikki has written pg_rewind which is really very cool. But it fails to handle the hint bit updates which are not WAL logged unless of course checksums are turned on. We can have a GUC controlled option to turn WAL logging on for hint bit updates and then use pg_rewind for the purpose. But I did not see any agreement on that either. Performance implication of WAL logging every hint bit update could be huge.

Simon has raised usability concerns that Sawada-san and Samrat have tried to address by following his suggestions. I am not fully convinced though we have got that right. But then there is hardly any feedback on that aspect lately.

In general, from the discussion it seems that the patch is trying to solve a real problem. Even though Tom and you feel that rsync is probably good enough and more trustworthy than any other approach, my feeling is that many including Fujii-san still disagree with that argument based on real user feedback. So where do we go from here ? I think it will really help Sawada-san and Samrat if we can provide them some solid feedback and approach to take.

Lately, I was thinking if we could do something else to track file system updates without relying on WAL inspection and then use pg_rewind to solve this problem. Some sort of prelaod library mechanism is one such possibility. But haven't really thought through this entirely.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

08 October 2013, 15:52:14

On 08.10.2013 13:00, Pavan Deolasee wrote:
> On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund<andres@2ndquadrant.com>wrote:
>
>> It is my impression that there still are several people having pretty
>> fundamental doubts about this approach in general. From what I remember
>> neither Heikki, Simon, Tom nor me were really convinced about this
>> approach.
>>
> IIRC you and Tom were particularly skeptical about the approach. But do you
> see a technical flaw or a show stopper with the approach ? Heikki has
> written pg_rewind which is really very cool. But it fails to handle the
> hint bit updates which are not WAL logged unless of course checksums are
> turned on. We can have a GUC controlled option to  turn WAL logging on for
> hint bit updates and then use pg_rewind for the purpose. But I did not see
> any agreement on that either. Performance implication of WAL logging every
> hint bit update could be huge.

Yeah, I definitely think we should work on the pg_rewind approach 
instead of this patch. It's a lot more flexible. The performance hit of 
WAL-logging hint bit updates is the price you have to pay, but a lot of 
people were OK with that to get page checksum, so I think a lot of 
people would be OK with it for this purpose too. As long as it's 
optional, of course. And anyone using page checksums are already paying 
that price.

- Heikki

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

09 October 2013, 06:30:43

On Tue, Oct 8, 2013 at 6:37 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>
> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>
> wrote:
>>
>> On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> >
>> I attached the v12 patch which have modified based on above suggestions.
>
>
> There are still some parts of this design/patch which I am concerned about.
>
> 1. The design clubs synchronous standby and failback safe standby rather
> very tightly. IIRC this is based on the feedback you received early, so my
> apologies for raising it again so late.
>  a. GUC synchrnous_standby_names is used to name synchronous as well as
> failback safe standbys. I don't know if that will confuse users.

With currently the patch, user can specify the failback safe standby
and sync replication standby at same server.
synchronous_standby_names
So I was thinking that it will not confuse users.

>  b. synchronous_commit's value will also control whether a sync/async
> failback safe standby wait for remote write or flush. Is that reasonable ?
> Or should there be a different way to configure the failback safe standby's
> WAL safety ?

synchronous_commit's values can not control whether sync sync/async
failback safe standby wait level.
On data page flush, failback safe standby waits for only flush.
Should we also allow to wait for remote write?

>
> 2. With the current design/implementation, user can't configure a
> synchronous and an async failback safe standby at the same time. I think we
> discussed this earlier and there was an agreement on the limitation. Just
> wanted to get that confirmed again.
>

yes, user can't configure sync standby and async failback safe standby
at the same time.
The currently patch supports following cases.
- sync standby and make same as failback safe standby
- async standby and make same as failback safe standby

> 3. SyncRepReleaseWaiters() does not know whether its waking up backends
> waiting for sync rep or failback safe rep. Is that ok ? For example, I found
> that the elog() message announcing next takeover emitted by the function may
> look bad. Since changing synchronous_transfer requires server restart, we
> can teach SyncRepReleaseWaiters() to look at that parameter to figure out
> whether the standby is sync and/or failback safe standby.

I agree with you.
Are you saying about following comment?
if (announce_next_takeover)
{   announce_next_takeover = false;   ereport(LOG,           (errmsg("standby \"%s\" is now the synchronous standby
with priority %u",                   application_name,
MyWalSnd->sync_standby_priority)));
}

>
> 4. The documentation still need more work to clearly explain the use case.

Understood. we will more work to clearly explain the use case.

>
> 5. Have we done any sort of stress testing of the patch ? If there is a bug,
> the data corruption at the master can go unnoticed. So IMHO we need many
> crash recovery tests to ensure that the patch is functionally correct.
>

I have done several testing of the patch. And I have confirmed that
data page is not flushed to disk
when the master server has not been receive the reply from the standby
server. I used pg_filedump.
To ensure that the patch is functionally correct, what test should we do?

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

09 October 2013, 06:46:32

On Tue, Oct 8, 2013 at 6:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
>> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
>>
>> > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> > >
>> > I attached the v12 patch which have modified based on above suggestions.
>> >
>>
>> There are still some parts of this design/patch which I am concerned about.
>>
>> 1. The design clubs synchronous standby and failback safe standby rather
>> very tightly. IIRC this is based on the feedback you received early, so my
>> apologies for raising it again so late.
>
> It is my impression that there still are several people having pretty
> fundamental doubts about this approach in general. From what I remember
> neither Heikki, Simon, Tom nor me were really convinced about this
> approach.
>

Thank you for comment.
We are thinking that this approach can solve the real problem.
Actually we have confirm the effect of this approach. The master
server flushes data page to disk
after the master server received reply from the standby server.
If you have concern or doubt in technical side, Could you tell me it?

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Samrat Revagade

Date:

09 October 2013, 08:54:10

On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
> On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko <sawada.mshk@gmail.com>wrote:
>
> > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > >
> > I attached the v12 patch which have modified based on above suggestions.
> >
>
> There are still some parts of this design/patch which I am concerned about.
>
> 1. The design clubs synchronous standby and failback safe standby rather
> very tightly. IIRC this is based on the feedback you received early, so my
> apologies for raising it again so late.

It is my impression that there still are several people having pretty
fundamental doubts about this approach in general. From what I remember
neither Heikki, Simon, Tom nor me were really convinced about this
approach.

Listing down all objections and their solutions:

Major Objection on the proposal:
* Tom Lane*
# additional complexity to the code it will cause performance overhead - On an average it causes 0.5 - 1% performance overhead for fast transaction workload, as the wait is mostly on backend process. The latest re-factored code, looks less complex.
# Use of rsync with checksum - but many pages on the two servers may differ in their binary values because of hint bits

*Heikki :*
# Use pg_rewind to do the same:
It has well known problem of hint bit updates.
If we use this we need enable checksums or explicitly WAL log hint bits which leads to performance overhead

*Amit Kapila*
# How to take care of extra WAL on old master during recovery.?
we can solve this by deleting all WAL file when old master before it starts as new standby.

*Simon Riggs*
# Renaming patch - done
# remove extra set of parameters - done
# performance drop - On an average it causes 0.5 - 1% performance overhead for fast transaction workload, as the wait is mostly on backend process.
# The way of configuring standby - with synchronous_transfer parameter we can configure 4 types of standby servers depending on the need.

*Fujii Masao*
# how patch interacts with cascaded standby - patch works same as synchronous replication
# CHECKPOINT in the standby, it got stuck infinitely. - fixed this
# Complicated conditions in SyncRepWaitForLSN() – code has been refactored in v11
# Improve source code comments - done

*Pavan Deolasee*
# Interaction of synchronous_commit with synchronous_transfer - Now synchronous_commit only controls whether and how
to wait for the standby only when a transaction commits. synchronous_transfer OTOH tells how to interpret the standby listed in
synchronous_standbys parameter.
# Further Improvements in the documentation - we will do that
# More stress testing - we will do that

Any inputs on stress testing would help.

Re: Patch for fail-back without fresh backup

From

Robert Haas

Date:

09 October 2013, 17:45:09

On Wed, Oct 9, 2013 at 4:54 AM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:
> On Tue, Oct 8, 2013 at 3:16 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
>> On 2013-10-08 15:07:02 +0530, Pavan Deolasee wrote:
>> > On Tue, Oct 8, 2013 at 2:33 PM, Sawada Masahiko
>> > <sawada.mshk@gmail.com>wrote:
>> >
>> > > On Fri, Oct 4, 2013 at 4:32 PM, Fujii Masao <masao.fujii@gmail.com>
>> > > wrote:
>> > > >
>> > > I attached the v12 patch which have modified based on above
>> > > suggestions.
>> > >
>> >
>> > There are still some parts of this design/patch which I am concerned
>> > about.
>> >
>> > 1. The design clubs synchronous standby and failback safe standby rather
>> > very tightly. IIRC this is based on the feedback you received early, so
>> > my
>> > apologies for raising it again so late.
>>
>> It is my impression that there still are several people having pretty
>> fundamental doubts about this approach in general. From what I remember
>> neither Heikki, Simon, Tom nor me were really convinced about this
>> approach.
>
>
>
> Listing down all objections and their solutions:
>
> Major Objection on the proposal:
> * Tom Lane*
> # additional complexity to the code it will cause performance overhead - On
> an average it causes 0.5 - 1% performance overhead for fast transaction
> workload, as the wait is mostly on backend process. The latest re-factored
> code, looks less complex.
> # Use of rsync with checksum - but many pages on the two servers may differ
> in their binary values because of hint bits
>
> *Heikki :*
> # Use pg_rewind to do the same:
> It has well known problem of hint bit updates.
> If we use this we need enable checksums or explicitly WAL log hint bits
> which leads to performance overhead
>
> *Amit Kapila*
> # How to take care of extra WAL on old master during recovery.?
> we can solve this by deleting all WAL file when old master before it starts
> as new standby.
>
> *Simon Riggs*
> # Renaming patch - done
> # remove extra set of parameters - done
> # performance drop - On an average it causes 0.5 - 1% performance overhead
> for fast transaction workload, as the wait is mostly on backend process.
> # The way of configuring standby - with synchronous_transfer parameter we
> can configure 4 types of standby servers depending on the need.
>
> *Fujii Masao*
> # how patch interacts with cascaded standby - patch works same as
> synchronous replication
> # CHECKPOINT in the standby, it got stuck infinitely. - fixed this
> # Complicated conditions in SyncRepWaitForLSN() – code has been refactored
> in v11
> # Improve source code comments - done
>
> *Pavan Deolasee*
>  # Interaction of synchronous_commit with synchronous_transfer - Now
> synchronous_commit only controls whether and how
>
> to wait for the standby only when a transaction commits.
> synchronous_transfer OTOH tells how to interpret the standby listed in
> synchronous_standbys parameter.
>  # Further Improvements in the documentation - we will do that
>  # More stress testing - we will do that
>
> Any inputs on stress testing would help.

The point is that when there are at least four senior community
members expressing serious objections to a concept, three of whom are
committes, we shouldn't be considering committing it until at least
some of those people have withdrawn your objections.  Nearly all patch
submitters are in favor of their own patches; that does not entitle
them to have those patches, committed even if there is a committer who
agrees with them.  There needs to be a real consensus on the path
forward.  If that policy ever changes, I have my own list of things
that are on the cutting-room floor that I'll be happy to resurrect.

Personally, I don't have a strong opinion on this patch because I have
not followed it closely enough.  But if Tom, Heikki, Simon, and Andres
are all unconvinced that this is a good direction, then put me down
for a -1 vote as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

10 October 2013, 04:41:50

On Tue, Oct 8, 2013 at 9:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Yeah, I definitely think we should work on the pg_rewind approach instead of this patch. It's a lot more flexible. The performance hit of WAL-logging hint bit updates is the price you have to pay, but a lot of people were OK with that to get page checksum, so I think a lot of people would be OK with it for this purpose too. As long as it's optional, of course. And anyone using page checksums are already paying that price.

Not that I can find any flaw in the OP's patch, but given the major objections and my own nervousness about documenting this new "failback safe" standby mode, I am also inclining to improve pg_rewind or whatever it takes to get it working. Clearly at first we need to have an optional mechanism to WAL log hint bit updates. There seems to be two ways to do that:

a. Add a new GUC which can turned on/off and requires server restart to take effect

b. Add another option for wal_level setting.

(b) looks better, but I am not sure if we want to support this new level with and without hot standby. If latter, we will need multiple new levels to differentiate all those cases. I am OK with supporting it only with hot standby which is probably what most people do with streaming replication anyway.

The other issue is to how to optimally WAL log hint bit updates:

a. Should we have separate WAL records just for the purpose or should we piggyback them on heap update/delete/prune etc WAL records ? Of course, there will be occasions when a simple SELECT also updates hint bits, so most likely we will need a separate WAL record anyhow.

b. Does it make sense to try to all hint bits in a page if we are WAL logging it anyways ? I think we have discussed this idea even before just to minimize the number of writes a heap page receives when hint bits of different tuples are set at different times, each update triggering a fresh write. I don't remember whats the consensus for that, but it might be worthwhile to reconsider that option if we are WAL logging the hint bit updates.

We will definitely need some amount of performance benchmarks even if this is optional. But are there other things to worry about ? Any strong objections to this idea or any other stow stopper for pg_rewind itself ?

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

21 October 2013, 13:40:50

On Thu, Oct 10, 2013 at 1:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:
>
>
> Not that I can find any flaw in the OP's patch, but given the major
> objections and my own nervousness about documenting this new "failback safe"
> standby mode, I am also inclining to improve pg_rewind or whatever it takes
> to get it working. Clearly at first we need to have an optional mechanism to
> WAL log hint bit updates. There seems to be two ways to do that:
>
> a. Add a new GUC which can turned on/off and requires server restart to take
> effect
> b. Add another option for wal_level setting.
>
> (b) looks better, but I am not sure if we want to support this new level
> with and without hot standby. If latter, we will need multiple new levels to
> differentiate all those cases. I am OK with supporting it only with hot
> standby which is probably what most people do with streaming replication
> anyway.
>
> The other issue is to how to optimally WAL log hint bit updates:
>
> a. Should we have separate WAL records just for the purpose or should we
> piggyback them on heap update/delete/prune etc WAL records ? Of course,
> there will be occasions when a simple SELECT also updates hint bits, so most
> likely we will need a separate WAL record anyhow.
> b. Does it make sense to try to all hint bits in a page if we are WAL
> logging it anyways ? I think we have discussed this idea even before just to
> minimize the number of writes a heap page receives when hint bits of
> different tuples are set at different times, each update triggering a fresh
> write. I don't remember whats the consensus for that, but it might be
> worthwhile to reconsider that option if we are WAL logging the hint bit
> updates.

I agree with you.
If writing FPW is not large performance degradation, it is just idea
that we can use to write FPW in same timing as checksum enabled.
i.g., if we support new wal_level, the system writes FPW when a simple
SELECT updates hint bits. but checksum function is disabled.
Thought?

>
> We will definitely need some amount of performance benchmarks even if this
> is optional. But are there other things to worry about ? Any strong
> objections to this idea or any other stow stopper for pg_rewind itself ?
>


-- 
Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

24 October 2013, 10:02:51

On Mon, Oct 21, 2013 at 7:10 PM, Sawada Masahiko <sawada.mshk@gmail.com> wrote:

I agree with you.
If writing FPW is not large performance degradation, it is just idea
that we can use to write FPW in same timing as checksum enabled.
i.g., if we support new wal_level, the system writes FPW when a simple
SELECT updates hint bits. but checksum function is disabled.
Thought?

I wonder if its too much for this purpose. In fact, we just need a way to know that a block could have been written on the master which the standby never saw. So even WAL logging just the block id should be good enough for pg_rewind to be able to detect and later copy that block from the new master. Having said that, I don't know if there is general advantage of WAL logging the exact hint bit update operation for other reasons.

Another difference AFAICS is that checksum feature needs the block to be backed up only after the first time a hint bit is updated after checkpoint. But for something like pg_rewind to work, we will need to WAL log every hint bit update on a page. So we would want to keep it as short as possible.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

24 October 2013, 10:52:37

On 24.10.2013 13:02, Pavan Deolasee wrote:
> Another difference AFAICS is that checksum feature needs the block to be
> backed up only after the first time a hint bit is updated after checkpoint.
> But for something like pg_rewind to work, we will need to WAL log every
> hint bit update on a page. So we would want to keep it as short as possible.

To fix that, pg_rewind could always start the rewinding process from the 
last checkpoint before the point that the histories diverge, instead of 
the exact point of divergence. That would make the rewinding more 
expensive as it needs to read through a lot more WAL, but I think it 
would still be OK.

- Heikki

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

24 October 2013, 11:16:12

On Thu, Oct 24, 2013 at 4:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

.

To fix that, pg_rewind could always start the rewinding process from the last checkpoint before the point that the histories diverge, instead of the exact point of divergence.

Is that something required even if someone plans to use pg_rewind for a cluster with checksums enabled ? I mean since only first update after checkpoint is WAL logged, pg_rewind will break if another update happens after standby forks. Or would the recovery logic apply first WAL without looking at the page lsn ? (Sorry, may be I should read the code instead of asking you)

If we do what you are suggesting, it seems like a single line patch to me. In XLogSaveBufferForHint(), we probably need to look at this additional GUC to decide whether or not to backup the block.

That would make the rewinding more expensive as it needs to read through a lot more WAL, but I think it would still be OK.

Yeah, probably you are right. Though the amount of additional work could be significantly higher and some testing might be warranted.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

24 October 2013, 12:03:48

On Thu, Oct 24, 2013 at 4:45 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

. Or would the recovery logic apply first WAL without looking at the page lsn ? (Sorry, may be I should read the code instead of asking you)

Never mind. I realized it has to. That's the whole purpose of backing it up in the first place.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

24 October 2013, 12:16:09

On 24.10.2013 14:15, Pavan Deolasee wrote:
> On Thu, Oct 24, 2013 at 4:22 PM, Heikki Linnakangas<hlinnakangas@vmware.com
>> wrote:
>
>> To fix that, pg_rewind could always start the rewinding process from the
>> last checkpoint before the point that the histories diverge, instead of the
>> exact point of divergence.
>
> Is that something required even if someone plans to use pg_rewind for a
> cluster with checksums enabled ? I mean since only first update after
> checkpoint is WAL logged, pg_rewind will break if another update happens
> after standby forks.

Yes. It's broken as it is, even when checksums are enabled - good catch. 
I'll go change it to read all the WAL in the target starting from the 
last checkpoint before the point of divergence.

> Or would the recovery logic apply first WAL without
> looking at the page lsn ? (Sorry, may be I should read the code instead of
> asking you)

WAL recovery does apply all the full-page images without looking at the 
page LSN, but that doesn't help in this case. pg_rewind copies over the 
blocks from the source server (= promoted standby) that were changed in 
the target server (= old master), after the standby's history diverged 
from it. In other words, it reverts the blocks that were changed in the 
old master, by copying them over from the promoted standby. After that, 
WAL recovery is performed, using the WAL from the promoted standby, to 
apply all the changes from the promoted standby that were not present in 
the old master. But it never replays any WAL from the old master. It 
reads it through, to construct the list of blocks that were modified, 
but it doesn't apply them.

> If we do what you are suggesting, it seems like a single line patch to me.
> In XLogSaveBufferForHint(), we probably need to look at this additional GUC
> to decide whether or not to backup the block.

Yeah, it's trivial to add such a guc. Will just have to figure out what 
we want the user interface to be like; should it be a separate guc, or 
somehow cram it into wal_level?

- Heikki

Re: Patch for fail-back without fresh backup

From

Pavan Deolasee

Date:

24 October 2013, 12:41:28

On Thu, Oct 24, 2013 at 5:45 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Will just have to figure out what we want the user interface to be like; should it be a separate guc, or somehow cram it into wal_level?

Yeah, I had brought up similar idea up thread. Right now wal_level is nicely ordered. But with this additional logic, I am not sure if we would need multiple new levels and also break that ordering (I don't know if its important). For example, one may want to set up streaming replication with/without this feature or hot standby with/without the feature. I don't have a good idea about how to capture them in wal_level. May be something like: minimal, archive, archive_with_this_new_feature, hot_standby and hot_standby_with_this_new_feature.

Thanks,

Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: Patch for fail-back without fresh backup

From

Alvaro Herrera

Date:

24 October 2013, 14:32:00

Pavan Deolasee escribió:

> Yeah, I had brought up similar idea up thread. Right now wal_level is
> nicely ordered. But with this additional logic, I am not sure if we would
> need multiple new levels and also break that ordering (I don't know if its
> important). For example, one may want to set up streaming replication
> with/without this feature or hot standby with/without the feature. I don't
> have a good idea about how to capture them in wal_level. May be something
> like: minimal, archive, archive_with_this_new_feature, hot_standby and
> hot_standby_with_this_new_feature.

That's confusing.  A separate GUC sounds better.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Josh Berkus

Date:

24 October 2013, 17:39:40

On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
> If we do what you are suggesting, it seems like a single line patch to me.
> In XLogSaveBufferForHint(), we probably need to look at this additional GUC
> to decide whether or not to backup the block.

Wait, what?  Why are we having an additional GUC?

I'm opposed to the idea of having a GUC to enable failback.  When would
anyone using replication ever want to disable that?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

24 October 2013, 18:12:39

On 24.10.2013 20:39, Josh Berkus wrote:
> On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
>> If we do what you are suggesting, it seems like a single line patch to me.
>> In XLogSaveBufferForHint(), we probably need to look at this additional GUC
>> to decide whether or not to backup the block.
>
> Wait, what?  Why are we having an additional GUC?
>
> I'm opposed to the idea of having a GUC to enable failback.  When would
> anyone using replication ever want to disable that?

For example, if you're not replicating for high availability purposes, 
but to keep a reporting standby up-to-date.

- Heikki

Re: Patch for fail-back without fresh backup

From

Josh Berkus

Date:

24 October 2013, 20:08:09

On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
> On 24.10.2013 20:39, Josh Berkus wrote:
>> On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
>>> If we do what you are suggesting, it seems like a single line patch
>>> to me.
>>> In XLogSaveBufferForHint(), we probably need to look at this
>>> additional GUC
>>> to decide whether or not to backup the block.
>>
>> Wait, what?  Why are we having an additional GUC?
>>
>> I'm opposed to the idea of having a GUC to enable failback.  When would
>> anyone using replication ever want to disable that?
> 
> For example, if you're not replicating for high availability purposes,
> but to keep a reporting standby up-to-date.

What kind of overhead are we talking about here?  You probably said, but
I've had a mail client meltdown and lost a lot of my -hackers emails.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Patch for fail-back without fresh backup

From

Heikki Linnakangas

Date:

24 October 2013, 20:14:35

On 24.10.2013 23:07, Josh Berkus wrote:
> On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
>> On 24.10.2013 20:39, Josh Berkus wrote:
>>> On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
>>>> If we do what you are suggesting, it seems like a single line patch
>>>> to me.
>>>> In XLogSaveBufferForHint(), we probably need to look at this
>>>> additional GUC
>>>> to decide whether or not to backup the block.
>>>
>>> Wait, what?  Why are we having an additional GUC?
>>>
>>> I'm opposed to the idea of having a GUC to enable failback.  When would
>>> anyone using replication ever want to disable that?
>>
>> For example, if you're not replicating for high availability purposes,
>> but to keep a reporting standby up-to-date.
>
> What kind of overhead are we talking about here?  You probably said, but
> I've had a mail client meltdown and lost a lot of my -hackers emails.

One extra WAL record whenever a hint bit is set on a page, for the first 
time after a checkpoint. In other words, a WAL record needs to be 
written in the same circumstances as with page checksums, but the WAL 
records are much smaller as they don't need to contain a full page 
image, just the block number of the changed block.

Or maybe we'll write the full page image after all, like with page 
checksums, just without calculating the checksums. It might be tricky to 
skip the full-page image, because then a subsequent change of the page 
(which isn't just a hint-bit update) needs to somehow know it needs to 
take a full page image even though a WAL record for it was already written.

- Heikki

Re: Patch for fail-back without fresh backup

From

Josh Berkus

Date:

24 October 2013, 20:52:01

On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
> One extra WAL record whenever a hint bit is set on a page, for the first
> time after a checkpoint. In other words, a WAL record needs to be
> written in the same circumstances as with page checksums, but the WAL
> records are much smaller as they don't need to contain a full page
> image, just the block number of the changed block.
> 
> Or maybe we'll write the full page image after all, like with page
> checksums, just without calculating the checksums. It might be tricky to
> skip the full-page image, because then a subsequent change of the page
> (which isn't just a hint-bit update) needs to somehow know it needs to
> take a full page image even though a WAL record for it was already written.

I think it would be worth estimating what this actually looks like in
terms of log write quantity.  My inclication is to say that if it
increases log writes less than 10%, we don't need to provide an option
to turn it off.

The reasons I don't want to provide a disabling GUC are:
a) more GUCs
b) confusing users
c) causing users to disable rewind *until they need it*, at which point
it's too late to enable it.

So if there's any way we can avoid having a GUC for this, I'm for it.
And if we do have a GUC, failback should be enabled by default.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

Re: Patch for fail-back without fresh backup

From

Magnus Hagander

Date:

24 October 2013, 20:57:35

On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
>> One extra WAL record whenever a hint bit is set on a page, for the first
>> time after a checkpoint. In other words, a WAL record needs to be
>> written in the same circumstances as with page checksums, but the WAL
>> records are much smaller as they don't need to contain a full page
>> image, just the block number of the changed block.
>>
>> Or maybe we'll write the full page image after all, like with page
>> checksums, just without calculating the checksums. It might be tricky to
>> skip the full-page image, because then a subsequent change of the page
>> (which isn't just a hint-bit update) needs to somehow know it needs to
>> take a full page image even though a WAL record for it was already written.
>
> I think it would be worth estimating what this actually looks like in
> terms of log write quantity.  My inclication is to say that if it
> increases log writes less than 10%, we don't need to provide an option
> to turn it off.
>
> The reasons I don't want to provide a disabling GUC are:
> a) more GUCs
> b) confusing users
> c) causing users to disable rewind *until they need it*, at which point
> it's too late to enable it.
>
> So if there's any way we can avoid having a GUC for this, I'm for it.
> And if we do have a GUC, failback should be enabled by default.

+1 on the principle.

In fact I've been considering suggesting we might want to retire the
difference between archive and hot_standby as wal_level, because the
difference is usually so small. And the advantage of hot_standby is in
almost every case worth it. Even in the archive recovery mode, being
able to do pause_at_recovery_target is extremely useful. And as you
say in (c) above, many users don't realize that until it's too late.

-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

25 October 2013, 02:14:43

On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Oct 24, 2013 at 10:51 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
>>
>> I think it would be worth estimating what this actually looks like in
>> terms of log write quantity.  My inclication is to say that if it
>> increases log writes less than 10%, we don't need to provide an option
>> to turn it off.
>>
>> The reasons I don't want to provide a disabling GUC are:
>> a) more GUCs
>> b) confusing users
>> c) causing users to disable rewind *until they need it*, at which point
>> it's too late to enable it.
>>
>> So if there's any way we can avoid having a GUC for this, I'm for it.
>> And if we do have a GUC, failback should be enabled by default.
>
> +1 on the principle.
>
> In fact I've been considering suggesting we might want to retire the
> difference between archive and hot_standby as wal_level, because the
> difference is usually so small. And the advantage of hot_standby is in
> almost every case worth it. Even in the archive recovery mode, being
> able to do pause_at_recovery_target is extremely useful. And as you
> say in (c) above, many users don't realize that until it's too late.
>

+1.

Many user would not realize that it is too late If we will provide it
as additional GUC.
And I agree with writing log including the block number of the changed block.
I think that writing log is not lead huge overhead increase.
Is those WAL record replicated to the standby server in synchronous (
of course when configuring sync replication)?
I am concerned that it lead performance overhead with such as
executing SELECT or auto vacuum. especially, when two servers are in
far location.


Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Michael Paquier

Date:

25 October 2013, 06:16:34

On Fri, Oct 25, 2013 at 5:57 AM, Magnus Hagander <magnus@hagander.net> wrote:
> In fact I've been considering suggesting we might want to retire the
> difference between archive and hot_standby as wal_level, because the
> difference is usually so small. And the advantage of hot_standby is in
> almost every case worth it. Even in the archive recovery mode, being
> able to do pause_at_recovery_target is extremely useful. And as you
> say in (c) above, many users don't realize that until it's too late.
+1 on removing archive from wal_level. Having both archive and
hot_standby for wal_level is confusing, and if I recall correctly
hot_standby and archive have been kept as possible settings only to
protect people from bugs that the newly-introduced hot_standby could
introduce due to the few WAL records it adds. But it has been a couple
of releases since there have been no such bugs, no?
-- 
Michael

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

25 October 2013, 11:09:00

On 2013-10-24 13:51:52 -0700, Josh Berkus wrote:
> On 10/24/2013 01:14 PM, Heikki Linnakangas wrote:
> > One extra WAL record whenever a hint bit is set on a page, for the first
> > time after a checkpoint. In other words, a WAL record needs to be
> > written in the same circumstances as with page checksums, but the WAL
> > records are much smaller as they don't need to contain a full page
> > image, just the block number of the changed block.
> > 
> > Or maybe we'll write the full page image after all, like with page
> > checksums, just without calculating the checksums. It might be tricky to
> > skip the full-page image, because then a subsequent change of the page
> > (which isn't just a hint-bit update) needs to somehow know it needs to
> > take a full page image even though a WAL record for it was already written.
> 
> I think it would be worth estimating what this actually looks like in
> terms of log write quantity.  My inclication is to say that if it
> increases log writes less than 10%, we don't need to provide an option
> to turn it off.

It entirely depends on your workload. If it happens to be something
like:
INSERT INTO table (lots_of_data);
CHECKPOINT;
SELECT * FROM TABLE;

i.e. there's a checkpoint between loading the data and reading it - not
exactly all that uncommon - we'll need to log something for every
page. That can be rather noticeable. Especially as I think it will be
rather hard to log anything but a real FPI.

I really don't think everyone will want this. I am absolutely not
against providing an option to log enough information to make pg_rewind
work, but I think providing a command to do *safe* *planned* failover
will help in many more.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

25 October 2013, 11:29:26

On 2013-10-24 22:57:29 +0200, Magnus Hagander wrote:
> In fact I've been considering suggesting we might want to retire the
> difference between archive and hot_standby as wal_level, because the
> difference is usually so small. And the advantage of hot_standby is in
> almost every case worth it. Even in the archive recovery mode, being
> able to do pause_at_recovery_target is extremely useful. And as you
> say in (c) above, many users don't realize that until it's too late.

+1.

On 2013-10-25 15:16:30 +0900, Michael Paquier wrote:
> But it has been a couple of releases since there have been no such
> bugs, no?

One 'no' too much? Anyway, I think there have been more recent ones, but
it's infrequent enough that we can remove the level anyway.

FWIW, I've wondered if we shouldn't remove most of the EnableHotStandby
checks in xlog.c. There are way too many difference how StartupXLOG
behaves depending on HS.
E.g. I quite dislike that we do stuff like StartupCLOG at entirely
different times during recovery.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Sawada Masahiko

Date:

02 November 2013, 15:25:57

On Fri, Oct 25, 2013 at 8:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-10-24 13:51:52 -0700, Josh Berkus wrote:
>
> It entirely depends on your workload. If it happens to be something
> like:
> INSERT INTO table (lots_of_data);
> CHECKPOINT;
> SELECT * FROM TABLE;
>
> i.e. there's a checkpoint between loading the data and reading it - not
> exactly all that uncommon - we'll need to log something for every
> page. That can be rather noticeable. Especially as I think it will be
> rather hard to log anything but a real FPI.
>
> I really don't think everyone will want this. I am absolutely not
> against providing an option to log enough information to make pg_rewind
> work, but I think providing a command to do *safe* *planned* failover
> will help in many more.
>

I think it is better providing as option to log enough information
such as new wal_level.
If user doesn't realize until it's too late, such information is
contained in checkpoint record?
For example if checkpoint record contained information of wal_level
then we can inform to user
using by such information.

BTW this information is useful only for pg_rewind?
Is there for anything else?
(Sorry it might has already been discussed..)

Regards,

-------
Sawada Masahiko

Re: Patch for fail-back without fresh backup

From

Bruce Momjian

Date:

12 November 2013, 16:02:01

On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
> On 24.10.2013 23:07, Josh Berkus wrote:
> >On 10/24/2013 11:12 AM, Heikki Linnakangas wrote:
> >>On 24.10.2013 20:39, Josh Berkus wrote:
> >>>On 10/24/2013 04:15 AM, Pavan Deolasee wrote:
> >>>>If we do what you are suggesting, it seems like a single line patch
> >>>>to me.
> >>>>In XLogSaveBufferForHint(), we probably need to look at this
> >>>>additional GUC
> >>>>to decide whether or not to backup the block.
> >>>
> >>>Wait, what?  Why are we having an additional GUC?
> >>>
> >>>I'm opposed to the idea of having a GUC to enable failback.  When would
> >>>anyone using replication ever want to disable that?
> >>
> >>For example, if you're not replicating for high availability purposes,
> >>but to keep a reporting standby up-to-date.
> >
> >What kind of overhead are we talking about here?  You probably said, but
> >I've had a mail client meltdown and lost a lot of my -hackers emails.
> 
> One extra WAL record whenever a hint bit is set on a page, for the
> first time after a checkpoint. In other words, a WAL record needs to
> be written in the same circumstances as with page checksums, but the
> WAL records are much smaller as they don't need to contain a full
> page image, just the block number of the changed block.
> 
> Or maybe we'll write the full page image after all, like with page
> checksums, just without calculating the checksums. It might be
> tricky to skip the full-page image, because then a subsequent change
> of the page (which isn't just a hint-bit update) needs to somehow
> know it needs to take a full page image even though a WAL record for
> it was already written.

Sorry to be replying late to this, but while I am not worried about the
additional WAL volume, does this change require the transaction to now
wait for a WAL sync to disk before continuing?  I thought that was the
down-side to WAL logging hint bits, not the WAL volume itself.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Patch for fail-back without fresh backup

From

Alvaro Herrera

Date:

21 November 2013, 20:31:50

Bruce Momjian escribió:
> On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
> > On 24.10.2013 23:07, Josh Berkus wrote:

> > >What kind of overhead are we talking about here?
> > 
> > One extra WAL record whenever a hint bit is set on a page, for the
> > first time after a checkpoint. In other words, a WAL record needs to
> > be written in the same circumstances as with page checksums, but the
> > WAL records are much smaller as they don't need to contain a full
> > page image, just the block number of the changed block.
> > 
> > Or maybe we'll write the full page image after all, like with page
> > checksums, just without calculating the checksums. It might be
> > tricky to skip the full-page image, because then a subsequent change
> > of the page (which isn't just a hint-bit update) needs to somehow
> > know it needs to take a full page image even though a WAL record for
> > it was already written.
> 
> Sorry to be replying late to this, but while I am not worried about the
> additional WAL volume, does this change require the transaction to now
> wait for a WAL sync to disk before continuing?

I don't think so.  There's extra WAL written, but there's no
flush-and-wait until end of transaction (as has always been).

> I thought that was the down-side to WAL logging hint bits, not the WAL
> volume itself.

I don't think this is true either.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Patch for fail-back without fresh backup

From

Jeff Janes

Date:

21 November 2013, 22:40:41

On Thu, Nov 21, 2013 at 12:31 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Bruce Momjian escribió:
> On Thu, Oct 24, 2013 at 11:14:14PM +0300, Heikki Linnakangas wrote:
> > On 24.10.2013 23:07, Josh Berkus wrote:

> > >What kind of overhead are we talking about here?
> >
> > One extra WAL record whenever a hint bit is set on a page, for the
> > first time after a checkpoint. In other words, a WAL record needs to
> > be written in the same circumstances as with page checksums, but the
> > WAL records are much smaller as they don't need to contain a full
> > page image, just the block number of the changed block.
> >
> > Or maybe we'll write the full page image after all, like with page
> > checksums, just without calculating the checksums. It might be
> > tricky to skip the full-page image, because then a subsequent change
> > of the page (which isn't just a hint-bit update) needs to somehow
> > know it needs to take a full page image even though a WAL record for
> > it was already written.
>
> Sorry to be replying late to this, but while I am not worried about the
> additional WAL volume, does this change require the transaction to now
> wait for a WAL sync to disk before continuing?

I don't think so. There's extra WAL written, but there's no
flush-and-wait until end of transaction (as has always been).

But if the transaction would not have otherwise generated WAL (i.e. a select that did not have to do any HOT pruning, or an update with zero rows matching the where condition), doesn't it now have to flush and wait when it would otherwise not?

Cheers,

Jeff

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

21 November 2013, 22:43:40

On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
> But if the transaction would not have otherwise generated WAL (i.e. a
> select that did not have to do any HOT pruning, or an update with zero rows
> matching the where condition), doesn't it now have to flush and wait when
> it would otherwise not?

We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Bruce Momjian

Date:

22 November 2013, 22:46:26

On Thu, Nov 21, 2013 at 11:43:34PM +0100, Andres Freund wrote:
> On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
> > But if the transaction would not have otherwise generated WAL (i.e. a
> > select that did not have to do any HOT pruning, or an update with zero rows
> > matching the where condition), doesn't it now have to flush and wait when
> > it would otherwise not?
> 
> We short circuit that if there's no xid assigned. Check
> RecordTransactionCommit().

OK, that was my question, now answered.  Thanks.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +

Re: Patch for fail-back without fresh backup

From

Jeff Janes

Date:

16 January 2014, 17:26:01

On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
> But if the transaction would not have otherwise generated WAL (i.e. a
> select that did not have to do any HOT pruning, or an update with zero rows
> matching the where condition), doesn't it now have to flush and wait when
> it would otherwise not?

We short circuit that if there's no xid assigned. Check
RecordTransactionCommit().

It looks like that only short-circuits the flush if both there is no xid assigned, and !wrote_xlog. (line 1054 of xact.c)

I do see stalls on fdatasync on flush from select statements which had no xid, but did generate xlog due to HOT pruning, I don't see why WAL logging hint bits would be different.

Cheers,

Jeff

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

16 January 2014, 17:38:01

On 2014-01-16 09:25:51 -0800, Jeff Janes wrote:
> On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com>wrote:
> 
> > On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
> > > But if the transaction would not have otherwise generated WAL (i.e. a
> > > select that did not have to do any HOT pruning, or an update with zero
> > rows
> > > matching the where condition), doesn't it now have to flush and wait when
> > > it would otherwise not?
> >
> > We short circuit that if there's no xid assigned. Check
> > RecordTransactionCommit().
> >
> 
> It looks like that only short-circuits the flush if both there is no xid
> assigned, and !wrote_xlog.  (line 1054 of xact.c)

Hm. Indeed. Why don't we just always use the async commit behaviour for
that? I don't really see any significant dangers from doing so?

It's also rather odd to use the sync rep mechanisms in such
scenarios... The if() really should test markXidCommitted instead of
wrote_xlog.

> I do see stalls on fdatasync on flush from select statements which had no
> xid, but did generate xlog due to HOT pruning, I don't see why WAL logging
> hint bits would be different.

Are the stalls at commit or while the select is running? If wal_buffers
is filled too fast, which can easily happen if loads of pages are hinted
and wal logged, that will happen independently from
RecordTransactionCommit().

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Jeff Janes

Date:

16 January 2014, 19:01:35

On Thu, Jan 16, 2014 at 9:37 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2014-01-16 09:25:51 -0800, Jeff Janes wrote:
> On Thu, Nov 21, 2013 at 2:43 PM, Andres Freund <andres@2ndquadrant.com>wrote:
>
> > On 2013-11-21 14:40:36 -0800, Jeff Janes wrote:
> > > But if the transaction would not have otherwise generated WAL (i.e. a
> > > select that did not have to do any HOT pruning, or an update with zero
> > rows
> > > matching the where condition), doesn't it now have to flush and wait when
> > > it would otherwise not?
> >
> > We short circuit that if there's no xid assigned. Check
> > RecordTransactionCommit().
> >
>
> It looks like that only short-circuits the flush if both there is no xid
> assigned, and !wrote_xlog. (line 1054 of xact.c)

Hm. Indeed. Why don't we just always use the async commit behaviour for
that? I don't really see any significant dangers from doing so?

I think the argument is that drawing the next value from a sequence can generate xlog that needs to be flushed, but doesn't assign an xid.

I would think the sequence should flush that record before it hands out the value, not before the commit, but...

It's also rather odd to use the sync rep mechanisms in such
scenarios... The if() really should test markXidCommitted instead of
wrote_xlog.

> I do see stalls on fdatasync on flush from select statements which had no
> xid, but did generate xlog due to HOT pruning, I don't see why WAL logging
> hint bits would be different.

Are the stalls at commit or while the select is running? If wal_buffers
is filled too fast, which can easily happen if loads of pages are hinted
and wal logged, that will happen independently from
RecordTransactionCommit().

In the real world, I'm not sure what the distribution is.

But in my present test case, they are coming almost exclusively from RecordTransactionCommit.

I use "pgbench -T10" in a loop to generate dirty data and checkpoints (with synchronous_commit on but with a BBU), and then to probe the consequences I use:

pgbench -T10 -S -n --startup='set synchronous_commit='$f

(where --startup is an extension to pgbench proposed a few months ago)

Running the select-only query with synchronous_commit off almost completely isolates it from the checkpoint drama that otherwise has a massive effect on it. with synchronous_commit=on, it goes from 6000 tps normally to 30 tps during the checkpoint sync, with synchronous_commit=off it might dip to 4000 or so during the worst of it.

(To be clear, this is about the pruning, not the logging of the hint bits)

Cheers,

Jeff

Re: Patch for fail-back without fresh backup

From

Andres Freund

Date:

16 January 2014, 19:05:40

On 2014-01-16 11:01:29 -0800, Jeff Janes wrote:
> I think the argument is that drawing the next value from a sequence can
> generate xlog that needs to be flushed, but doesn't assign an xid.

Then that should assign an xid. Which would yield correct behaviour with
async commit where it's currently *not* causing a WAL flush at all
unless a page boundary is crossed.

I've tried arguing that way before...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services

Re: Patch for fail-back without fresh backup

From

Tom Lane

Date:

16 January 2014, 19:16:12

Jeff Janes <jeff.janes@gmail.com> writes:
> I think the argument is that drawing the next value from a sequence can
> generate xlog that needs to be flushed, but doesn't assign an xid.

> I would think the sequence should flush that record before it hands out the
> value, not before the commit, but...

IIRC the argument was that we'd flush WAL before any use of the value
could make it to disk.  Which is true if you're just inserting it into
a table; perhaps less so if the client is doing something external to
the database with it.  (But it'd be reasonable to say that clients
who want a guaranteed-good serial for such purposes should have to
commit the transaction that created the value.)
        regards, tom lane